H.264 hardware decoding in Mac OS X

ArsTechnica recently asked Adobe if we would use the newly added video acceleration API in Mac OS X 10.6.3. The answer was yes and today we are making it available as a beta version (it was also a feature included in the release candidate builds we have been making available on labs.adobe.com since April 7th 2010, though not turned on; more details at the end).

Background

The primary reason this API exists is because we have been working with Apple to come up with a way to reduce power consumption on Macs. As we see more and more HD content on the web it is critical that we take advantage of hardware resources when available. In that context it is important to understand that this API targets HD content, not SD or smaller sized video. In fact SD sized content will not be accelerated in most cases. The decision of what content is accelerated and on which machine it is supported is up to Apple.

Machine and OS support

The new video acceleration API is only available in Mac OS X 10.6.3 or later and is limited to GPU models such as NVIDIA GeForce 9400M, GeForce 320M or GeForce GT 330M. For more details you can look at Apple’s technote. Here is a list of the Mac models currently supported:

  • MacBooks shipped after January 21st, 2009
  • Mac Minis shipped after March 3rd, 2009
  • MacBook Pros shipped after October 14th, 2008
  • iMacs which shipped after the first quarter of 2009

(Mac Pros are not supported as of today)

How do I know that it is working?

After you install the new beta version of Adobe Flash Player (code named “Gala”) and play a video you will sometimes notice a white rectangle overlaying the video. This is the sign that hardware decoding is currently active.

If the white rectangle is missing, Adobe Flash Player has reverted to software decoding. We will of course remove the white rectangle for the final release.

What are the limitations right now?

  • Some resolutions are not supported. Specifically YouTube does sometimes provide a resolution of 864 * 480 pixels for their 480p content which forces a software fallback.
  • Resolutions smaller than 480 * 320 pixels are not accelerated on NVIDIA GeForce 9400M based Macs. On NVIDIA GeForce 320M and GeForce GT 330M the threshold can be a bit higher. These thresholds are picked by Apple and balance power usage of the CPU vs. GPU for their particular hardware. Remember that using the GPU for video decoding does not always result in overall power savings. This is something you can only decide based on the exact hardware combination and the content you are trying to play. Playing video has a fixed baseline cost in the GPU, for instance, which is not the case when you decode on a CPU.
  • The software decoder in Adobe Flash Player is more forgiving when it comes to improperly encoded video files; it works around many issues. The hardware decoder cannot handle some of these cases. You might notice that some videos will have ‘jumpy’ frames, i.e. frames are out of order (we have seen that with some files uploaded to YouTube). This is usually because Composition Time Offsets are not properly set up.
  • The hardware decoder is limited to 2 instances at a time. This limit applies to the whole system. If you have more than 2 videos open at the same time the 3rd one will fall back to software decoding. This is even the case when a video is on a hidden tab (this is another reason that hardware decoding is reserved for high resolutions).
  • In the current release of Mac OS X 10.6.3 hardware accelerated decoding will sometimes stop functioning until you restart Safari. We are working with Apple to resolve this issue. If you can reproduce this consistently with a specific URL please let us know.
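
The limitations above can be read as a small decision procedure. The following Python sketch is only an illustration of the listed rules, not how Apple's API or Flash Player actually decides; the function name, the constants and the way the 480 * 320 threshold is compared (width and height separately) are all assumptions.

```python
# Illustrative assumptions only: the real decision is made inside Apple's API.
MAX_HW_DECODERS = 2                # system-wide limit from the list above
MIN_WIDTH, MIN_HEIGHT = 480, 320   # GeForce 9400M threshold; other GPUs may be higher
UNSUPPORTED_SIZES = {(864, 480)}   # the YouTube 480p variant mentioned above


def choose_decoder(width: int, height: int, active_hw_decoders: int) -> str:
    """Return "hardware" or "software" for a new H.264 stream (hypothetical helper)."""
    if (width, height) in UNSUPPORTED_SIZES:
        return "software"          # unsupported resolution forces a fallback
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return "software"          # small video is not accelerated
    if active_hw_decoders >= MAX_HW_DECODERS:
        return "software"          # a 3rd stream falls back to software
    return "hardware"


print(choose_decoder(1280, 720, 0))
print(choose_decoder(1280, 720, 2))
print(choose_decoder(864, 480, 0))
```

Note that this sketch does not model the fallback for improperly encoded files, which can only be detected while decoding.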

Safari and Performance

Compared to QuickTime based video playback support in Safari 4.0.x on Mac OS X 10.6.3 (or your standalone VLC/QuickTime player that is) there is still room for improvement in Flash Player. We have a good plan on how to proceed, which will allow us to leverage all the hardware resources available to us.

Video playback is generally hardware accelerated on two levels: 1. Decoding H.264 bit streams itself and 2. Displaying & scaling the decoded YUV12 formatted video frames. The new API provided by Apple only covers H.264 decoding and we are well aware that we need to accelerate the display and scaling of video. CAOpenGLLayer is the vehicle for that. We are looking at how we can get this implemented soon, but it’s simply too late to include this into Flash Player 10.1.

Previous release candidates

As some have noticed, previous release candidates we have made available on labs.adobe.com referenced this hardware decoding API provided by Apple. We are not in a position yet to enable this by default (hence the extra beta version we are making available) as this has only seen very limited testing by the engineers. Because of some of the issues I mentioned above, we want to put the hardware acceleration functionality through a full public beta cycle before including it in a final shipping version of Flash Player.

If you decide to install this beta version please let us know if you encounter any issues and file bugs here.

Press any key to continue

The recently released Flash Player 10.1 rc contains a couple of enhancements which are worth a quick note.

Screen savers and video playback

An annoying behavior in older Flash Player versions is the fact that passively consumed content, video specifically, does not prevent the screen saver from kicking in. After some conversation here internally we think we finally have a good answer how to solve this problem. If all of the following conditions are true the screen saver is prevented from kicking in even if you are not in full screen mode:

1. Video is playing
2. Video is not paused or stopped
3. Video is not buffering
4. Sound is playing
5. Sound has a volume (this makes sure that silent ads do not cause harm)
6. The SWF is currently visible (with some caveats on platforms and browsers, see next paragraph)

So no more tapping the keyboard while you are watching a video!
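
The six conditions above amount to a single boolean predicate. Here is a minimal Python sketch of that predicate; the class and field names are hypothetical and only mirror the list, they are not Flash Player internals.

```python
from dataclasses import dataclass


@dataclass
class PlaybackState:
    # Hypothetical field names; each one mirrors a condition in the list above.
    video_playing: bool
    paused_or_stopped: bool
    buffering: bool
    sound_playing: bool
    volume: float
    swf_visible: bool


def should_block_screen_saver(s: PlaybackState) -> bool:
    """True only when all six conditions hold at the same time."""
    return (s.video_playing
            and not s.paused_or_stopped
            and not s.buffering
            and s.sound_playing
            and s.volume > 0.0      # silent ads must not keep the screen awake
            and s.swf_visible)


watching = PlaybackState(True, False, False, True, 0.8, True)
silent_ad = PlaybackState(True, False, False, True, 0.0, True)
print(should_block_screen_saver(watching), should_block_screen_saver(silent_ad))
```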

Determining the visibility of SWFs

In my previous post I mentioned that we now throttle the player whenever a SWF is not visible. Well, I wish we could make this work consistently. As of today we cannot always determine if our instance is visible. There is no standard way of doing this; every browser works slightly differently.

Here is the current status:

Browser         Platform  Hidden tab  Scrolled out of view
IE 7/8          Win       YES         YES
Firefox 3.6     Win       YES         NO
Opera 10.1      Win       NO          NO
Safari 4.0.5    Mac       YES         YES
WebKit nightly  Mac       YES         YES
Firefox 3.6     Mac       NO          NO
Firefox 3.7     Mac       NO          YES
Firefox 3.x     Linux     NO          NO

Hidden tab: we know if our SWF instance is on a hidden tab.
Scrolled out of view: we know if our SWF instance is scrolled out of view.

Each time you see a NO we can not throttle SWFs to use less CPU resources. These limitations are dictated by the browser so it will take some time to sort this out.

If you have Flash Player 10.1 rc installed, here is something simple to try out: Go to this page (skip the ad): http://www.nickjr.com/kids-games/ants-adventure.html. Now either put the page on a hidden tab or scroll to the bottom of the page so the SWFs are not visible. In IE on Windows and WebKit nightly on the Mac you should see that the CPU usage drops significantly after a couple of seconds because we throttle the SWFs.

I believe this problem can be easily fixed in Firefox Mac and most of the other browsers going forward. On Linux however this will be much more tricky because of GTK (the framework we have to use). We will probably need a special NPAPI extension and lots of browser changes to make this possible.

Timing it right

Status quo

During the Flash Player 10.1 time frame, I was tasked with taking a look at the timing system we use in the Flash Player. Until now the Flash Player has been using a poll based system. Poll based means that everything which happens in the player is served from a single thread and entry point using a periodic timer which polls the run-time. In pseudo code the top level function in the Flash Player looked like this:

while ( sleep ( 1000/120 milliseconds ) ) {
  // Every browser provides a different timer interval
  ...
  if ( timerPending ) { // AS2 Intervals, AS3 Timers
    handleTimers();
  }
  if ( localConnectionPending ) {
    handleLocalConnection();
  }
  if ( videoFrameDue ) {
    decodeVideoFrame();
  }
  if ( audioBufferEmpty ) {
    refillAudioBuffer();
  }
  if ( nextSWFFrameDue ) {
    parseSWFFrame();
    if ( actionScriptInSWFFrame ) {
      executeActionScript();
    }
  }
  if ( needsToUpdateScreen ) {
    updateScreen();
  }
  ...
}

The periodic timer is not driven by the Flash Player, it is driven by the browser. In the case of Internet Explorer there is an API for this purpose. In the case of Safari on OS X it is hard coded to 50 frames/sec. Every browser implements this slightly differently and things become very complex quickly once you go into details. This has been causing a lot of frustration among designers who could never count on consistent cross platform behavior.

Another challenging issue with this approach has been that limiting the periodic timer to the SWF frame rate is not acceptable. The problem becomes more obvious when you think of a SWF with a frame rate of let’s say 8 and play back a video inside which runs at 30 frames/sec. To get good video playback you really need to drive the periodic timer at a very high frequency to get good playback otherwise video frames will appear late. In the end the Flash Player always used the highest frequency available on a particular platform and/or browser environment.
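
To make the lateness argument concrete, here is a small Python sketch that computes how late a video frame can appear when it is only shown on the next tick of a periodic timer. It is a simplified model under stated assumptions: the timer and the video both start at time zero, ticks are perfectly regular, and the function name is made up for this illustration.

```python
import math
from fractions import Fraction


def worst_lateness(poll_hz: int, video_fps: int, seconds: int = 1) -> Fraction:
    """Worst-case delay (in seconds) between a frame's due time and the next timer tick."""
    worst = Fraction(0)
    for m in range(1, video_fps * seconds + 1):
        due = Fraction(m, video_fps)                        # when frame m should appear
        tick = Fraction(math.ceil(due * poll_hz), poll_hz)  # first tick at or after it
        worst = max(worst, tick - due)
    return worst


# Timer limited to a SWF frame rate of 8 versus a 120 Hz timer, for 30 frames/sec video.
print(float(worst_lateness(8, 30)) * 1000, "ms")
print(float(worst_lateness(120, 30)) * 1000, "ms")
```

In this idealized model a timer running at the SWF rate of 8 delays some video frames by more than a tenth of a second, which is why the timer had to run much faster than the SWF frame rate.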

The wrong path

The obvious way to re-architect this is to get rid of the polling and instead design an event based system. The new player code would have looked like this, with different subclasses of an Event base class encapsulating what the polling code had done before:

Event e;
while ( e=waitForNextEvent() ) {
  e.dispatch();
}

This approach failed miserably:

  • CPU usage turned out to be much higher than expected due to the abstraction involved.
  • In some cases the queue would grow unbounded.
  • The queue needed a prioritization scheme which turned out to be almost impossible to tune properly.
  • Most SWF content out there depends on a certain sequence logic. Out of order events broke the majority of the SWFs out there.

It’s not all bad

Back to the drawing board. This time my focus was on the actual problem: The Flash Player polls up to 120 times a second even if nothing is happening. Modifying the original code slightly I came up with this:

while ( sleepuntil( nextEventTime  ) OR externalEventOccured() ) {
  ...
  if ( timerPending ) { // AS2 Intervals, AS3 Timers
    handleTimers();
    nextEventTime = nextTimerTime();
  }
  if ( localConnectionPending ) {
    handleLocalConnection();
    nextEventTime = min(nextEventTime , nextLocalConnectionTime());
  }
  if ( videoFrameDue ) {
    decodeVideoFrame();
    nextEventTime = min(nextEventTime , nextVideoFrameTime());
  }
  if ( audioBufferEmpty ) {
    refillAudioBuffer();
    nextEventTime = min(nextEventTime , nextAudioRebufferTime());
  }
  if ( nextSWFFrameDue ) {
    parseSWFFrame();
    if ( actionScriptInSWFFrame ) {
      executeActionScript();
    }
    nextEventTime = min(nextEventTime , nextFrameTime());
  }
  if ( needsToUpdateScreen ) {
    updateScreen();
  }
  ...
}

This approach is solving several problems:

  • There is no abstraction overhead.
  • In most cases it reduces the polling frequency to a fraction.
  • It is fairly backwards compatible.
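
To illustrate the second point, the following Python sketch counts wake-ups per second for fixed-rate polling versus sleeping until the next deadline. It is a toy model, not the player's scheduler: it assumes every subsystem has perfectly regular deadlines starting at time zero, and both function names are hypothetical.

```python
from fractions import Fraction


def polling_wakeups(poll_hz: int, seconds: int = 1) -> int:
    """A fixed-rate poller wakes up on every tick, whether or not work is due."""
    return poll_hz * seconds


def deadline_wakeups(rates_hz, seconds: int = 1) -> int:
    """Sleeping until the next deadline wakes up once per distinct deadline."""
    deadlines = set()
    for hz in rates_hz:
        for k in range(1, hz * seconds + 1):
            deadlines.add(Fraction(k, hz))   # exact times, so coinciding deadlines merge
    return len(deadlines)


# A SWF running at 8 frames/sec that plays a 30 frames/sec video.
print(polling_wakeups(120), deadline_wakeups([8, 30]))
```

Deadlines that coincide are served by a single wake-up, which is what taking the minimum over all pending event times achieves in the pseudocode above.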

More importantly, I replaced the browser timer with a cross platform timer which can wait for a particular time code. This not only yields better cross platform behavior, it also allows us to tune it in a way I could not do before. Which leads me to the most important change you will see in Flash Player 10.1: The way we behave when a SWF is not visible.

Implications for user experience

In Flash Player 10.1 SWFs on hidden tabs are limited resource wise. Whereas they would run at full speed in Flash Player 10.0 and before (note though that we NEVER rendered, we only continued to run ActionScript, audio decoding and video decoding), we now throttle the Flash Player when a SWF instance is not visible. Making this change was not easy as I had to add many exceptions to avoid breaking old content. Here is a list of some of the new rules:

Visible:

  • SWF frame rates are limited and aligned to jiffies, i.e. 60 frames a second. (Note that Flash Player 10.1 Beta 3 still has an upper limit of 120 which will be changed before the final release)
  • timers (AS2 Interval and AS3 Timers) are limited and aligned to jiffies.
  • local connections are limited and aligned to jiffies. That means a full round trip from one SWF to another will take at least 33 milliseconds. Some reports we get say it can be up to 40ms.
  • video is NOT aligned to jiffies and can play at any frame rate. This increases video playback fidelity.
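
As a rough illustration of what "aligned to jiffies" means, the Python sketch below rounds a requested time up to the next 1/60 second boundary. This is an assumption about the behavior, not the player's code, and the 33 millisecond figure for local connections is read here as two jiffies, one for each direction of the round trip.

```python
import math
from fractions import Fraction

JIFFY_MS = Fraction(1000, 60)   # one jiffy at 60 per second, about 16.67 ms


def align_to_jiffy(t_ms) -> Fraction:
    """Round a time in milliseconds up to the next jiffy boundary (hypothetical helper)."""
    return math.ceil(Fraction(t_ms) / JIFFY_MS) * JIFFY_MS


# A 10 ms timer cannot fire before the first jiffy; a 20 ms timer waits for the second.
print(float(align_to_jiffy(10)), float(align_to_jiffy(20)))

# Assumed reading of the round-trip figure: one jiffy per direction.
print(float(2 * JIFFY_MS))
```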

Invisible:

  • SWF frame rate is clocked down to 2 frames/sec. No rendering occurs unless the SWF becomes visible again.
  • timers (AS2 Interval and AS3 Timers) are clocked down to 2 a second.
  • local connections are clocked down to 2 a second.
  • video is decoded (not rendered or displayed) using idle CPU time only.
  • For backwards compatibility reasons we override the 2 frames/sec frame rate to 8 frames/sec when audio is playing.
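
The frame rate rules from both lists can be summarized in one function. This Python sketch is an interpretation of the rules above; in particular it assumes that a SWF authored at a rate below the cap is never sped up, which the lists do not state explicitly.

```python
def effective_frame_rate(requested_fps: float, visible: bool, audio_playing: bool) -> float:
    """Frame rate a SWF would run at under the rules listed above (illustrative only)."""
    if visible:
        return min(requested_fps, 60.0)       # limited to one frame per jiffy
    cap = 8.0 if audio_playing else 2.0       # audio raises the hidden cap from 2 to 8
    return min(requested_fps, cap)            # assumption: slower content is not sped up


print(effective_frame_rate(30, visible=True, audio_playing=False))
print(effective_frame_rate(30, visible=False, audio_playing=False))
print(effective_frame_rate(30, visible=False, audio_playing=True))
```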

This marks a pretty dramatic change from previous Flash Player releases. It’s one of those changes which are painful for designers and developers but are unavoidable for a better user experience. Let me show you a CPU usage comparison with content running on a hidden tab (test URL was this CPU intensive SWF):

Flash Player 10.0

Flash Player 10.1

In this test case the frame rate in the background tab has been reduced to 8 frames/sec as audio effects are playing. If there was no audio the improvement would be even more pronounced. The test machine was an Acer Aspire Revo AR1600.

PS: You’ll notice in the two screen shots that the memory usage shows a quite dramatic difference also. That’s for another blog post.

Core Animation

Amazing, so many things have happened in the Flash Player engineering team over the past year. Lots I would love to talk about. But the purpose of this post is to deep dive into a subject Kevin Lynch touched upon recently, specifically Mac performance and his comment about Core Animation. Whenever performance is mentioned in the context of Flash it gathers a lot of the attention and some of the technical background is lost in the PR.

So what’s the deal with Core Animation in Flash Player 10.1? Let’s look at how Apple’s documentation summarizes what Core Animation does:

Core Animation is an Objective-C framework that combines a high-performance compositing engine with a simple to use animation programming interface.

Sounds like a perfect match for Flash, does it not? So yes, Flash Player 10.1 is attempting to leverage this framework to work around a few specific technical issues we’ve had in Safari and all other browsers on OS X.

The drawing model jungle on OS X

Before going into more specifics of why we are going towards Core Animation let’s get an overview of how plugins on OS X draw into the browser window. There are 4 possible ways (compared to one on Windows):

  1. QuickDraw. Default mode used by Opera, older Firefox and Safari versions.
  2. Quartz 2D (a.k.a. Core Graphics). Supported by newer versions of Firefox and Safari.
  3. OpenGL. No browser I know of supports this properly today.
  4. Core Animation. Only available in Safari 4 + OS X 10.6 right now, with caveats in the current version.

In addition to these drawing models designers can embed Flash content in 3 different ways by specifying wmode:

  1. Normal
  2. Opaque
  3. Transparent

Normal means that you can’t have overlapping HTML sitting on top of your SWF, Opaque allows it and Transparent means that the SWF is transparent and underlying HTML content will show through. Taking all these variables into account we come up with these tables, which show when a particular drawing model is used (and are subject to change before we release Flash Player 10.1):

Flash Player 10.0:

wmode        Safari 4   Firefox 3  Opera 10
Normal       Quartz 2D  QuickDraw  QuickDraw
Opaque       Quartz 2D  QuickDraw  QuickDraw
Transparent  Quartz 2D  QuickDraw  QuickDraw

Flash Player 10.1:

wmode        Safari 4 (*)    Firefox 3  Opera 10
Normal       Core Animation  Quartz 2D  QuickDraw
Opaque       Quartz 2D(**)   Quartz 2D  QuickDraw
Transparent  Quartz 2D(**)   Quartz 2D  QuickDraw

(*) Actually using nightly builds of WebKit because support for Core Animation is work in progress.
(**) Core Animation is used when the SWF is the front most object on the HTML page.
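
For readers who prefer code to tables, the two tables and the (**) footnote can be expressed as a lookup. This Python sketch simply encodes the tables above; the function name and the `frontmost` flag are invented for the illustration, and like the tables it is subject to change.

```python
WMODES = ("Normal", "Opaque", "Transparent")

# (player version, browser, wmode) -> drawing model, copied from the two tables above.
DRAWING_MODEL = {}
for wmode in WMODES:
    DRAWING_MODEL[("10.0", "Safari 4", wmode)] = "Quartz 2D"
    DRAWING_MODEL[("10.0", "Firefox 3", wmode)] = "QuickDraw"
    DRAWING_MODEL[("10.0", "Opera 10", wmode)] = "QuickDraw"
    DRAWING_MODEL[("10.1", "Safari 4", wmode)] = "Quartz 2D"
    DRAWING_MODEL[("10.1", "Firefox 3", wmode)] = "Quartz 2D"
    DRAWING_MODEL[("10.1", "Opera 10", wmode)] = "QuickDraw"
DRAWING_MODEL[("10.1", "Safari 4", "Normal")] = "Core Animation"


def drawing_model(player: str, browser: str, wmode: str, frontmost: bool = False) -> str:
    """Look up the drawing model; `frontmost` models the (**) footnote."""
    if (player, browser) == ("10.1", "Safari 4") and wmode != "Normal" and frontmost:
        return "Core Animation"
    return DRAWING_MODEL[(player, browser, wmode)]


print(drawing_model("10.1", "Safari 4", "Opaque"))
print(drawing_model("10.1", "Safari 4", "Opaque", frontmost=True))
```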

What are the issues with Quartz 2D?

The basic premise of Quartz 2D as Apple describes it:

Quartz 2D is an advanced, two-dimensional drawing engine available for iPhone application development and to all Mac OS X application environments outside of the kernel. Quartz 2D provides low-level, lightweight 2D rendering with unmatched output fidelity regardless of display or printing device.

Quartz 2D is not designed for multimedia applications, like animation or video playback. That’s where OpenGL, Core Video and Core Animation shine. Safari’s use of Quartz 2D to draw HTML content makes perfect sense as its content is static in most cases. Everything works well until Flash comes into the picture. For instance when the Flash Player plays a SWF using the Quartz 2D drawing model it has to do so with the full involvement of the browser. The sequence of events looks like this (you can follow the stack traces in Shark):

  1. Whenever the Flash Player is ready to display a new frame, the Flash Player requests a refresh of its region using NPN_InvalidateRect.
  2. The browser adds the rectangle provided by the Flash Player to its dirty region.
  3. The browser traverses its own display list (the HTML DOM) and paints every node which is part of the dirty region.
  4. When the browser finds a node with a Flash Player instance it first draws the HTML background and then posts an event to the Flash Player to tell it that it has to paint over the requested region now.
  5. The Flash Player then finally draws its frame.

So far so good, makes sense I hope. So what’s the technical issue? Think of a fairly complex HTML page, for instance a page with a CSS gradient in the background. Add to that a SWF which runs at 30 frames/sec. You will see that a lot of time is spent in the browser, not in the Flash Player. This is where Core Animation kicks in: steps 3 and 4 pretty much go away (as long as the SWF is the top most object).
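
The cost of that sequence can be illustrated with a toy model. The Python sketch below counts how many HTML nodes a browser would repaint for each Flash frame; it is a deliberately simplified assumption (rectangular nodes, a flat node list, a single dirty rectangle) and does not reflect how Safari or WebKit are actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    def intersects(self, other: "Rect") -> bool:
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


def browser_paints_per_flash_frame(html_nodes, plugin_rect: Rect, core_animation: bool) -> int:
    """Count the HTML nodes the browser repaints for one Flash frame (toy model)."""
    if core_animation:
        return 0                     # the plugin layer is composited on its own
    dirty = plugin_rect              # step 2: the invalidated rectangle becomes dirty
    return sum(1 for node in html_nodes if node.intersects(dirty))   # step 3


page = [Rect(0, 0, 1000, 2000),      # gradient background
        Rect(100, 100, 600, 600),    # content column behind the SWF
        Rect(0, 1900, 1000, 100)]    # footer, far away from the SWF
swf = Rect(200, 200, 400, 300)

per_frame = browser_paints_per_flash_frame(page, swf, core_animation=False)
print(per_frame * 30, "browser node paints per second at 30 frames/sec")
print(browser_paints_per_flash_frame(page, swf, core_animation=True))
```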

Core Animation in the Flash Player

Flash Player 10.1 implements the Core Animation drawing model to fix this technical issue, among others. Instead of using a CGImageRef + CGContextDrawImage to get the bits to screen we pass a CAOpenGLLayer to Safari and use an OpenGL texture of type GL_TEXTURE_RECTANGLE_ARB to get our bits to the screen.

The support for the Core Animation drawing model was originally driven by Apple and we have worked feverishly to finish the engineering work on both sides. Yes that’s right: This was and is a joint effort between Apple and Adobe engineers. Given the now almost perfect integration of Core Animation plugins into Safari I hope that future versions of the Flash Player will take advantage of more capabilities of OpenGL. And that without the requirement of setting any special wmode. I am pretty stoked about it.

As of today (2/10/2010) we are getting closer to having it stable enough for public consumption. That means though: You will need Flash Player 10.1, OS X 10.6 and an updated version of Safari (or the nightly WebKit build), otherwise you will not see anything.

What difference does it really make?

This is by no means a panacea for all performance issues in the Flash Player. Far from it. But it is a small step towards a larger goal which is to improve the experience in the browser with the ever more complex web content out there. That said here is a comparison between Flash Player 10.0 and Flash Player 10.1 using this test case (this only works in Safari). Keep in mind that this is an extreme test case which has little to do with real world web content.

Flash Player 10.0 + nightly WebKit + OS X 10.6

Flash Player 10.1 + nightly WebKit + OS X 10.6

PS: You might have noticed that Core Animation is a Cocoa API. Yes, Flash Player 10.1 is a true Cocoa app now (with a Carbon fallback to support Firefox and Opera which are not Cocoa yet).

64-bits

Today we announced the availability of the Adobe Flash Player browser plugin for x86_64 Linux distributions. It is a preview release which has known bugs but should be fit for initial testing by the community.

With this release we are tackling one of the most requested features ever for the Linux version of the Flash Player, even before windowless mode support. My personal hope though is that the constant flood of complaints we get about this every day will finally come to an end.

I showed the 64-bit version of the Linux player to the public a couple of months ago, at the Flashforward 2008 conference, and I have spent time stabilizing it since then. Sadly we have received zero contributions to make the open source Tamarin VM work properly in 64-bit mode on Linux which would have possibly allowed us to get this done sooner. The good news for me personally is that I now have a thorough understanding of the different x86-64 calling conventions and the instruction set.

The areas which remain untested and non-functional are the following:

  • Camera
  • Microphone
  • Fullscreen playback using OpenGL
  • Various Flash Player 10 specific features

This initial version has no .rpm or .deb packages and is therefore intended for advanced users. We still have work to do to reduce the number of dependencies of the binary and handle situations where installing this plugin might conflict with packages like nspluginwrapper.

Talking about nspluginwrapper: I strongly suggest not using it. I know that some distros, including Ubuntu, are thinking of wrapping even 64-bit plugins with the thought that it will improve security and stability of the browser. This is a very bad idea in the state nspluginwrapper is in today. We have done some internal testing and discovered that several features in the Flash Player are broken when the plugin is wrapped. More importantly performance and user experience are pretty bad when the plugin is wrapped. Why? Lots of data needs to be transferred through IPC channels. I hope that browser vendors will eventually come up with a better architecture to wrap plugins without sacrificing performance, stability and functionality.

Please do report bugs or other issues you find using the right channel. And that means our public bug database:

https://bugs.adobe.com/flashplayer/

Comments on blogs, other web sites or 3rd party bug databases are not tracked by our quality assurance team. You are welcome to cross reference when you submit bugs though.

Now for some random items:

  • All SIMD optimizations which were in the 32-bit version have been ported. No exceptions.
  • The 64-bit version is not faster than the 32-bit version as the 32-bit version was already well optimized. Scott Byer explained why some time ago: 64 bits…when?. You will see a big difference though when you get rid of nspluginwrapper.
  • The first release of 64-bit Flash Player code was actually part of Adobe Photoshop Lightroom 64-bit in form of an authplay.dll which is a plugin for applications. Second one was with Adobe Photoshop CS4 64-bit, also as an authplay.dll.
  • The first 64-bit plugin for a browser we ship is this Linux version. Windows and Mac will come later.
  • The 64-bit version of the plugin compiles and runs on FreeBSD 7.0 which I demoed at Flashforward 2008. There are no plans for release yet as it is still rather unstable and will require substantial work to get it ready for public consumption.
  • A debugger version of the 64-bit version is not available yet. When we release it ActionScript 2 debugging will not work due to the obsolete protocol which depends on 32-bit pointers. ActionScript 3 debugging will be supported.

Audio mixing with Pixel Bender

Time to have some advanced fun with Pixel Bender. Recently someone in the community complained to us that mixing 13 mp3 tracks using the dynamic sound playback feature in Flash Player 10 does not really work. Well, true with the sample project he gave us. Doing dynamic sound playback is generally tricky to get right. I can provide a few tips though.

1. Pick the right mp3 encoding format

It’s important you pick a format which consumes the least amount of CPU time for decoding. Specifically you should always choose 44.1 kHz as the sample rate for your mp3 files. Why? The Flash Player will otherwise have to re-sample and filter your audio which takes away precious CPU cycles.

The tricky part here is that mp3 encoders, including Adobe Audition, usually pick the sample rate automatically. Especially at a bit rate of 64 kbit/s or less they will try to switch to 24 kHz or 22 kHz. You can override this at least for CBR in Audition using the advanced settings at export time if you need to.

2. Keep things simple

Do all your processing in one function if you can. Function calls are expensive generally. Try to read and write data only once. Ideally your mixing code should look something like this if you use pure ActionScript:

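// One scratch buffer and one Sound per track
// (the ByteArrays and Sounds still need to be created and loaded elsewhere)
var buffer:Vector.<ByteArray> = new Vector.<ByteArray>(NUM_TRACKS);
var sound:Vector.<Sound> = new Vector.<Sound>(NUM_TRACKS);

function onSoundData(sampleDataEvent:SampleDataEvent) : void
{
  // Decode BUFFER_SIZE stereo samples of every track into its buffer
  for (var i:int = 0; i <  NUM_TRACKS; i++) {
    buffer[i].position = 0;
    sound[i].extract(buffer[i], BUFFER_SIZE);
    buffer[i].position = 0;
  }

  // Sum all tracks float by float (left and right channels are interleaved)
  for (var j:int = 0; j < BUFFER_SIZE*2; j++)
  {
    var val:Number = 0;
    for (var k:int = 0; k < NUM_TRACKS; k++)
    {
      val += buffer[k].readFloat();
    }
    sampleDataEvent.data.writeFloat(val);
  }
}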

You will notice that you will spend a lot of time in this function. So…

3. Use Pixel Bender to mix sounds

I have talked to some who have tried to use Pixel Bender for audio processing. They had little success most of the time. Truth is, our tools are not ready yet. But with some patience and using the assembler for creating .pbj files I posted recently you can make it happen today.

One problematic issue is that right now the Pixel Bender toolkit is designed to handle image data. What does that mean? The toolkit limits you to float3 and float4 output types right now which is not really what you want. Now you might think you could just use float4. Not so. You will notice that Flash Player 10 has a pretty bad bug which makes it not work when float4 types are used for output. I am really angry we did not catch this sooner; hopefully we can address this bug sooner rather than later. What you are left with is using pure Pixel Bender assembly code for now which allows you to use a float2 output type.

For my experiment I took the Adobe Audition theme sample project and exported all tracks as separate .mp3 files, 15 in total. Incidentally that is also the maximum number of inputs you can use for a single shader. The goal was to mix all 15 tracks in real time using the dynamic sound playback feature.

Here is the Pixel Bender assembly code I used to create my .pbj file:

  version 1
  name "SoundMixer"
  kernel "namespace", "adobe"
  kernel "vendor", "Adobe Systems"
  kernel "version", 1
  kernel "description", "A generic sound mixer with volume control"

  parameter "_OutCoord", float2, f0.rg, in

  texture "track0", t0.rg
  texture "track1", t1.rg
  texture "track2", t2.rg
  texture "track3", t3.rg
  texture "track4", t4.rg
  texture "track5", t5.rg
  texture "track6", t6.rg
  texture "track7", t7.rg
  texture "track8", t8.rg
  texture "track9", t9.rg
  texture "track10", t10.rg
  texture "track11", t11.rg
  texture "track12", t12.rg
  texture "track13", t13.rg
  texture "track14", t14.rg

  parameter "volume0", float2, f3.rg, in
  meta "defaultValue", 1, 1
  parameter "volume1", float2, f4.rg, in
  meta "defaultValue", 1, 1
  parameter "volume2", float2, f5.rg, in
  meta "defaultValue", 1, 1
  parameter "volume3", float2, f6.rg, in
  meta "defaultValue", 1, 1
  parameter "volume4", float2, f7.rg, in
  meta "defaultValue", 1, 1
  parameter "volume5", float2, f8.rg, in
  meta "defaultValue", 1, 1
  parameter "volume6", float2, f9.rg, in
  meta "defaultValue", 1, 1
  parameter "volume7", float2, f10.rg, in
  meta "defaultValue", 1, 1
  parameter "volume8", float2, f11.rg, in
  meta "defaultValue", 1, 1
  parameter "volume9", float2, f12.rg, in
  meta "defaultValue", 1, 1
  parameter "volume10", float2, f13.rg, in
  meta "defaultValue", 1, 1
  parameter "volume11", float2, f14.rg, in
  meta "defaultValue", 1, 1
  parameter "volume12", float2, f15.rg, in
  meta "defaultValue", 1, 1
  parameter "volume13", float2, f16.rg, in
  meta "defaultValue", 1, 1
  parameter "volume14", float2, f17.rg, in
  meta "defaultValue", 1, 1

  parameter "output", float2, f1.rg, out

;----------------------------------------------------------

  texn f1.rg, f0.rg, t0  mul f1.rg, f3.rg  texn f2.rg, f0.rg, t1  mul f2.rg, f4.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t2  mul f2.rg, f5.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t3  mul f2.rg, f6.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t4  mul f2.rg, f7.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t5  mul f2.rg, f8.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t6  mul f2.rg, f9.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t7  mul f2.rg, f10.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t8  mul f2.rg, f11.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t9  mul f2.rg, f12.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t10  mul f2.rg, f13.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t11  mul f2.rg, f14.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t12  mul f2.rg, f15.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t13  mul f2.rg, f16.rg  add f1.rg, f2.rg  texn f2.rg, f0.rg, t14  mul f2.rg, f17.rg  add f1.rg, f2.rg

Looks complicated, but in fact this does nothing more than the above ActionScript code, with unrolled loops. As an extra you can control the volume on each track.
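
As a reference for what the kernel computes, here is a minimal Python sketch of the same arithmetic: every output value is the volume-weighted sum of the corresponding values of all tracks. It assumes all tracks have the same length and uses a single volume per track, whereas the kernel takes a float2 (one value per channel); the function name is made up for this illustration.

```python
def mix(tracks, volumes):
    """Mix interleaved stereo tracks: out[i] = sum over k of volumes[k] * tracks[k][i]."""
    if len(tracks) != len(volumes):
        raise ValueError("need exactly one volume per track")
    length = len(tracks[0]) if tracks else 0
    out = [0.0] * length
    for samples, vol in zip(tracks, volumes):   # corresponds to one texn/mul/add group
        for i in range(length):
            out[i] += vol * samples[i]
    return out


track_a = [0.5, -0.5, 0.25, 0.0]     # L, R, L, R
track_b = [0.25, 0.25, -0.25, 0.5]
print(mix([track_a, track_b], [1.0, 0.5]))
```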

To use the shader I wrote this little piece (note that this is incomplete code, it will not compile):

// Create shader
[Embed(source="mixer.pbj", mimeType="application/octet-stream")]
var MixerShader:Class;
var mixerShader:Shader = new Shader(new MixerShader());

// buffers will become shader inputs
var buffer:Vector.<ByteArray> = new Vector.<ByteArray>(15);
// volume controls the volume on each track, 1.0 is full volume
var volume:Vector.<Number> = new Vector.<Number>(15);

// initialize the shader inputs and volume values
for (var j:int = 0; j < 15; j++) {
  volume[j]=1.0;
  buffer[j] = new ByteArray();
  // set the length so the shader will always work even if we do not have enough tracks
  buffer[j].length = BUFFER_SIZE*4*2;
  mixerShader.data["track"+j]["width"] = 1024;
  mixerShader.data["track"+j]["height"] = BUFFER_SIZE/1024;
  mixerShader.data["track"+j]["input"] = buffer[j];
}

function onSoundData(e:SampleDataEvent) : void
{
  // extract the mp3 data into our shader inputs
  for (var i:int = 0; i < NUM_TRACKS; i++) {
    buffer[i].position = 0;
    sounds[i].extract(buffer[i], BUFFER_SIZE);
    buffer[i].position = 0;
  }
  // update the volume value in the shader
  for (var k:int = 0; k < NUM_TRACKS; k++)
  {
    mixerShader.data["volume"+k]["value"] = [ volume[k], volume[k] ];
  }
  // mix!
  var mixerJob:ShaderJob = new ShaderJob(mixerShader, e.data, 1024, BUFFER_SIZE/1024);
  mixerJob.start(true);
}

Compared to the pure AS3 version this runs twice as fast overall. On my Core 2 Mac mixing the 15 tracks consumes about 24% of one CPU. So if you are doing simple audio mixing like this Pixel Bender is a good choice. YMMV depending on what application we are talking about and how much processing you need to do on the audio.
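
The buffer and shader dimensions used in the ActionScript above are tied together by simple arithmetic: each float2 "pixel" holds one stereo sample, so a 1024 wide image with BUFFER_SIZE/1024 rows covers exactly BUFFER_SIZE samples. The Python sketch below spells this out; the helper name is hypothetical and the requirement that BUFFER_SIZE is a multiple of 1024 is inferred from the code, not stated in it.

```python
BYTES_PER_FLOAT = 4
CHANNELS = 2           # left and right, stored together as one float2 "pixel"
SHADER_WIDTH = 1024    # width used for every shader input in the code above


def shader_layout(buffer_size: int):
    """Return (width, height, byte_length) for a buffer of `buffer_size` stereo samples."""
    if buffer_size % SHADER_WIDTH != 0:
        raise ValueError("BUFFER_SIZE must be a multiple of 1024 for this layout")
    height = buffer_size // SHADER_WIDTH
    byte_length = buffer_size * BYTES_PER_FLOAT * CHANNELS   # BUFFER_SIZE*4*2
    return SHADER_WIDTH, height, byte_length


for size in (2048, 8192):
    print(size, shader_layout(size))
```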

Pixel Bender .pbj files

If you have been playing with Pixel Bender in Flash Player 10 you know the workflow:

  • Create your .pbk in the Pixel Bender Toolkit.
  • Export a .pbj binary from the Pixel Bender Toolkit.
  • Embed or dynamically load the .pbj file in your ActionScript.

There is still some mystery around .pbj files: the file format is not documented, nor is it clear what exactly a file contains. While I can’t offer documentation on the file format at this time (although that will happen eventually), what I can offer is an assembler and a disassembler I quickly hacked together. I mostly use these for my own debugging purposes. For those who want to tweak Pixel Bender in Flash to the max this is a really good way to go.

Please note that this is not officially supported by Adobe, nor do I guarantee any correctness or completeness of these two tools. The binary format could change at any time; you should not rely on it. I am really throwing this out to the world for educational purposes. There is no documentation on the syntax or format, nor on how to put together kernels from scratch. It’s really up to you to make any sense out of it, and I do not expect that Adobe will ever use this syntax since I just made it up myself. Since these are quite advanced tools I also assume you know how to compile C++ command line tools yourself ;-)

Enough disclaimers, here is the meat:

http://www.kaourantin.net/source/pbjtools/apbj.cpp
http://www.kaourantin.net/source/pbjtools/dpbj.cpp

But for the lazy I have pre-compiled two Windows binaries of these two command line tools. On OSX you can simply compile these using ‘g++ apbj.cpp -o apbj’ and ‘g++ dpbj.cpp -o dpbj’ if you have the developer tools installed on your system. Here are the Windows binaries:

http://www.kaourantin.net/source/pbjtools/apbj.zip
http://www.kaourantin.net/source/pbjtools/dpbj.zip

It should be fairly clear how to use these if you run them in a command prompt. One of the goals was to allow perfect round tripping, i.e. disassemble->assemble->disassemble without any information loss. I hope I succeeded in this. I also know that someone has already put together an ActionScript version of this, which is quite easily done.

Here is some sample output of the disassembler. The .pbj file was based on a Pixel Bender kernel created by Mr.doob (and btw, I fixed the bug he noticed in his blog entry ;-)):

 version  1
 name  "NewFilter"
 kernel  "namespace", "Hypnotic"
 kernel  "vendor", "Mr.doob"
 kernel  "version", 1
 kernel  "description", "Hypnotic effect"

 parameter "_OutCoord", float2, f0.rg, in

 texture  "src", t0

 parameter "dst", float4, f1, out

 parameter "imgSize", float2, f0.ba, in
 meta  "defaultValue", 512, 512
 meta  "minValue", 0, 0
 meta  "maxValue", 512, 512

 parameter "center", float2, f2.rg, in
 meta  "defaultValue", 256, 256
 meta  "minValue", 0, 0
 meta  "maxValue", 512, 512

 parameter "offset", float2, f2.ba, in

;----------------------------------------------------------

 mov f3.rg, f0.rg
 sub f3.rg, f2.rg
 rcp f3.ba, f0.ba
 mul f3.ba, f3.rg
 mov f3.rg, f3.ba
 set f3.b, 3.14159
 mov f3.a, f3.g
 atan2 f3.a, f3.r
 mov f4.r, f3.a
 set f3.a, 2
 mov f4.g, f3.r
 pow f4.g, f3.a
 set f3.a, 2
 mov f4.b, f3.g
 pow f4.b, f3.a
 mov f3.a, f4.g
 add f3.a, f4.b
 sqr f4.g, f3.a
 mov f3.a, f4.g
 set f4.g, 0
 set f4.b, 0
 set f4.a, 0
 add f4.g, f2.b
 add f4.b, f2.a
 cos f5.r, f4.r
 rcp f5.g, f3.a
 mul f5.g, f5.r
 add f4.g, f5.g
 sin f5.r, f4.r
 rcp f5.g, f3.a
 mul f5.g, f5.r
 add f4.b, f5.g
 set f5.r, 1
 set f5.g, 0.1
 mov f5.b, f3.a
 pow f5.b, f5.g
 rcp f5.g, f5.b
 mul f5.g, f5.r
 add f4.a, f5.g
 mul f4.g, f0.b
 mul f4.b, f0.a
 set f5.r, 0
 ltn f4.g, f5.r
 mov i1.r, i0.r

 if i1.r

 set f5.r, 0
 sub f5.r, f4.g
 rcp f5.g, f0.b
 mul f5.g, f5.r
 ceil f5.r, f5.g
 mov f5.g, f0.b
 mul f5.g, f5.r
 add f4.g, f5.g

 end

 set f5.r, 0
 ltn f4.b, f5.r
 mov i1.r, i0.r

 if i1.r

 set f5.r, 0
 sub f5.r, f4.b
 rcp f5.g, f0.a
 mul f5.g, f5.r
 ceil f5.r, f5.g
 mov f5.g, f0.a
 mul f5.g, f5.r
 add f4.b, f5.g

 end

 ltn f0.b, f4.g
 mov i1.r, i0.r

 if i1.r

 rcp f5.r, f0.b
 mul f5.r, f4.g
 floor f5.g, f5.r
 mov f5.r, f0.b
 mul f5.r, f5.g
 sub f4.g, f5.r

 end

 ltn f0.a, f4.b
 mov i1.r, i0.r

 if i1.r

 rcp f5.r, f0.a
 mul f5.r, f4.b
 floor f5.g, f5.r
 mov f5.r, f0.a
 mul f5.r, f5.g
 sub f4.b, f5.r

 end

 mov f5.r, f4.g
 mov f5.g, f4.b
 texn f6, f5.rg, t0
 mov f1, f6
 mul f1.rgb, f4.aaa
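To give an idea of how to read a listing like this, here is my reading of the four if/end blocks as a minimal Python sketch. They appear to wrap the x and y sample coordinates back into the image, once for negative values and once for values past the image size. The function name wrap_coordinate is made up for illustration, and the sketch assumes that ltn leaves its "less than" result in i0, which is what the mov i1.r, i0.r lines suggest.

```python
import math

def wrap_coordinate(value, size):
    """Wrap one sample coordinate into the image, mirroring the listing.

    value -- the x or y coordinate (f4.g or f4.b in the listing)
    size  -- the matching imgSize component (f0.b or f0.a)
    """
    # ltn value, 0 / if: add ceil(-value / size) whole image sizes.
    if value < 0:
        value += math.ceil(-value * (1.0 / size)) * size
    # ltn size, value / if: subtract floor(value / size) whole image sizes.
    if size < value:
        value -= math.floor(value * (1.0 / size)) * size
    return value
```

Like the assembly, the sketch multiplies by the reciprocal (rcp followed by mul) instead of dividing, since the listing contains no divide instruction.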

On Performance

With the release of Google Chrome I see blogs and articles blaming the Flash Player for poor performance and somehow linking this to the fact that it is not open source. Time to clarify a few bits. I’ll start with classic comments:

“Flash hogs my CPU!”

1. HTML != Flash

HTML is a static document format. Flash (TM) content is in its core a classic multimedia format and most Flash content is still purely passive media.

What does that mean? When rendering HTML pages CPU usage only peaks for a very short amount of time, essentially one single frame in Flash terms. After that almost no resources apart from memory are required. If you do not interact with the HTML page at all, no CPU time is required.

How does Flash compare? Most animated Flash content like rich media advertisement continues to use CPU resources to drive animation, video and/or sound. As opposed to static HTML which has exactly 1 frame, Flash content can have an unlimited number of frames which are played back over time.

Flash is great for providing experiences you could not get otherwise. Animation, video and sound are functions the browser does not (yet) provide, or at least they are not yet used to the same extent by designers. Once the browser is used to play the same type of multimedia content you will face the same resource usage issues. It takes CPU cycles to decode video and sound and to render animation. This is just a fact of life. We are however reducing how much is used release after release, something benchmarks can back up.

So, there is a fundamental difference in media type. HTML is static, Flash is not. To put it in terms you might be able to understand:

If you take a picture and print it out you use energy only once and can then continue to view the picture forever without consuming any further energy. If you record a movie you will need some form of machine, such as a projector, to play it back, and that machine will continue to consume energy. The Flash Player is a projector.

“You are so full of it, AJAX does not hog my CPU!”

2. AJAX != Flash, but when done correctly AJAX can be the same as Flash

In most practical instances AJAX is used to drive RIAs. Examples include Gmail, Google Maps and many others. One fundamental property of good applications is that they only respond to either network activity or user input. Peak CPU usage is limited to these events. In general, if you do not touch the browser page no CPU time is required.

Compare this again to Flash animations, video and sound which in many cases remain passive experiences with no requirement for external events to drive the content. This will obviously use CPU resources continuously.

Now, it is perfectly possible to implement a Flash RIA (that usually means using Flex) which uses the same or even less peak CPU than an AJAX RIA and only responds to network and user input. Flash is a flexible multi-paradigm platform; it depends on what the designer/programmer wanted to do. Unfortunately we at Adobe tend to see a lot of RIAs which do not follow that principle and add lots of moving sugar which does little to improve actual usability.

Following good coding practices Flash can yield equivalent or better results than AJAX for many types of RIAs. Another benefit is that writing RIAs in Flash is truly cross browser as there is one Flash Player implementation only.

“Bull, SVG and Canvas show that it can be done better”

3. SVG/Canvas != Flash

Have you ever seen SVG or the canvas tag being used to implement anything other than static (1-frame) content? Have you ever seen rich media advertisement done using SVG or the canvas tag? I mean not some demo page but actual deployed content. If so you will realize that the same resource usage issues apply.

“You are clueless, why does Flash suck up CPU time when it is on a hidden tab?”

4. Easy shortcuts do not work

Believe it or not but we and the browser vendors have tried to disable/pause/stop Flash content when a tab is hidden. The results were disastrous user experience wise to say the least. Disabling Flash to get any benefit CPU resource wise means the following:

  • Sound will have to stop
  • Any network transfer will have to stop
  • ActionScript execution will have to stop

Each one of these affects CPU usage, and turning any of them off would hurt the user experience. However, the Flash Player does not render anything if it is on a hidden tab; we only execute the operations mentioned in the above list.

There is one exception to the rendering optimization: WMODE. If you use WMODE the Flash Player has no way of knowing if it is hidden or not and will continue to do a full render. Do not use WMODE. Unfortunately lots of rich media advertisement I see out there continues to enable this for no apparent reason.

“Flash sucks!”

5. You can help to educate web designers so common mistakes are not made

A huge help would be to adopt strict policies, especially for rich media advertisement. I like the rules Google has put forward for Flash ads. Quoting:

“Animation Length: Animated ads are restricted to a maximum of 15 seconds (at a 15-20 fps frame rate), after which point they must remain static. These ads must also comply with the other animation policies.”

Personally I would go even further and request the following:

  • After the animation has played no CPU resources should be used, ActionScript should be on a stop() command.
  • Mouse tracking or other event handling is not allowed unless you activate the banner with a mouse click.
  • DO NOT USE WMODE UNLESS YOU ABSOLUTELY NEED TRANSPARENCY! I can’t stress that enough. Given the architecture of plugins there is no way for the Flash Player to know if Flash content is on a hidden tab or not and disable rendering properly. If you use WMODE the Flash Player will continue to suck up CPU cycles as if the tab was visible. In addition WMODE is much slower than the normal mode.

These simple rules would address almost all the complaints we hear about. Adobe unfortunately has only limited influence on what content gets deployed. In this case it is really up to the community to balk at the web sites putting up content which impacts user experience negatively.

——————–

Like with any powerful technology it is easy to shoot yourself in the foot and with the ease of use of Flash that is unfortunately too common.

Despite that, we are working with all browser vendors to improve performance and user experience whenever possible. There are differences between browsers and our goal is to close this gap once and for all. We are for example looking forward to working together with Google to improve Flash performance in Google Chrome.

On our (Adobe) side we are also looking forward to improving Flash performance further. Flash Player 10 for instance is taking the first steps towards hardware accelerated rendering, which will provide a huge boost in rendering performance. On the scripting side Tamarin-tracing will improve scripting performance dramatically. This is work we share with the Mozilla Foundation, which will use the same core libraries under the TraceMonkey project. The latest benchmarks are quite remarkable.

Adobe Flash Player 10 pre-release refresh

We just released another pre-release of Flash Player 10 (build 10.0.1.525). Go get it here and make sure you read the release notes. As a reminder, as we are nearing the release it becomes increasingly difficult for us to address bugs, especially if they are not crashers. If you have backwards compatibility issues (and I almost guarantee you that there will be some which will affect your content) please report them here (registration required) or here (no registration required).

There have been numerous stability and performance improvements. The most important additions are support for WMODE=transparent and V4L2 cameras (which is still work in progress) on Linux which addresses two of the top 3 feature requests on this platform.

If you have followed GUIMark at all you will notice that this version of the player runs this benchmark substantially better on OSX than any previous Flash Player version. It should be up to 3 times faster. How will this affect you? Well, OSX device text rendering got a huge performance boost. If you use lots of device text you will see a big difference. I posted more details in a comment here.

Finally the dynamic sound APIs have slightly changed as I announced previously. I will be updating my posts [1][2] later today.

Adobe Pixel Bender in Flash Player 10 Beta

Lee Brimelow has posted a snippet of code showing how to use Adobe Pixel Bender kernels in Flash Player 10. Time for me to go into details about this feature. As usual this feature holds some surprises and unexpected behavior. I’ll keep this post free of sample code, but I promise to show some samples soon.

A long time ago, back in Flash Player 8 days we had the idea of adding a generic way to do bitmap filters. Hard coding bitmap filters like we did for Flash Player 8 is not only not flexible, but has the burden of adding huge amounts of native code into the player and having to optimize it for each and every platform. The issue for us has always been how you would author such generic filters. Various ideas were floating around but in the end there was one sticking point: we had no language and no compiler. After Macromedia’s merger with Adobe the Flash Player and the Adobe Pixel Bender team came together and we finally had what we needed: a language and a compiler.

The Pixel Bender runtime in the Flash Player is drastically different from what you find in the Adobe Pixel Bender Toolkit. The only connection between the two is the byte code the toolkit generates, which it writes to files with the .pbj extension. A .pbj file contains a binary representation of the opcodes/instructions of your Pixel Bender kernel, much the same way a .swf contains ActionScript 3 byte code. The byte code itself is designed to translate well into a number of different run times, but for this Flash Player release the focus was a software run time.

You heard right, software run time. Pixel Bender kernels do not run using any GPU functionality whatsoever in Flash Player 10.

Take a breath. :-)

Running filters on a GPU has a number of critical limitations. If we had supported the GPU to render filters in this release we would have had to fall back to software in many cases, even if you have the right hardware. And then there is the little issue that we would only have enabled this in the ‘gpu’ wmode. So it is critical to have a well performing software fallback; and I mean one which does not suck like some other frameworks which we tried first (and which I will not mention by name). A good software implementation also means you can reach more customers who simply do not have the required hardware, which is probably 80-90% of the machines connected to the web out there. Lastly this is the only way we can guarantee somewhat consistent results across platforms, although I have to point out that you’ll see differences which are the result of compromises to get better performance.

So why did we not just integrate what the Adobe Pixel Bender Toolkit does, which does support GPUs? First, we need to run on 99% of all the machines out there, down to a plain Pentium I with MMX support running at 400 MHz. Secondly, I would hate to see the Flash Player installer grow by 2 or 3 megabytes in download size. That’s not what the Flash Player is about. The software implementation in Flash Player 10 as it stands now clocks in at about 35KB of compressed code. I am perfectly aware that some filters would get faster by two orders of magnitude(!) on a GPU. We know that too well and for this release you will have to deal with this limitation. The important thing to take away here is: A kernel which runs well in the toolkit might not run well at all in the Flash Player.

But… I have more news you might not like. ;-) If you ever run a Pixel Bender filter on a PowerPC based Mac you will see that it runs about 10 times slower than on an Intel based Mac. For this release we only had time to implement a JIT code engine for Intel based CPUs. On a PowerPC Mac Pixel Bender kernels will run in interpreted mode. I leave it up to you to make a judgment of how this will affect you. All I can say: Be careful when deploying content using Pixel Bender filters, know your viewers.

Now for some more technical details: the JIT for Pixel Bender filters in Flash Player 10 supports various instruction sets, down to plain x87 floating point math and up to SSE2 for some operations like texture sampling, which usually take the most time. Given that these filters work like shaders, i.e. they are embarrassingly parallel, running Pixel Bender kernels scales linearly with the number of CPUs/cores you have in your machine. On an 8-core machine you will usually be limited by memory bandwidth. Here is a CPU readout on my MacPro when I run a filter on a large image (3872×2592 pixels):

There are 4 different ways of using Pixel Bender kernels in the Flash Player. Let me start with the most obvious ones and work down to the more interesting case:

  • Filters. Use a Pixel Bender kernel as a filter on any DisplayObject. Obvious.
  • Fill. Use a Pixel Bender kernel to define your own fill type. Want a fancy star shaped high quality gradient? A nice UV gradient? Animated fills? No problem.
  • Blend mode. Not happy with the built-in blend modes? Simply build your own.

What about the 4th? Well, as you can see the ones in the list are designed for graphics only. The last one is more powerful than that. Instead of targeting a specific graphics primitive in the Flash Player, you can target BitmapData objects, ByteArrays or Vectors. Not only that but if you use ByteArray or Vector the data you handle are 32-bit floating point numbers for each channel, unlike BitmapData which is limited to 8-bit unsigned integers per channel. In the end this means you can use Pixel Bender kernels to not only do graphics stuff, but generic number crunching. If you can accept the 32-bit floating point limitation.
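To make the precision difference concrete, here is a small Python sketch (Python rather than ActionScript so it runs stand-alone). The helper names through_8bit and through_float32 are made up for illustration; they simply round-trip a number through each representation and do not model how the Flash Player stores data internally.

```python
import struct

def through_8bit(value):
    """Round-trip a 0..1 channel value through an 8-bit unsigned integer."""
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255) / 255

def through_float32(value):
    """Round-trip a value through a 32-bit IEEE floating point number."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

sample = 0.1234
error_8bit = abs(through_8bit(sample) - sample)        # roughly 2e-3
error_float32 = abs(through_float32(sample) - sample)  # below 1e-7
```

An 8-bit channel also clamps anything outside 0..1, while a float keeps it. The flip side is the 32-bit limitation mentioned above: integers above 2^24 no longer all fit, so 16777217 comes back as 16777216.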

This 4th way of using Pixel Bender kernels runs completely separate from your main ActionScript code. It runs in a separate thread, which allows you to keep your UI responsive even if a Pixel Bender kernel takes a very long time to complete. This works fairly similarly to a URLLoader. You send a request with all the information, including your source data, output objects, parameters etc. and a while later an event is dispatched telling you that it is finished. This will be great for any application which wants to do heavy processing.

In my next post I will show some concrete examples of how you would use Pixel Bender kernels in these different scenarios. For now I’ll let this information sink in.
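The request/notification pattern can be sketched in a few lines of Python (again Python rather than ActionScript so it runs stand-alone). Everything named here, invert_kernel and start_job included, is hypothetical and only stands in for a kernel and a job object. One difference to keep in mind: in this sketch the callback fires on the worker thread, whereas Flash dispatches its completion event to your ActionScript code.

```python
from concurrent.futures import ThreadPoolExecutor

def invert_kernel(pixels):
    """Stand-in for a slow kernel: invert every value in a list of floats."""
    return [1.0 - p for p in pixels]

def start_job(executor, kernel, source, on_complete):
    """Run kernel(source) on a worker thread and call on_complete(result) when done."""
    future = executor.submit(kernel, source)
    future.add_done_callback(lambda done: on_complete(done.result()))
    return future

results = []
with ThreadPoolExecutor(max_workers=1) as executor:
    job = start_job(executor, invert_kernel, [0.0, 0.25, 1.0], results.append)
    # The caller is free to do other work here while the job runs.
# Leaving the with-block waits for the worker, so the callback has fired by now.
```

The point of the design is that the caller never blocks on the kernel itself; it hands over the inputs and an output target and gets notified later.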

What follows are a few random technical nuggets I noted in my specification when it comes to the implementation in the Flash player, highly technical but important to know if you are pushing the limits of this feature:

  • The internal RGB color space of the Flash Player is alpha pre-multiplied and that is what the Pixel Bender kernel gets.
  • Output color values are always clamped against the alpha. This is not the case when the output is a ByteArray or Vector.
  • The maximum native JIT code buffer size for a kernel is 32KB. If you hit this limit, which can happen with complex filters, the Flash Player falls back to interpreted mode like it does in all cases on PowerPC based machines.
  • You can freely mix linear and nearest sampling in your kernel.
  • Maximum coordinate range is 24bit, that means for values outside the range of -4194304..4194303 coordinates will wrap when you sample and not clamp correctly anymore.
  • The linear sampler does sample up to 8bit of sub pixel information, meaning you’ll get a maximum of 256 steps. This is also the case if you sample from a ByteArray with floating point data.
  • Math functions apart from simple multiplication, division, addition and subtraction work slightly differently on different platforms, depending on the C-library implementation or CPU.
  • In the Flash Player 10 beta vecLib is used on OSX for math functions, which leads to slightly different results on OSX. This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC.)
  • The JIT does not do intelligent caching. In the case of fills that means that each new fill will create a new code section and rejit.
  • There are usually 4 separate JIT’d code sections which each handle different total pixel counts, from 1 pixel to 4 pixels at a time. This is required for anti-aliasing as the rasterizer works with single pixel buffers in this case.
  • When an if-else-endif statement is encountered, the JIT switches to scalar mode, i.e. the if-else-endif section will be expanded up to 4 times as scalar code. Anything outside of an if-else-endif block is still processed as vectors. It’s best to move sampling outside of if statements if practical. The final write to the destination is always vectorized.
  • The total amount of JIT’d code is limited by the virtual address space. Each code section reserves 128Kbytes (4*32KB) of virtual address space.
  • The first 4 pixels rendered by every instance of a shader are run in interpreted mode; the native code generation is done during that first run. You might get artifacts if you depend on limit values as the interpreted mode uses different math functions. If you are on a multicore system, every new span rendered will create a new instance of a shader, i.e. the code is JIT’d 8*4 times on an 8-core system. This way the JIT works completely without locks.
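A few of these nuggets are easier to picture with numbers, so here is a rough Python sketch of three of them: pre-multiplied alpha, clamping color against alpha, and linear sampling with only 8 bits of sub pixel precision. All three function names are made up for illustration, and the exact rounding the Flash Player applies to the sub pixel fraction (truncation here) is an assumption on my part, not something taken from the implementation.

```python
def premultiply(r, g, b, a):
    """Convert straight RGBA (0..1 floats) to alpha pre-multiplied RGBA."""
    return (r * a, g * a, b * a, a)

def clamp_to_alpha(r, g, b, a):
    """Clamp pre-multiplied color channels into the range 0..alpha."""
    a = min(max(a, 0.0), 1.0)
    return tuple(min(max(c, 0.0), a) for c in (r, g, b)) + (a,)

def linear_sample(samples, x):
    """1D linear interpolation whose fraction is quantized to 1/256 steps.

    Assumes 0 <= x; the truncation of the fraction is an assumption.
    """
    base = min(int(x), len(samples) - 1)
    step = int((x - base) * 256)          # 0..255, i.e. 8 bits of sub pixel
    frac = step / 256
    nxt = min(base + 1, len(samples) - 1)
    return samples[base] * (1 - frac) + samples[nxt] * frac
```

With this quantization, sampling at 0.5 and at 0.501 between two pixels gives the same result, which is the "maximum of 256 steps" limit in practice.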