Core Animation

Amazing, so many things have happened in the Flash Player engineering team over the past year. Lots I would love to talk about. But the purpose of this post is to deep dive into a subject Kevin Lynch touched upon recently, specifically Mac performance and his comment about Core Animation. Whenever performance is mentioned in the context of Flash it attracts a lot of attention, and some of the technical background gets lost in the PR.

So what’s the deal with Core Animation in Flash Player 10.1? Let’s look at how Apple’s documentation summarizes what Core Animation does:

Core Animation is an Objective-C framework that combines a high-performance compositing engine with a simple to use animation programming interface.

Sounds like a perfect match for Flash, does it not? So yes, Flash Player 10.1 is attempting to leverage this framework to work around a few specific technical issues we’ve had in Safari and all other browsers on OS X.

The drawing model jungle on OS X

Before going into more specifics of why we are moving towards Core Animation, let’s get an overview of how plugins on OS X draw into the browser window. There are 4 possible ways (compared to one on Windows):

  1. QuickDraw. Default mode used by Opera, older Firefox and Safari versions.
  2. Quartz 2D (a.k.a. Core Graphics). Supported by newer versions of Firefox and Safari.
  3. OpenGL. No browser I know of supports this properly today.
  4. Core Animation. Only available in Safari 4 + OS X 10.6 right now, with caveats in the current version.

In addition to these drawing models designers can embed Flash content in 3 different ways by specifying wmode:

  1. Normal
  2. Opaque
  3. Transparent

Normal means that you can’t have overlapping HTML sitting on top of your SWF, Opaque allows it, and Transparent means that the SWF is transparent and underlying HTML content will show through. Taking all these variables into account we come up with these tables, which show when a particular drawing model is used (subject to change before we release Flash Player 10.1):

Flash Player 10.0:

wmode        Safari 4    Firefox 3   Opera 10
Normal       Quartz 2D   QuickDraw   QuickDraw
Opaque       Quartz 2D   QuickDraw   QuickDraw
Transparent  Quartz 2D   QuickDraw   QuickDraw

Flash Player 10.1:

wmode        Safari 4 (*)     Firefox 3   Opera 10
Normal       Core Animation   Quartz 2D   QuickDraw
Opaque       Quartz 2D (**)   Quartz 2D   QuickDraw
Transparent  Quartz 2D (**)   Quartz 2D   QuickDraw

(*) Actually using nightly builds of WebKit because support for Core Animation is work in progress.
(**) Core Animation is used when the SWF is the front most object on the HTML page.

What are the issues with Quartz 2D?

The basic premise of Quartz 2D as Apple describes it:

Quartz 2D is an advanced, two-dimensional drawing engine available for iPhone application development and to all Mac OS X application environments outside of the kernel. Quartz 2D provides low-level, lightweight 2D rendering with unmatched output fidelity regardless of display or printing device.

Quartz 2D is not designed for multimedia applications, like animation or video playback. That’s where OpenGL, Core Video and Core Animation shine. Safari’s use of Quartz 2D to draw HTML content makes perfect sense as its content is static in most cases. Everything works well until Flash comes into the picture. For instance, when the Flash Player plays a SWF using the Quartz 2D drawing model it has to do so with the full involvement of the browser. The sequence of events looks like this (you can follow the stack traces in Shark):

  1. Whenever the Flash Player is ready to display a new frame, the Flash Player requests a refresh of its region using NPN_InvalidateRect.
  2. The browser adds the rectangle provided by the Flash Player to its dirty region.
  3. The browser traverses its own display list (the HTML DOM) and paints every node which is part of the dirty region.
  4. When the browser finds a node with a Flash Player instance it first draws the HTML background and then posts an event to the Flash Player to tell it that it has to paint over the requested region now.
  5. The Flash Player then finally draws its frame.

So far so good, makes sense I hope. So what’s the technical issue? Think of a fairly complex HTML page, for instance a page with a CSS gradient in the background. Add a SWF which runs at 30 frames/sec. You will see that a lot of time is spent in the browser, not in the Flash Player. This is where Core Animation kicks in: steps 3 and 4 pretty much go away (as long as the SWF is the top most object).
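
To make the cost of steps 2 to 4 concrete, here is a toy model in ActionScript 3. It is only a sketch: the page layout, the rectangle sizes and the nodesRepainted helper are invented for illustration and do not correspond to any real browser or NPAPI interface. It counts how many page elements the browser has to repaint each time the plugin invalidates its own rectangle.

```actionscript
package {
    import flash.display.Sprite;
    import flash.geom.Rectangle;

    // Toy model of the Quartz 2D invalidation path.
    // None of these names are real browser or NPAPI APIs.
    public class DirtyRegionModel extends Sprite {
        public function DirtyRegionModel() {
            // Hypothetical page: a full-page gradient, a sidebar and the plugin.
            var nodes:Vector.<Rectangle> = new <Rectangle>[
                new Rectangle(0, 0, 1024, 768),    // CSS gradient background
                new Rectangle(0, 0, 200, 768),     // sidebar, left of the plugin
                new Rectangle(300, 100, 550, 400)  // the plugin instance itself
            ];
            var pluginRect:Rectangle = nodes[2];

            // Step 1: the plugin invalidates its own rectangle once per frame.
            trace(nodesRepainted(nodes, pluginRect)); // 2
        }

        // Steps 2 and 3: every node intersecting the dirty rectangle is repainted.
        public static function nodesRepainted(nodes:Vector.<Rectangle>,
                                              dirty:Rectangle):int {
            var count:int = 0;
            for each (var r:Rectangle in nodes) {
                if (r.intersects(dirty)) {
                    count++;
                }
            }
            return count;
        }
    }
}
```

With this hypothetical layout every plugin frame also repaints the gradient background under the SWF, so a SWF running at 30 frames/sec causes 30 background repaints per second even though the background never changes.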

Core Animation in the Flash Player

Flash Player 10.1 implements the Core Animation drawing model to fix this technical issue, among others. Instead of using a CGImageRef + CGContextDrawImage to get the bits to screen we pass a CAOpenGLLayer to Safari and use an OpenGL texture of type GL_TEXTURE_RECTANGLE_ARB to get our bits to the screen.

The support for the Core Animation drawing model was originally driven by Apple and we have worked feverishly to finish the engineering work on both sides. Yes that’s right: This was and is a joint effort between Apple and Adobe engineers. Given the now almost perfect integration of Core Animation plugins into Safari I hope that future versions of the Flash Player will take advantage of more capabilities of OpenGL. And that without the requirement of setting any special wmode. I am pretty stoked about it.

As of today (2/10/2010) we are getting closer to having it stable enough for public consumption. That means though: You will need Flash Player 10.1, OS X 10.6 and an updated version of Safari (or the nightly WebKit build), otherwise you will not see any difference.

What difference does it really make?

This is by no means a panacea for all performance issues in the Flash Player. Far from it. But it is a small step towards a larger goal, which is to improve the experience in the browser with the ever more complex web content out there. That said, here is a comparison between Flash Player 10.0 and Flash Player 10.1 using this test case (this only works in Safari). Keep in mind that this is an extreme test case which has little to do with real world web content.

Flash Player 10.0 + nightly WebKit + OS X 10.6

Flash Player 10.1 + nightly WebKit + OS X 10.6

PS: You might have noticed that Core Animation is a Cocoa API. Yes, Flash Player 10.1 is a true Cocoa app now (with a Carbon fallback to support Firefox and Opera which are not Cocoa yet).

64-bits

Today we announced the availability of the Adobe Flash Player browser plugin for x86_64 Linux distributions. It is a preview release which has known bugs but should be fit for initial testing by the community.

With this release we are tackling one of the most requested features ever for the Linux version of the Flash Player, requested even more often than windowless mode support. My personal hope though is that the constant flood of complaints we get about this every day will finally come to an end.

I showed the 64-bit Linux version to the public a couple of months ago at the Flashforward 2008 conference, and I have spent time stabilizing it since then. Sadly we have received zero contributions to make the open source Tamarin VM work properly in 64-bit mode on Linux, which might have allowed us to get this done sooner. The good news for me personally is that I now have a thorough understanding of the different x86-64 calling conventions and the instruction set.

The areas which remain untested and non-functional are the following:

  • Camera
  • Microphone
  • Fullscreen playback using OpenGL
  • Various Flash Player 10 specific features

This initial version has no .rpm or .deb packages and is therefore intended for advanced users. We still have work to do to reduce the number of dependencies of the binary and handle situations where installing this plugin might conflict with packages like nspluginwrapper.

Talking about nspluginwrapper: I strongly suggest not using it. I know that some distros, including Ubuntu, are even thinking of wrapping 64-bit plugins with the thought that it will improve security and stability of the browser. This is a very bad idea in the state nspluginwrapper is in today. We have done some internal testing and discovered that several features in the Flash Player are broken when the plugin is wrapped. More importantly, performance and user experience are pretty bad when the plugin is wrapped. Why? Lots of data needs to be transferred through IPC channels. I hope that browser vendors will eventually come up with a better architecture to wrap plugins without sacrificing performance, stability and functionality.

Please do report bugs or other issues you find using the right channel. And that means our public bug database:

https://bugs.adobe.com/flashplayer/

Comments on blogs, other web sites or 3rd party bug databases are not tracked by our quality assurance team. You are welcome to cross reference when you submit bugs though.

Now for some random items:

  • All SIMD optimizations which were in the 32-bit version have been ported. No exceptions.
  • The 64-bit version is not faster than the 32-bit version as the 32-bit version was already well optimized. Scott Byer explained why some time ago: 64 bits…when?. You will see a big difference though when you get rid of nspluginwrapper.
  • The first release of 64-bit Flash Player code was actually part of Adobe Photoshop Lightroom 64-bit in form of an authplay.dll which is a plugin for applications. Second one was with Adobe Photoshop CS4 64-bit, also as an authplay.dll.
  • The first 64-bit plugin for a browser we ship is this Linux version. Windows and Mac will come later.
  • The 64-bit version of the plugin compiles and runs on FreeBSD 7.0 which I demoed at Flashforward 2008. There are no plans for release yet as it is still rather unstable and will require substantial work to get it ready for public consumption.
  • A debugger version of the 64-bit version is not available yet. When we release it ActionScript 2 debugging will not work due to the obsolete protocol, which depends on 32-bit pointers. ActionScript 3 debugging will be supported.

Audio mixing with Pixel Bender

Time to have some advanced fun with Pixel Bender. Recently someone in the community complained to us that mixing 13 mp3 tracks using the dynamic sound playback feature in Flash Player 10 does not really work. Well, true with the sample project he gave us. Doing dynamic sound playback is generally tricky to get right. I can provide a few tips though.

1. Pick the right mp3 encoding format

It’s important you pick a format which consumes the least amount of CPU time for decoding. Specifically you should always choose 44.1 kHz as the sample rate for your mp3 files. Why? The Flash Player will otherwise have to re-sample and filter your audio, which takes away precious CPU cycles.

The tricky part here is that mp3 encoders, including Adobe Audition, usually pick the sample rate automatically. Especially at a bit rate of 64 kbit/s or less the encoder will try to switch to 24 kHz or 22 kHz. You can override this, at least for CBR, in Audition using the advanced settings at export time if you need to.

2. Keep things simple

Do all your processing in one function if you can. Function calls are expensive generally. Try to read and write data only once. Ideally your mixing code should look something like this if you use pure ActionScript:

// Each buffer[i] must hold a ByteArray and each sound[i] a loaded Sound
// before playback starts.
var buffer:Vector.<ByteArray> = new Vector.<ByteArray>(NUM_TRACKS);
var sound:Vector.<Sound> = new Vector.<Sound>(NUM_TRACKS);

function onSoundData(sampleDataEvent:SampleDataEvent) : void
{
    // Decode the next BUFFER_SIZE stereo samples of every track.
    for (var i:int = 0; i < NUM_TRACKS; i++) {
        buffer[i].position = 0;
        sound[i].extract(buffer[i], BUFFER_SIZE);
        buffer[i].position = 0;
    }

    // BUFFER_SIZE*2 floats per track: left and right samples are interleaved.
    for (var j:int = 0; j < BUFFER_SIZE*2; j++)
    {
        var val:Number = 0;
        for (var k:int = 0; k < NUM_TRACKS; k++)
        {
            val += buffer[k].readFloat();
        }
        sampleDataEvent.data.writeFloat(val);
    }
}

You will notice that you will spend a lot of time in this function. So…

3. Use Pixel Bender to mix sounds

I have talked to some who have tried to use Pixel Bender for audio processing. They had little success most of the time. Truth is, our tools are not ready yet. But with some patience and using the assembler for creating .pbj files I posted recently you can make it happen today.

One problematic issue is that right now the Pixel Bender toolkit is designed to handle image data. What does that mean? The toolkit limits you to float3 and float4 output types right now, which is not really what you want. Now you might think you could just use float4. Not so. You will notice that Flash Player 10 has a pretty bad bug which makes it not work when float4 types are used for output. I am really angry we did not catch this sooner. Hopefully we can address this bug sooner rather than later. What you are left with for now is using pure Pixel Bender assembly code, which allows you to use a float2 output type.

For my experiment I took the Adobe Audition theme sample project and exported all tracks as separate .mp3 files, 15 in total. Incidentally that is also the maximum number of inputs you can use for a single shader. The goal was to mix all 15 tracks in real time using the dynamic sound playback feature.

Here is the Pixel Bender assembly code I used to create my .pbj file:


version 1
name "SoundMixer"
kernel "namespace", "adobe"
kernel "vendor", "Adobe Systems"
kernel "version", 1
kernel "description", "A generic sound mixer with volume control"

parameter "_OutCoord", float2, f0.rg, in

texture "track0", t0.rg
texture "track1", t1.rg
texture "track2", t2.rg
texture "track3", t3.rg
texture "track4", t4.rg
texture "track5", t5.rg
texture "track6", t6.rg
texture "track7", t7.rg
texture "track8", t8.rg
texture "track9", t9.rg
texture "track10", t10.rg
texture "track11", t11.rg
texture "track12", t12.rg
texture "track13", t13.rg
texture "track14", t14.rg

parameter "volume0", float2, f3.rg, in
meta "defaultValue", 1, 1
parameter "volume1", float2, f4.rg, in
meta "defaultValue", 1, 1
parameter "volume2", float2, f5.rg, in
meta "defaultValue", 1, 1
parameter "volume3", float2, f6.rg, in
meta "defaultValue", 1, 1
parameter "volume4", float2, f7.rg, in
meta "defaultValue", 1, 1
parameter "volume5", float2, f8.rg, in
meta "defaultValue", 1, 1
parameter "volume6", float2, f9.rg, in
meta "defaultValue", 1, 1
parameter "volume7", float2, f10.rg, in
meta "defaultValue", 1, 1
parameter "volume8", float2, f11.rg, in
meta "defaultValue", 1, 1
parameter "volume9", float2, f12.rg, in
meta "defaultValue", 1, 1
parameter "volume10", float2, f13.rg, in
meta "defaultValue", 1, 1
parameter "volume11", float2, f14.rg, in
meta "defaultValue", 1, 1
parameter "volume12", float2, f15.rg, in
meta "defaultValue", 1, 1
parameter "volume13", float2, f16.rg, in
meta "defaultValue", 1, 1
parameter "volume14", float2, f17.rg, in
meta "defaultValue", 1, 1

parameter "output", float2, f1.rg, out

;----------------------------------------------------------

texn f1.rg, f0.rg, t0
mul f1.rg, f3.rg
texn f2.rg, f0.rg, t1
mul f2.rg, f4.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t2
mul f2.rg, f5.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t3
mul f2.rg, f6.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t4
mul f2.rg, f7.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t5
mul f2.rg, f8.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t6
mul f2.rg, f9.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t7
mul f2.rg, f10.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t8
mul f2.rg, f11.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t9
mul f2.rg, f12.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t10
mul f2.rg, f13.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t11
mul f2.rg, f14.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t12
mul f2.rg, f15.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t13
mul f2.rg, f16.rg
add f1.rg, f2.rg
texn f2.rg, f0.rg, t14
mul f2.rg, f17.rg
add f1.rg, f2.rg

Looks complicated, but in fact this does nothing more than the above ActionScript code, with unrolled loops. As an extra you can control the volume on each track.
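
For readers who prefer ActionScript to assembly, here is a rough scalar model of what the kernel computes for one channel of one sample. It is a sketch under an assumption: the undocumented two-operand instructions are read as "destination = destination op source", which matches how the listing accumulates into f1. The class name, the mix helper and the sample values are made up for illustration.

```actionscript
package {
    import flash.display.Sprite;

    // Scalar reference model of the SoundMixer kernel (one channel, one sample).
    public class MixerModel extends Sprite {
        public function MixerModel() {
            // Three hypothetical tracks at one sample position, with volumes.
            var samples:Vector.<Number> = new <Number>[0.5, -0.25, 0.125];
            var volumes:Vector.<Number> = new <Number>[1.0, 1.0, 0.5];
            trace(mix(samples, volumes)); // 0.3125
        }

        // Mirrors the texn/mul/add pattern: scale each track, then accumulate.
        public static function mix(samples:Vector.<Number>,
                                   volumes:Vector.<Number>):Number {
            // texn f1, ..., t0 ; mul f1, volume0
            var out:Number = samples[0] * volumes[0];
            for (var k:int = 1; k < samples.length; k++) {
                // texn f2, ..., tK ; mul f2, volumeK ; add f1, f2
                out += samples[k] * volumes[k];
            }
            return out;
        }
    }
}
```

The kernel does this for both channels at once because every register access uses the .rg pair, one component per stereo channel.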

To use the shader I wrote this little piece (note that this is incomplete code, it will not compile):


// Create shader
[Embed(source="mixer.pbj", mimeType="application/octet-stream")]
var MixerShader:Class;
var mixerShader:Shader = new Shader(new MixerShader());

// buffers will become shader inputs
var buffer:Vector.<ByteArray> = new Vector.<ByteArray>(15);
// per-track volume, 1.0 is full volume
var volume:Vector.<Number> = new Vector.<Number>(15);

// initialize the shader inputs and volume values
for (var j:int = 0; j < 15; j++) {
    volume[j] = 1.0;
    buffer[j] = new ByteArray();
    // pre-size every input so the shader always works, even with fewer than 15 tracks
    buffer[j].length = BUFFER_SIZE*4*2;
    mixerShader.data["track"+j]["width"] = 1024;
    mixerShader.data["track"+j]["height"] = BUFFER_SIZE/1024;
    mixerShader.data["track"+j]["input"] = buffer[j];
}

function onSoundData(e:SampleDataEvent) : void
{
    // extract the mp3 data into our shader inputs
    // (sounds is the Vector.<Sound> of loaded tracks, declared elsewhere)
    for (var i:int = 0; i < NUM_TRACKS; i++) {
        buffer[i].position = 0;
        sounds[i].extract(buffer[i], BUFFER_SIZE);
        buffer[i].position = 0;
    }
    // update the volume value in the shader
    for (var k:int = 0; k < NUM_TRACKS; k++) {
        mixerShader.data["volume"+k]["value"] = [ volume[k], volume[k] ];
    }
    // mix! The ByteArray of the event is the shader output.
    var mixerJob:ShaderJob = new ShaderJob(mixerShader, e.data, 1024, BUFFER_SIZE/1024);
    mixerJob.start(true); // true = run synchronously
}

Compared to the pure AS3 version this runs twice as fast overall. On my Core 2 Mac mixing the 15 tracks consumes about 24% of one CPU. So if you are doing simple audio mixing like this Pixel Bender is a good choice. YMMV depending on what application we are talking about and how much processing you need to do on the audio.
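
The width, height and length values in the shader setup follow from how the audio is laid out: Sound.extract() writes 32-bit float stereo samples, so each sample is one float2 "pixel" of 8 bytes. The sketch below only checks that arithmetic. BUFFER_SIZE is not stated in the post, so 4096 is an assumed value (a SampleDataEvent handler must supply between 2048 and 8192 samples), and the class name is hypothetical.

```actionscript
package {
    import flash.display.Sprite;

    // Checks the shader input geometry for an assumed BUFFER_SIZE.
    public class ShaderInputLayout extends Sprite {
        // Assumption for illustration; must be a multiple of WIDTH.
        private static const BUFFER_SIZE:int = 4096;
        private static const WIDTH:int = 1024;

        public function ShaderInputLayout() {
            // One float2 per stereo sample, arranged as WIDTH x rows "pixels".
            var rows:int = BUFFER_SIZE / WIDTH;
            // 4 bytes per float, 2 floats (left, right) per sample.
            var bytes:int = BUFFER_SIZE * 4 * 2;

            trace(rows, bytes, WIDTH * rows == BUFFER_SIZE); // 4 32768 true
        }
    }
}
```

If BUFFER_SIZE is not a multiple of 1024, the computed height no longer covers the whole buffer, so it is safest to pick 2048, 3072, 4096 and so on.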

Pixel Bender .pbj files

If you have been playing with Pixel Bender in Flash Player 10 you know the workflow:

  • Create your .pbk in the Pixel Bender Toolkit.
  • Export a .pbj binary from the Pixel Bender Toolkit.
  • Embed or dynamically load the .pbj file in your ActionScript.

There is still some mystery around .pbj files, i.e. the file format is neither documented nor is it clear what exactly it contains. While I can’t offer documentation on the file format at this time (although that will happen eventually), what I can offer is an assembler and disassembler I quickly hacked together. I am mostly using this for my own debugging purposes. For those who want to tweak Pixel Bender in Flash to the max this is a really good way to go.

Please note that this is neither officially supported by Adobe nor do I guarantee any correctness or completeness of these two tools. The binary format could change at any time, you should not rely on it. I am really throwing this out to the world for educational purposes. There is no documentation on the syntax or format, nor on how to put together kernels from scratch. It’s really up to you to make any sense out of it, and I do not expect that any Adobe tool will ever use this syntax since I just made it up myself. Since these are quite advanced tools I also assume you know how to compile C++ command line tools yourself ;-)

Enough disclaimers, here is the meat:

http://www.kaourantin.net/source/pbjtools/apbj.cpp
http://www.kaourantin.net/source/pbjtools/dpbj.cpp

But for the lazy I have pre-compiled two Windows binaries of these two command line tools. On OSX you can simply compile these using ‘g++ apbj.cpp -o apbj’ and ‘g++ dpbj.cpp -o dpbj’ if you have the developer tools installed on your system. Here are the Windows binaries:

http://www.kaourantin.net/source/pbjtools/apbj.zip
http://www.kaourantin.net/source/pbjtools/dpbj.zip

It should be fairly clear how to use these if you run them in a command prompt. One of the goals was to allow perfect round tripping, i.e. disassemble->assemble->disassemble without any information loss. I hope I succeeded in this. I also know that someone has already put together an ActionScript version of this, which is quite easily done.

Here is some sample output of the disassembler, the .pbj file was based on a Pixel Bender kernel created by Mr.doob (and btw, I fixed the bug he noticed in his blog entry ;-) :


version 1
name "NewFilter"
kernel "namespace", "Hypnotic"
kernel "vendor", "Mr.doob"
kernel "version", 1
kernel "description", "Hypnotic effect"

parameter "_OutCoord", float2, f0.rg, in

texture "src", t0

parameter "dst", float4, f1, out

parameter "imgSize", float2, f0.ba, in
meta "defaultValue", 512, 512
meta "minValue", 0, 0
meta "maxValue", 512, 512

parameter "center", float2, f2.rg, in
meta "defaultValue", 256, 256
meta "minValue", 0, 0
meta "maxValue", 512, 512

parameter "offset", float2, f2.ba, in

;----------------------------------------------------------

mov f3.rg, f0.rg
sub f3.rg, f2.rg
rcp f3.ba, f0.ba
mul f3.ba, f3.rg
mov f3.rg, f3.ba
set f3.b, 3.14159
mov f3.a, f3.g
atan2 f3.a, f3.r
mov f4.r, f3.a
set f3.a, 2
mov f4.g, f3.r
pow f4.g, f3.a
set f3.a, 2
mov f4.b, f3.g
pow f4.b, f3.a
mov f3.a, f4.g
add f3.a, f4.b
sqr f4.g, f3.a
mov f3.a, f4.g
set f4.g, 0
set f4.b, 0
set f4.a, 0
add f4.g, f2.b
add f4.b, f2.a
cos f5.r, f4.r
rcp f5.g, f3.a
mul f5.g, f5.r
add f4.g, f5.g
sin f5.r, f4.r
rcp f5.g, f3.a
mul f5.g, f5.r
add f4.b, f5.g
set f5.r, 1
set f5.g, 0.1
mov f5.b, f3.a
pow f5.b, f5.g
rcp f5.g, f5.b
mul f5.g, f5.r
add f4.a, f5.g
mul f4.g, f0.b
mul f4.b, f0.a
set f5.r, 0
ltn f4.g, f5.r
mov i1.r, i0.r

if i1.r

set f5.r, 0
sub f5.r, f4.g
rcp f5.g, f0.b
mul f5.g, f5.r
ceil f5.r, f5.g
mov f5.g, f0.b
mul f5.g, f5.r
add f4.g, f5.g

end

set f5.r, 0
ltn f4.b, f5.r
mov i1.r, i0.r

if i1.r

set f5.r, 0
sub f5.r, f4.b
rcp f5.g, f0.a
mul f5.g, f5.r
ceil f5.r, f5.g
mov f5.g, f0.a
mul f5.g, f5.r
add f4.b, f5.g

end

ltn f0.b, f4.g
mov i1.r, i0.r

if i1.r

rcp f5.r, f0.b
mul f5.r, f4.g
floor f5.g, f5.r
mov f5.r, f0.b
mul f5.r, f5.g
sub f4.g, f5.r

end

ltn f0.a, f4.b
mov i1.r, i0.r

if i1.r

rcp f5.r, f0.a
mul f5.r, f4.b
floor f5.g, f5.r
mov f5.r, f0.a
mul f5.r, f5.g
sub f4.b, f5.r

end

mov f5.r, f4.g
mov f5.g, f4.b
texn f6, f5.rg, t0
mov f1, f6
mul f1.rgb, f4.aaa
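
The four if/end blocks near the end of the listing are the least obvious part. Read with the assumption that two-operand instructions mean "destination = destination op source" and that ltn a, b leaves the result of a < b in i0 (the syntax is undocumented, so this is an interpretation), they appear to wrap the sampling coordinate back into the image. Here is a sketch of that logic for one coordinate in ActionScript 3. The class and function names are made up.

```actionscript
package {
    import flash.display.Sprite;

    // One-coordinate model of the wrap-around blocks in the disassembly.
    public class WrapModel extends Sprite {
        public function WrapModel() {
            trace(wrap(-10, 512)); // 502
            trace(wrap(600, 512)); // 88
            trace(wrap(100, 512)); // 100
        }

        public static function wrap(v:Number, size:Number):Number {
            // ltn v, 0 ; if ... sub, rcp, mul, ceil, mul, add ... end
            if (v < 0) {
                v += size * Math.ceil(-v / size);
            }
            // ltn size, v ; if ... rcp, mul, floor, mul, sub ... end
            if (size < v) {
                v -= size * Math.floor(v / size);
            }
            return v;
        }
    }
}
```

The assembly multiplies by a reciprocal (rcp followed by mul) where this sketch divides, so results can differ in the last bits of floating point precision.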

On Performance

With the release of Google Chrome I see blogs and articles blaming the Flash Player for poor performance and somehow linking this to the fact that it is not open source. Time to clarify a few bits. I’ll start with classic comments:

“Flash hogs my CPU!”

1. HTML != Flash

HTML is a static document format. Flash (TM) content is in its core a classic multimedia format and most Flash content is still purely passive media.

What does that mean? When rendering HTML pages CPU usage only peaks for a very short amount of time, essentially one single frame in Flash terms. After that almost no resources apart from memory are required. If you do not interact with the HTML page at all, no CPU time is required.

How does Flash compare? Most animated Flash content like rich media advertisement continues to use CPU resources to drive animation, video and/or sound. As opposed to static HTML which has exactly 1 frame, Flash content can have an infinite amount of frames which are played back over time.

Flash is great to provide experiences you could not get otherwise. Animation, video and sound are functions the browser does not (yet) provide, or at least they are not used to the same extent yet by designers. Once the browser is used to play the same type of multimedia content you will face the same resource usage issues. It takes CPU cycles to decode video, sound and render animation. This is just a fact of life. We are however reducing how much is used release after release, something benchmarks can back up.

So, there is a fundamental difference in media type. HTML is static, Flash is not. To put it in terms you might be able to understand:

If you take a picture and print it out you use energy only once and can then continue to view the picture forever without consuming any further energy. If you record a movie you will need some form of machine to play it back, a projector, which will continue to consume energy. The Flash Player is a projector.

“You are so full of it, AJAX does not hog my CPU!”

2. AJAX != Flash, but when done correctly AJAX can be the same as Flash

In most practical instances AJAX is used to drive RIAs. Examples include Gmail, Google Maps and many others. One fundamental property of good applications is that they only respond to either network activity or user input. Peak CPU usage is limited to these events. In general, if you do not touch the browser page no CPU time is required.

Compare this again to Flash animations, video and sound which in many cases remain passive experiences with no requirement for external events to drive the content. This will obviously use CPU resources continuously.

Now, it is perfectly possible to implement a Flash RIA (that usually means using Flex) which uses the same or even less peak CPU than an AJAX RIA and only responds to network and user input. Flash is a flexible multiple paradigm platform, it depends on what the designer/programmer wanted to do. Unfortunately we at Adobe tend to see a lot of RIAs which do not follow that principle and add lots of moving sugar to their applications which does little to improve actual usability.

Following good coding practices Flash can yield equivalent or better results than AJAX for many types of RIAs. Another benefit is that writing RIAs in Flash is truly cross browser as there is one Flash Player implementation only.

“Bull, SVG and Canvas show that it can be done better”

3. SVG/Canvas != Flash

Have you ever seen SVG or the canvas tag being used to implement anything other than static (1-frame) content? Have you ever seen rich media advertisement done using SVG or the Canvas tag? I mean not some demo page but actual deployed content. If so you will realize that the same resource usage issues apply.

“You are clueless, why does Flash suck up CPU time when it is on a hidden tab?”

4. Easy shortcuts do not work

Believe it or not but we and the browser vendors have tried to disable/pause/stop Flash content when a tab is hidden. The results were disastrous user experience wise to say the least. Disabling Flash to get any benefit CPU resource wise means the following:

  • Sound will have to stop
  • Any network transfer will have to stop
  • ActionScript execution will have to stop

Each one of these affects CPU resource usage and would affect user experience if we turned it off. However the Flash Player does not render anything if it is on a hidden tab, we only execute the operations mentioned in the above list.

There is one exception to the rendering optimization: WMODE. If you use WMODE the Flash Player has no way of knowing if it is hidden or not and will continue to do a full render. Do not use WMODE. Unfortunately lots of rich media advertisement I see out there continues to enable this for no apparent reason.

“Flash sucks!”

5. You can help to educate web designers so common mistakes are not made

Huge help would be to adopt strict policies especially for rich media advertisement. I like the rules Google has put forward for Flash ads. Quoting:

“Animation Length: Animated ads are restricted to a maximum of 15 seconds (at a 15-20 fps frame rate), after which point they must remain static. These ads must also comply with the other animation policies.”

Personally I would go even further and request the following:

  • After the animation has played no CPU resources should be used, ActionScript should be on a stop() command.
  • Mouse tracking or other event handling is not allowed unless you activate the banner with a mouse click.
  • DO NOT USE WMODE UNLESS YOU ABSOLUTELY NEED TRANSPARENCY! I can’t stress that enough. Given the architecture of plugins there is no way for the Flash Player to know if Flash content is on a hidden tab or not and disable rendering properly. If you use WMODE the Flash Player will continue to suck up CPU cycles as if the tab was visible. In addition WMODE is much slower than the normal mode.
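
As an illustration of the first rule, here is a minimal ActionScript 3 sketch of a banner that animates for a fixed time and then stops running any per-frame code. The class name, the 15 second budget constant and the placeholder motion are hypothetical. On a timeline-based MovieClip, calling stop() on the last frame plays the same role.

```actionscript
package {
    import flash.display.Sprite;
    import flash.events.Event;

    // Sketch of a banner that animates for a fixed time and then goes idle.
    public class BannerAnimation extends Sprite {
        private static const MAX_SECONDS:int = 15; // hypothetical budget
        private var framesLeft:int;

        public function BannerAnimation() {
            // The stage is only available once the banner is on the display list.
            if (stage) init();
            else addEventListener(Event.ADDED_TO_STAGE, init);
        }

        private function init(e:Event = null):void {
            removeEventListener(Event.ADDED_TO_STAGE, init);
            // Frame budget: seconds * frames per second,
            // e.g. 15 s * 20 fps = 300 frames.
            framesLeft = Math.ceil(MAX_SECONDS * stage.frameRate);
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        private function onEnterFrame(e:Event):void {
            x += 1; // placeholder for the actual animation step
            if (--framesLeft <= 0) {
                // After this point no per-frame ActionScript runs.
                removeEventListener(Event.ENTER_FRAME, onEnterFrame);
            }
        }
    }
}
```

Timers, sounds and mouse listeners started by the banner would need the same treatment, otherwise they keep consuming CPU after the animation has ended.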

These simple rules would address almost all the complaints we hear about. Adobe has unfortunately only limited influence on what content gets deployed, in this case it is really up to the community to balk at the web sites putting up content which impacts user experience negatively.

——————–

Like with any powerful technology it is easy to shoot yourself in the foot and with the ease of use of Flash that is unfortunately too common.

Despite that we are working with all browser vendors to improve performance and user experience whenever possible. There are differences between browsers and our goal is to close this gap once and for all. We are for example looking forward to working together with Google to improve Flash performance in Google Chrome.

On our (Adobe) side we are also looking forward to improving Flash performance further. Flash Player 10 for instance is making the first steps towards hardware accelerated rendering, which will provide a huge boost in rendering performance. On the scripting side Tamarin-tracing will improve scripting performance dramatically. This is work we share with the Mozilla Foundation, which will use the same core libraries under the TraceMonkey project. The latest benchmarks are quite remarkable.

Adobe Flash Player 10 pre-release refresh

We just released another pre-release of Flash Player 10 (build 10.0.1.525). Go get it here and make sure you read the release notes. As a reminder, as we are nearing the release it becomes increasingly difficult for us to address bugs, especially if they are not crashers. If you have backwards compatibility issues (and I almost guarantee you that there will be some which will affect your content) please report them here (registration required) or here (no registration required).

There have been numerous stability and performance improvements. The most important additions are support for WMODE=transparent and V4L2 cameras (which is still work in progress) on Linux which addresses two of the top 3 feature requests on this platform.

If you have followed GUIMark at all you will notice that this version of the player runs this benchmark substantially better on OSX than any previous Flash Player version. It should be up to 3 times faster. How will this affect you? Well, OSX device text rendering got a huge performance boost. If you use lots of device text you will see a big difference. I posted more details in a comment here

Finally the dynamic sound APIs have slightly changed as I announced previously. I will be updating my posts [1][2] later today.

Adobe Pixel Bender in Flash Player 10 Beta

Lee Brimelow has posted a snippet of code showing how to use Adobe Pixel Bender kernels in Flash Player 10. Time for me to go into details about this feature. As usual, this feature holds some surprises and unexpected behavior. I’ll keep this post without any sample code, but I promise to show some samples soon.

A long time ago, back in Flash Player 8 days we had the idea of adding a generic way to do bitmap filters. Hard coding bitmap filters like we did for Flash Player 8 is not only not flexible, but has the burden of adding huge amounts of native code into the player and having to optimize it for each and every platform. The issue for us has always been how you would author such generic filters. Various ideas were floating around but in the end there was one sticking point: we had no language and no compiler. After Macromedia’s merger with Adobe the Flash Player and the Adobe Pixel Bender team came together and we finally had what we needed: a language and a compiler.

The Pixel Bender runtime in the Flash Player is drastically different from what you find in the Adobe Pixel Bender Toolkit. The only connection the Flash Player has is the byte code the toolkit generates, in files with the .pbj extension. A .pbj file contains a binary representation of the opcodes/instructions of your Pixel Bender kernel, much the same way a .swf contains ActionScript 3 byte code. The byte code itself is designed to translate well into a number of different run times, but for this Flash Player release the focus was a software run time.

You heard right, software run time. Pixel Bender kernels do not run using any GPU functionality whatsoever in Flash Player 10.

Take a breath. :-)

Running filters on a GPU has a number of critical limitations. If we had supported the GPU to render filters in this release we would have had to fall back to software in many cases, even if you have the right hardware. And then there is the little issue that we would only have enabled this in the ‘gpu’ wmode. So it is critical to have a well performing software fallback; and I mean one which does not suck like some other frameworks which we tried first (and which I will not mention by name). A good software implementation also means you can reach more customers who simply do not have the required hardware, which is probably 80-90% of the machines connected to the web out there. Lastly, this is the only way we can guarantee somewhat consistent results across platforms, although I have to point out that you’ll see differences which are the result of compromises to get better performance.

So why did we not just integrate what the Adobe Pixel Bender Toolkit does, which does support GPUs? First, we need to run on 99% of all the machines out there, down to a plain Pentium I with MMX support running at 400MHz. Secondly, I would hate to see the Flash Player installer grow by 2 or 3 megabytes in download size. That’s not what the Flash Player is about. The software implementation in Flash Player 10 as it stands now clocks in at about 35KB of compressed code. I am perfectly aware that some filters would get faster by two orders of magnitude(!) on a GPU. We know that too well and for this release you will have to deal with this limitation. The important thing to take away here is: a kernel which runs well in the toolkit might not run well at all in the Flash Player.

But… I have more news you might not like. ;-) If you ever run a Pixel Bender filter on a PowerPC-based Mac you will see that it runs about 10 times slower than on an Intel-based Mac. For this release we only had time to implement a JIT code engine for Intel-based CPUs. On a PowerPC Mac Pixel Bender kernels will run in interpreted mode. I leave it up to you to judge how this will affect you. All I can say: be careful when deploying content using Pixel Bender filters, and know your viewers.

Now for some more technical details: the JIT for Pixel Bender filters in Flash Player 10 supports various instruction sets, down to plain x87 floating point math and up to SSE2 for some operations like texture sampling, which usually take the most time. Given that these filters work like shaders, i.e. they are embarrassingly parallel, running Pixel Bender kernels scales linearly with the number of CPUs/cores you have in your machine. On an 8-core machine you will usually be limited by memory bandwidth. Here is a CPU readout on my MacPro when I run a filter on a large image (3872×2592 pixels):

There are 4 different ways of using Pixel Bender kernels in the Flash Player. Let me start with the most obvious ones and work down to the more interesting case:

  • Filters. Use a Pixel Bender kernel as a filter on any DisplayObject. Obvious.
  • Fill. Use a Pixel Bender kernel to define your own fill type. Want a fancy star shaped high quality gradient? A nice UV gradient? Animated fills? No problem.
  • Blend mode. Not happy with the built-in blend modes? Simply build your own.

What about the 4th? Well, as you can see the ones in the list are designed for graphics only. The last one is more powerful than that. Instead of targeting a specific graphics primitive in the Flash Player, you can target BitmapData objects, ByteArrays or Vectors. Not only that but if you use ByteArray or Vector the data you handle are 32-bit floating point numbers for each channel, unlike BitmapData which is limited to 8-bit unsigned integers per channel. In the end this means you can use Pixel Bender kernels to not only do graphics stuff, but generic number crunching. If you can accept the 32-bit floating point limitation.
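To make the "32-bit floating point limitation" and the 8-bit BitmapData limit a bit more tangible, here is a small numeric sketch. It is written in TypeScript so it can be run outside the player, and it only models how values are stored; the helper names `asFloat32` and `asUint8Channel` are made up for this illustration and are not Flash Player APIs.

```typescript
// Store a value the way a 32-bit float channel would (the ByteArray/Vector case).
function asFloat32(value: number): number {
  return new Float32Array([value])[0];
}

// Store a normalized 0..1 value the way an 8-bit unsigned channel would
// (the BitmapData case): clamp, then snap to one of 256 levels.
function asUint8Channel(value: number): number {
  const clamped = Math.min(1, Math.max(0, value));
  return Math.round(clamped * 255) / 255;
}

const exactInDouble = 16777217;                 // 2^24 + 1
const storedAsFloat = asFloat32(exactInDouble); // 16777216: the +1 is lost
const storedAs8Bit = asUint8Channel(0.5004);    // snaps to 128/255
const hdrAsFloat = asFloat32(1.5);              // values above 1.0 survive as floats
const hdrAs8Bit = asUint8Channel(1.5);          // but are clamped to 1 in 8 bits
```

So floats give you range and far finer steps than 8-bit channels, but integers above 2^24 and long accumulations will lose precision, which matters if you use kernels for generic number crunching.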

This 4th way of using Pixel Bender kernels runs completely separate from your main ActionScript code. It runs in a separate thread, which allows you to keep your UI responsive even if a Pixel Bender kernel takes a very long time to complete. This works much like a URLLoader. You send a request with all the information, including your source data, output objects, parameters etc., and a while later an event is dispatched telling you that it is finished. This will be great for any application which wants to do heavy processing.

In my next post I will show some concrete examples of how you would use Pixel Bender kernels in these different scenarios. For now I’ll let this information sink in.

What follows are a few random technical nuggets I noted in my specification when it comes to the implementation in the Flash player, highly technical but important to know if you are pushing the limits of this feature:

  • The internal RGB color space of the Flash Player is alpha pre-multiplied and that is what the Pixel Bender kernel gets.
  • Output color values are always clamped against the alpha. This is not the case when the output is a ByteArray or Vector.
  • The maximum native JIT code buffer size for a kernel is 32KB. If you hit this limit, which can happen with complex filters, the Flash Player falls back to interpreted mode like it does in all cases on PowerPC-based machines.
  • You can freely mix linear and nearest sampling in your kernel.
  • Maximum coordinate range is 24bit, that means for values outside the range of -4194304..4194303 coordinates will wrap when you sample and not clamp correctly anymore.
  • The linear sampler does sample up to 8bit of sub pixel information, meaning you’ll get a maximum of 256 steps. This is also the case if you sample from a ByteArray with floating point data.
  • Math functions apart from simple multiplication, division, addition and subtraction work slightly differently on different platforms, depending on the C-library implementation or CPU.
  • In the Flash Player 10 beta vecLib is used on OSX for math functions. Slightly different results on OSX are the result. This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC)
  • The JIT does not do intelligent caching. In the case of fills that means that each new fill will create a new code section and rejit.
  • There are usually 4 separate JIT’d code sections which each handle different total pixel counts, from 1 pixel to 4 pixels at a time. This is required for anti-aliasing as the rasterizer works with single pixel buffers in this case.
  • When an if-else-endif statement is encountered, the JIT switches to scalar mode, i.e. the if-else-endif section will be expanded up to 4 times as scalar code. Anything outside of an if-else-endif block is still processed as vectors. It’s best to move sampling outside of if statements if practical. The final write to the destination is always vectorized.
  • The total amount of JIT’d code is limited by the virtual address space. Each code section reserves 128Kbytes (4*32KB) of virtual address space.
  • The first 4 pixels rendered by every instance of a shader are run in interpreted mode; the native code generation is done during that first run. You might get artifacts if you depend on limit values, as the interpreted mode uses different math functions. If you are on a multicore system, every new span rendered will create a new instance of a shader, i.e. the code is JIT’d 8*4 times on an 8-core system. This way the JIT works completely without locks.
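The first two nuggets in the list (pre-multiplied alpha and clamping against the alpha) are easy to get wrong, so here is a rough numeric sketch of what they imply. It is TypeScript rather than Pixel Bender, and it rests on an assumption: that "clamped against the alpha" means each color channel is limited to the range 0..alpha, which is the valid range for pre-multiplied colors. The type and function names are invented for the illustration.

```typescript
// Color with channels in the 0..1 range.
type Rgba = { r: number; g: number; b: number; a: number };

// Convert a straight-alpha color into pre-multiplied form,
// which is what a kernel receives as input.
function premultiply(c: Rgba): Rgba {
  return { r: c.r * c.a, g: c.g * c.a, b: c.b * c.a, a: c.a };
}

// Assumed output rule: alpha is limited to 0..1 and every color
// channel is limited to 0..alpha.
function clampAgainstAlpha(c: Rgba): Rgba {
  const a = Math.min(1, Math.max(0, c.a));
  const limit = (v: number): number => Math.min(a, Math.max(0, v));
  return { r: limit(c.r), g: limit(c.g), b: limit(c.b), a: a };
}

// 50% transparent pure red arrives in the kernel as r = 0.5, not r = 1.
const input = premultiply({ r: 1, g: 0, b: 0, a: 0.5 });

// A kernel that naively triples the red channel produces r = 1.5 ...
const kernelOutput: Rgba = { r: input.r * 3, g: input.g, b: input.b, a: input.a };

// ... which ends up as r = 0.5 again once clamped against the alpha.
const stored = clampAgainstAlpha(kernelOutput);
```

Under that assumption, a brightening kernel applied to semi-transparent pixels has no visible effect unless it also raises the alpha or works on un-multiplied values first. With ByteArray or Vector output no such clamp applies.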

What does GPU acceleration mean?

The just released Adobe® Flash® Player 10 beta version includes two new window modes (wmode) which control how the Flash Player pushes its graphics to the screen.

Traditionally there have been 3 modes:

normal: In this mode we are using plain bitmap drawing functions to get our rasterized images to the screen. On Windows that means using BitBlt to get the image to the screen; on OSX we are using CopyBits, or Quartz2D if the browser supports it.

transparent: This mode tries to do alpha blending on top of the HTML page, i.e. whatever is below the SWF will show through. The alpha blending is usually fairly expensive in terms of CPU resources, so it is advised not to use this mode in normal cases. In Internet Explorer this code path does not actually go through BitBlt; it uses a DirectDraw context provided by the browser into which we composite the SWF.

opaque: Somewhat esoteric, but it is essentially like transparent, i.e. it is using DirectDraw in Internet Explorer. But instead of compositing the Flash Player just overwrites whatever is in the background. This mode behaves like normal on OSX and Linux.

Now to the new modes:

direct: This mode tries to use the fastest path to screen, or direct path if you will. In most cases it will ignore whatever the browser would want to do to have things like overlapping HTML menus or such work. A typical use case for this mode is video playback. On Windows this mode is using DirectDraw or Direct3D on Vista, on OSX and Linux we are using OpenGL. Fidelity should not be affected when you use this mode.

gpu: This is fully fledged compositing (+some extras) using some functionality of the graphics card. Think of it being similar to what OSX and Vista do for their desktop managers, the content of windows (in flash language that means movie clips) is still rendered using software, but the result is composited using hardware. When possible we also scale video natively in the card. More and more parts of our software rasterizer might move to the GPU over the next few Flash Player versions, this is just a start. On Windows this mode uses Direct3D, on OSX and Linux we are using OpenGL.

Now to the tricky part, things which will cause endless confusion if not explained:

1. Just because the Flash Player is using the video card for rendering does not mean it will be faster. In the majority of cases your content will become slower.

Confused yet? Good, that means you have the same understanding of what GPU support means as everyone else.

Content has to be specifically designed to work well with GPU functionality. The software rasterizer in the Flash Player can optimize a lot of cases the GPU cannot optimize, so you as the designer will have to be aware of what a GPU does and adapt your content accordingly. I realize this statement is useless unless we can provide guidance, something we can hopefully achieve in the not too distant future.

2. The hardware requirements for the GPU mode are stiff. You will need at least a DirectX 9 class card. We essentially have the exact same hardware requirements as Windows Vista with Aero Glass enabled. Aero Glass uses the exact same hardware functionality we do. So if Aero Glass does not work well on your machine the Flash Player will likely not be able to run well either in GPU mode (but to clarify, you do NOT need Aero Glass for the GPU mode to work in the Flash Player, I am merely talking about hardware requirements here).

3. Pixel fidelity is not guaranteed when you use the GPU mode. You have to expect that content will look different on different machines; even colors might not match perfectly. This includes video. Future Flash Players will change the look of your content in this mode. We will try our best to limit the pain but please bear in mind that in many cases we have no control over this.

Here is an example: the left shows it running using the new gpu mode, the right using the normal mode. This is a video which is 320×240 pixels large showing red text, and as you notice the gpu mode arguably looks better as the hardware does UV blending:

The downside in this specific case is that the edges of the text are not as crisp anymore.

4. Frame rates will max out at the screen refresh rate. So whatever frame rate you set in your Flash movie is meaningless if it is higher than 60. This is the case for both the ‘direct’ and ‘gpu’ mode. In most cases you should end up at a frame rate of around 50-55 due to dropped frames which occur from time to time for various reasons.

5. Please do not blindly enable either new mode (gpu or direct) in your content. Creating a GPU based context in the browser is very expensive and will drain memory and CPU resources to the point where the browser will become unresponsive. It is usually best practice to limit yourself to one SWF per HTML page using these modes. The target should be content taking over most of the page and doing full frame changes like video. Never ever, ever enable this for banners. Plain Flex applications should not use these modes either if they are not doing full screen refreshes.

6. GPU functionality ties us together with the video card manufacturers and their drivers. Given that, you can expect that a significant number of customers will not be able to view your content if you enable this mode, due to driver incompatibilities and various defects in the software stack.

Finally, this beta version of the Flash Player is not yet tuned for maximum performance in the gpu mode. We are making progress, but all the above points will still apply in the long term.

The ‘direct’ mode should never make your content slower, except in respect to point 4 I made. It should either not change anything or lower CPU consumption somewhat with very large content, i.e. something larger than 1024×768 pixels.

What we’d ask you to do though is to give this a test drive. This is completely new land for us and we expect to encounter lots of obstacles, meaning bugs.

Adobe Is Making Some Noise Part 3

[Update: I have updated the code sample to match the API changes in build 10.0.1.525]

Along with the new event to drive dynamically generated audio playback there is one gaping hole: Where do you get source audio from? Sure, you can load raw audio from external sources through ByteArray but that is less than efficient. What you really need is a way to have sound assets you can access from your ActionScript code.

In Flash Player 10, code named Astro, the Sound object will have one more method which is designed to work together with the “sampleData” event handler. It will extract raw sound data from an existing sound asset. That means any mp3 file you have in the library or load externally can be accessed and processed. Let’s look at an actual code example. I am passing through audio from an existing sound object to another sound object doing the dynamic audio playback:


import flash.events.Event;
import flash.events.SampleDataEvent;
import flash.media.Sound;
import flash.net.URLRequest;
import flash.utils.ByteArray;

var mp3sound:Sound = new Sound();        // source asset
var dynamicSound:Sound = new Sound();    // plays the dynamically supplied audio
var samples:ByteArray = new ByteArray(); // scratch buffer for extracted samples

function sampleData(event:SampleDataEvent):void {
    samples.position = 0;
    var len:Number = mp3sound.extract(samples, 1777);
    if (len < 1777) {
        // seamless loop: fetch the remainder from the start of the sound
        len += mp3sound.extract(samples, 1777 - len, 0);
    }
    samples.position = 0;
    for (var c:int = 0; c < len; c++) {
        var left:Number = samples.readFloat();
        var right:Number = samples.readFloat();
        event.data.writeFloat(left);
        event.data.writeFloat(right);
    }
}

// Start the dynamic playback only once the mp3 has finished loading.
function loadCompleteMP3(event:Event):void {
    dynamicSound.addEventListener("sampleData", sampleData);
    dynamicSound.play();
}

mp3sound.addEventListener(Event.COMPLETE, loadCompleteMP3);
mp3sound.load(new URLRequest("sound.mp3"));

Notice the extract() call here. This function will extract raw sound data from any existing sound object. The format returned is always 44100 Hz stereo, and the number format is 32-bit floating point, which you can read and write using ByteArray.readFloat and ByteArray.writeFloat. The floating point values are normalized between -1.0 and 1.0. Here is the full prototype of this function:


function extract(target:ByteArray,
                 length:Number,
                 startPosition:Number = -1):Number;

  • target: A ByteArray object in which the extracted sound samples should be placed.
  • length: The number of sound samples to extract. A sample contains both the left and right channels — that is, two 32-bit floating point values.
  • startPosition: The sample at which extraction should begin. If you don’t specify a value, the first call to extract() starts at the beginning of the sound; subsequent calls without a value for startPosition progress sequentially through the sound.
  • extract() returns the number of samples which could be retrieved. This might be less than the length you requested at the very end of a sound.
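The sequential behavior of extract() and the "seamless loop" trick from the example are easier to follow with a small model. The sketch below is TypeScript, not ActionScript, and `MockSound` is a hypothetical stand-in I made up that only mimics the behavior described in the list above (sequential reads, an optional start position, and a short count at the end of the sound). It is not the Flash Player implementation.

```typescript
// Hypothetical stand-in for a Sound: a list of stereo frames [left, right].
class MockSound {
  private cursor = 0;

  constructor(private readonly frames: ReadonlyArray<[number, number]>) {}

  // Appends up to `length` frames to `target` and returns how many were
  // written. A startPosition >= 0 repositions the read cursor first;
  // otherwise reading continues where the previous call stopped.
  extract(target: number[], length: number, startPosition: number = -1): number {
    if (startPosition >= 0) {
      this.cursor = startPosition;
    }
    const available = Math.max(0, this.frames.length - this.cursor);
    const count = Math.min(length, available);
    for (let i = 0; i < count; i++) {
      const frame = this.frames[this.cursor + i];
      target.push(frame[0], frame[1]);
    }
    this.cursor += count;
    return count;
  }
}

// Same idea as the handler above: if the sound ends mid-request,
// restart at sample 0 and fetch the remainder (a single wrap-around).
function extractLooped(sound: MockSound, target: number[], length: number): number {
  let got = sound.extract(target, length);
  if (got < length) {
    got += sound.extract(target, length - got, 0);
  }
  return got;
}

// A five-frame "sound" whose sample values equal their frame index.
const tiny = new MockSound([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]);
const firstChunk: number[] = [];
const secondChunk: number[] = [];
const firstCount = extractLooped(tiny, firstChunk, 3);   // frames 0, 1, 2
const secondCount = extractLooped(tiny, secondChunk, 4); // frames 3, 4, then 0, 1
```

Like the original handler, this wraps only once per request, so it assumes the sound is longer than the number of samples requested per event.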

With what we provide in Flash Player 10 we hope that we are addressing the most pressing needs of what you want to do with sound. It will likely be just a matter of time until we see high level frameworks done by the community on the magnitude of something like Papervision3D. The next couple of years should be very interesting indeed when it comes to sound on the web.

What’s missing? Unfortunately some features did not make it into Flash Player 10: Extracting audio data from a microphone and extracting audio from a NetStream object. We are aware that both features are highly desirable, but for various reasons it was not possible to make this happen in this release.

Adobe Is Making Some Noise Part 2

[Update: I have updated the code sample to match the API changes in build 10.0.1.525]

The public beta of Adobe® Flash® Player 10, code named Astro, has been released. When you read the release notes you’ll notice a small section talking about audio:

“Dynamic Sound Generation — Dynamic sound generation extends the Sound class to play back dynamically created audio content through the use of an event listener on the Sound object.”

Yes, in Flash Player 10 you will be able to dynamically create audio. It’s not an all powerful API, it is designed to provide a low level abstraction of the native sound driver, hence providing the most flexible platform to build your music applications on. The API has one big compromise which I can’t address without large infrastructural changes and that is latency. Latency is horrible to the point where some applications will simply not be possible. To improve latency will require profound changes in the Flash Player which I will tackle for the next major revision. But for now this simple API will likely change the way you think about sound in the Flash Player.

Programming dynamic sound is all about how quickly and consistently you can deliver the data to the sound card. Most sound cards work using a ring buffer, meaning you as the programmer push data into that ring buffer while the sound card feeds from it at the same time. The high level APIs to deal with this revolve around two concepts: 1. the device model 2. the interrupt model.

For model 1 we run in a loop (usually in a thread) and write sound data to the device. The write will block if the ring buffer is full. The loop continues until the sound ends. This is the most common method of playing back sound on Unix-like systems such as Linux.

In model 2 we have a function which is called by the system (usually from an interrupt on older systems) in which the application fills part of the ring buffer. The callback function is called whenever the sound card hits a point where it runs low on samples in the ring buffer. In an OS without real threading this is usually the only way to make sound playback work. MacOS9 or older and Windows 98 or older were using this system, and OSX continues to provide a way to do this in CoreAudio. As ActionScript has no threading it is advisable to use this model. We could use frame events to implement a loop, but that would represent an odd programming model.
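For readers who have not worked with sound drivers, the interrupt model can be sketched in a few lines. This is a simplified TypeScript model, not driver code and not what the Flash Player does internally: `RingBuffer`, `serviceBuffer` and the low-water-mark parameter are names and choices I made up for the illustration.

```typescript
// Fixed-size ring buffer: the application writes, the "sound card" reads.
class RingBuffer {
  private readonly data: Float32Array;
  private readIndex = 0;
  private writeIndex = 0;
  private count = 0;

  constructor(capacity: number) {
    this.data = new Float32Array(capacity);
  }

  available(): number { return this.count; }                // samples ready to play
  free(): number { return this.data.length - this.count; }  // room left to fill

  // Writes as many samples as fit and returns how many were accepted.
  write(samples: number[]): number {
    const n = Math.min(samples.length, this.free());
    for (let i = 0; i < n; i++) {
      this.data[this.writeIndex] = samples[i];
      this.writeIndex = (this.writeIndex + 1) % this.data.length;
    }
    this.count += n;
    return n;
  }

  // Removes and returns up to n samples, oldest first.
  read(n: number): number[] {
    const m = Math.min(n, this.count);
    const out: number[] = [];
    for (let i = 0; i < m; i++) {
      out.push(this.data[this.readIndex]);
      this.readIndex = (this.readIndex + 1) % this.data.length;
    }
    this.count -= m;
    return out;
  }
}

// Interrupt model: when the buffer runs low, the system calls back into
// the application and asks it to provide more samples.
function serviceBuffer(
  buffer: RingBuffer,
  lowWaterMark: number,
  fill: (wanted: number) => number[]
): void {
  if (buffer.available() < lowWaterMark) {
    buffer.write(fill(buffer.free()));
  }
}
```

In the device model the application would instead call write() in its own loop and wait whenever free() reaches zero; in the interrupt model the system decides when the fill callback runs, which is the shape the new Sound event follows.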

Flash Player 10, code named Astro, supports a new event on the Sound object: “sampleData” (called “samplesCallback” in earlier beta builds). It will be dispatched at regular intervals, requesting more audio data. In the event handler you will have to fill a given ByteArray (the data property of the event) with a certain amount of sound data. The amount is variable, from 512 samples to 8192 samples per event. That is something you decide on and is a balance between performance and latency in your application. The less data you provide per event the more overhead is spent in the Flash Player. The more data you provide the longer the latency for your application will be. If you just play continuous audio we suggest using the maximum amount of data per event, as the difference in overall performance can be quite large.

I should note that this API will slightly change (mostly name changes only) in the final release of the Flash Player; this beta represents an older build. I’ll update this post with new code once the API is finalized.

Now some real code, some of you on the beta program have seen it. Here is how you play a continuous sine wave in Flash Player 10 with the smallest amount of code:


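import flash.events.SampleDataEvent;
import flash.media.Sound;

var sound:Sound = new Sound();

function sineWavGenerator(event:SampleDataEvent):void {
    for (var c:int = 0; c < 1234; c++) {
        // event.position is the absolute sample index at which this chunk starts
        var sample:Number = Math.sin(Number(c + event.position) / Math.PI / 2) * 0.25;
        // write the same value to the left and the right channel
        event.data.writeFloat(sample);
        event.data.writeFloat(sample);
    }
}

sound.addEventListener("sampleData", sineWavGenerator);
sound.play();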

That’s it. That simple. You can’t get any more low level or flexible than this. The sample above is simple, actual code would probably not call Math.sin() in the inner loop. You would rather prepare a ByteArray or Array outside and copy the data from there.
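To illustrate the remark about not calling Math.sin() in the inner loop, here is one possible way to precompute a wave table and copy from it. The sketch is in TypeScript so it runs outside the player; `buildSineTable` and `fillChunk` are invented names, and the period of 100 samples (441 Hz at 44100 Hz) is an arbitrary choice so that exactly one cycle fits the table. It is not the same frequency as the sample above.

```typescript
const PERIOD = 100;     // hypothetical: one sine cycle every 100 samples
const AMPLITUDE = 0.25; // same level as the sample above

// Compute one full cycle once, outside the audio callback.
function buildSineTable(period: number, amplitude: number): Float32Array {
  const table = new Float32Array(period);
  for (let i = 0; i < period; i++) {
    table[i] = Math.sin((2 * Math.PI * i) / period) * amplitude;
  }
  return table;
}

const sineTable = buildSineTable(PERIOD, AMPLITUDE);

// Stand-in for the event handler body: `position` plays the role of
// event.position, and `out` the role of event.data (left and right
// values are pushed interleaved).
function fillChunk(position: number, frameCount: number, out: number[]): void {
  for (let c = 0; c < frameCount; c++) {
    const sample = sineTable[(position + c) % PERIOD];
    out.push(sample, sample);
  }
}
```

Because the table index is derived from the absolute sample position, consecutive chunks join without a click even when the chunk size is not a multiple of the period.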

The sound format is fixed at a sample rate of 44100Hz, 2 channels (stereo) and 32-bit floating point normalized samples. This is currently the highest quality format possible within the Flash Player. We will be targeting a more flexible system in a future Flash Player. It was not possible to offer different sample rates and more or fewer channels in this version. If you need to resample you can either use pure ActionScript 3 or even Adobe Pixel Bender.

The SampleDataEvent.position property which is passed in is the sample position, not the time, of the segment of audio which is being requested. You can convert this value to milliseconds by dividing it by 44.1.

Your event handler has to provide at least 512 samples each time it is dispatched, and at most 8192. If you provide fewer, the Flash Player assumes that you reached the end of the sound, plays the remaining samples and dispatches a SOUND_COMPLETE event. If you provide more than 8192 an exception occurs. In the sample above I use 1234 to make it clear that it can be any value between 512 and 8192.
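The numbers above translate directly into time. Here is the arithmetic as a tiny TypeScript sketch; the function names are my own, and the results only cover the audio carried by each event, not the internal buffering of the player.

```typescript
const SAMPLE_RATE = 44100; // fixed output rate in Hz

// Sample position -> milliseconds (the same as dividing by 44.1).
function samplesToMilliseconds(samples: number): number {
  return (samples * 1000) / SAMPLE_RATE;
}

// How often the handler has to run for a given chunk size.
function eventsPerSecond(samplesPerEvent: number): number {
  return SAMPLE_RATE / samplesPerEvent;
}

const smallestChunkMs = samplesToMilliseconds(512);  // about 11.6 ms of audio
const largestChunkMs = samplesToMilliseconds(8192);  // about 185.8 ms of audio
const smallestChunkRate = eventsPerSecond(512);      // about 86 events per second
const largestChunkRate = eventsPerSecond(8192);      // about 5.4 events per second
```

That is the trade-off in numbers: the smallest chunk needs roughly 16 times as many handler invocations per second as the largest one, while the largest chunk adds close to a fifth of a second before newly generated audio can be heard.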

The event will be called in real time. That means you can inject new audio data interactively. The key part to understand here is that we are not dealing with large amounts of sound data at any given time.

There is an internal buffer in the Flash Player which is about 0.2 to 0.5 seconds depending on the platform which is preventing drop outs. It will automatically be increased if drop outs occur. This internal buffer is the key for the high latency I was alluding to earlier. You should never depend on a certain latency with this API in your application. To enforce this there is a slight random factor in choosing this buffer size when the Flash Player launches.

Continue to read Part 3 which talks about one more new Sound API in Flash Player 10.