Flash to iOS Performance Tests

24.10.2011

Last week, I sat down to do some intense Flash to iOS performance tests to get an impression of what the best approaches for porting Save the Maidens to iOS would be. Besides testing a couple of tweaks that were supposed to improve iOS performance, I wanted first-hand proof of whether pure blitting actually is the holy performance grail it is proclaimed to be when it comes to porting animations.

As described below, I came up with another approach this spring which works with bitmaps as well, but instead of copying pixels around, it simply slices up the initial sprite sheet containing all animation states and assigns the sliced BitmapDatas to bitmaps on stage.
I needed proof of whether this was worth anything compared to blitting, since plenty of developers, including some on the Adobe front, promote blitting techniques as the one true solution for acceptable mobile performance.

But before we dive into the test results, let’s take a closer look at the test setup and the various techniques that I was planning to sound out and compare. For those of you aching to learn about the final results, I can already tell you at this point: get ready for some surprises!

The Setup

I used the following setup for my tests:

  • Device: iPad 1st generation
  • iOS version: 4.3.5
  • Compiled with: Flex SDK 4.5.1
  • Packaged with: AIR SDK 3.0
  • Build type: ipa-test
  • IDE: FDT 4.3
  • Target FPS: 60

The app will be aiming to deliver 60 FPS, while I’m increasing the load by …

  1. … adding one animated object every 10 frames to a maximum of 100 objects. All objects run the same animation and are being moved 1 pixel to the right on every frame.
  2. … eventually adding collision detection between all objects on stage.

Once all 100 objects are on stage, moving (and detecting collisions), I’m going to measure the “final” FPS the app still delivers.
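The post does not say how the load ramp and the FPS readout were implemented, so the following is only a minimal sketch of one common way to do it. The class name LoadRamp, the spawnObject() hook and the once-per-second sampling are assumptions made for illustration, not the code used in these tests.

```actionscript
package {
    import flash.display.Sprite;
    import flash.events.Event;
    import flash.utils.getTimer;

    // Hypothetical harness: ramps up the load and samples FPS about once per second.
    public class LoadRamp extends Sprite {
        private static const MAX_OBJECTS:int = 100;
        private static const FRAMES_PER_SPAWN:int = 10;

        private var frameCount:int = 0;
        private var objectCount:int = 0;
        private var framesSinceSample:int = 0;
        private var lastSample:int = 0;

        public function LoadRamp() {
            lastSample = getTimer();
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        private function onEnterFrame(e:Event):void {
            frameCount++;
            framesSinceSample++;

            // One new object every 10 frames, capped at 100 objects.
            if (objectCount < MAX_OBJECTS && frameCount % FRAMES_PER_SPAWN == 0) {
                spawnObject();
                objectCount++;
            }

            // Average the frame rate over the time elapsed since the last sample.
            var now:int = getTimer();
            var elapsed:int = now - lastSample;
            if (elapsed >= 1000) {
                var fps:Number = framesSinceSample * 1000 / elapsed;
                trace(objectCount + " objects: " + fps.toFixed(1) + " FPS");
                framesSinceSample = 0;
                lastSample = now;
            }
        }

        // Placeholder (assumption): create one animated object with the technique under test.
        protected function spawnObject():void {
        }
    }
}
```

With a harness like this, the “final” FPS would simply be the value traced once objectCount has reached 100.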

Animation Techniques

Now that we have the setup, here are the techniques that are up for testing and comparison:

Frame Blitting …

I guess everybody has already heard of “blitting” as a technique for displaying animations: the basic concept is to have one large (probably screen-filling) bitmap into which the pixels of each animated object’s current animation frame are copied directly from the sprite sheet:

bitmap.bitmapData.copyPixels(frame.bitmapData, ...);

A sprite sheet (or tile sheet or animation sheet) is a bitmap containing all frames of all animations an object has. In “the old days” the dimensions of those bitmaps, as well as those of each animation frame, used to be a power of 2 (e.g. 4, 8, 16, 32, 64 etc.) because those were the easiest sizes for computers to work with. As a negative side effect, though, the bitmaps would naturally increase in size to match the next highest power of 2 (POT) dimension.

In this setup, I’m going to test both: bitmaps and frames with POT based dimensions, and bitmaps and frames with even but not necessarily POT based dimensions.

In my blitting setup, I wrote an object that uses little more than a rectangle to keep track of its own position and dimensions, along with the position and size of its current animation frame on the sprite sheet.
In a global enter frame loop I use this information to move all objects and copy the pixels of their current frames into one large bitmap on stage.
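A loop of this kind could look roughly like the sketch below. It is not the code used in the tests: the class name BlitRenderer, the addObject() helper and the per-frame clearing of the canvas are assumptions, and advancing the animation frame (i.e. changing each object's source rectangle) is left out to keep it short.

```actionscript
package {
    import flash.display.Bitmap;
    import flash.display.BitmapData;
    import flash.display.Sprite;
    import flash.events.Event;
    import flash.geom.Point;
    import flash.geom.Rectangle;

    // Hypothetical blitting renderer: one on-stage bitmap, no display object per game object.
    public class BlitRenderer extends Sprite {
        private var canvas:BitmapData;
        private var sheet:BitmapData;
        // Per object: source rectangle on the sheet and destination position on the canvas.
        private var frameRects:Vector.<Rectangle> = new Vector.<Rectangle>();
        private var positions:Vector.<Point> = new Vector.<Point>();

        public function BlitRenderer(sheet:BitmapData, canvasWidth:int, canvasHeight:int) {
            this.sheet = sheet;
            canvas = new BitmapData(canvasWidth, canvasHeight, true, 0x00000000);
            addChild(new Bitmap(canvas));
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        public function addObject(frameRect:Rectangle, startX:Number, startY:Number):void {
            frameRects.push(frameRect);
            positions.push(new Point(startX, startY));
        }

        private function onEnterFrame(e:Event):void {
            // lock() defers updates of the on-stage Bitmap until unlock() is called.
            canvas.lock();
            canvas.fillRect(canvas.rect, 0x00000000);
            for (var i:int = 0; i < positions.length; i++) {
                positions[i].x += 1; // move 1 pixel to the right per frame
                // mergeAlpha = true so overlapping transparent frames blend instead of overwrite.
                canvas.copyPixels(sheet, frameRects[i], positions[i], null, null, true);
            }
            canvas.unlock();
        }
    }
}
```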

… vs. Frame Assignment

As an alternative to blitting, I developed a technique that works with bitmaps wrapped into Flash Sprite objects (UPDATE: wrappers removed in Test #5 and #6 to improve speed). I called these objects BitmapClips, since they kind of work like Bitmap based MovieClips.

The original animation sprite sheet is being sliced up into frames (BitmapDatas) and piled up in arrays by a processor which afterwards provides the clip instances with ready-made animations.

In contrast to my blitting approach the clips are actually added to stage. While the Sprite instance is used for positioning and provides all functionality Flash has to offer, animation frames are being swapped by simply assigning the respective BitmapData to the Bitmap instance contained:

bitmap.bitmapData = frame.bitmapData;

So, instead of copying pixels around manually, all I do is set a pointer to another BitmapData instance in memory.
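As a rough illustration of the slicing and assigning steps, here is a minimal sketch. It assumes a sheet laid out as a uniform grid of equally sized frames, which is a simplification (the frames used in the tests have varying sizes), and the names FrameSlicer, slice() and showFrame() are hypothetical rather than part of the BitmapClip implementation.

```actionscript
package {
    import flash.display.Bitmap;
    import flash.display.BitmapData;
    import flash.geom.Point;
    import flash.geom.Rectangle;

    // Hypothetical helper: slices a sheet once, then swaps frames by reference.
    public class FrameSlicer {
        // Cuts a uniform grid sheet into one BitmapData per frame (row by row).
        public static function slice(sheet:BitmapData, frameWidth:int, frameHeight:int):Vector.<BitmapData> {
            var frames:Vector.<BitmapData> = new Vector.<BitmapData>();
            var origin:Point = new Point(0, 0);
            for (var y:int = 0; y + frameHeight <= sheet.height; y += frameHeight) {
                for (var x:int = 0; x + frameWidth <= sheet.width; x += frameWidth) {
                    var frame:BitmapData = new BitmapData(frameWidth, frameHeight, true, 0x00000000);
                    frame.copyPixels(sheet, new Rectangle(x, y, frameWidth, frameHeight), origin);
                    frames.push(frame);
                }
            }
            return frames;
        }

        // Shows the frame at the given index; no pixels are copied here,
        // the Bitmap is merely pointed at another BitmapData.
        public static function showFrame(target:Bitmap, frames:Vector.<BitmapData>, index:int):void {
            target.bitmapData = frames[index % frames.length];
        }
    }
}
```

The pixel copying happens only once, during slicing. All clips running the same animation can share the same sliced frames.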

Considering the size of the DisplayList, this – of course – is not a very effective approach. However, I thought it might eventually consume less CPU and, thus, make a stand against blitting.

Advantages and Disadvantages

There are some advantages and disadvantages I already see for either technique without having actually tested them, and I guess they are worth considering even before looking at the performance:

  • Assigning frames is rather simple to implement and easily combined with game object implementations, but produces overhead with the bitmap and sprite instances each object consists of. Plus, the process is not transparent from the point when you’re assigning the bitmap data.
  • Copying pixels is fast, supposedly, since you’re dodging the Display List. But a good blitting system is complex and can easily waste precious CPU performance when not well elaborated and tested in detail. Plus, you’re facing issues like depth sorting you wouldn’t have using objects in the Display List.

So, I expect implementation time to be a crucial factor here, as well.

Object Pooling

I’ve heard from various people experimenting with Flash to iOS ports that Object Pooling has positive effects, especially on in-game performance.

Object Pooling allows pre-generation and recycling of objects and, thus, avoids costly memory allocation during the game – widely known as a potential performance killer.
Using an object pool for animated game objects, one pre-generates the maximum number of instances displayed simultaneously (estimate if unsure) and adds them to the stage at a point in time when no dips in performance are visible, e.g. during the display of a splash screen. During the game itself, objects are simply drawn from the pool and flushed back into it for recycling. They are never fully removed from the stage, though, to avoid the negative effects on performance bound to this action.
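A pool along these lines might be sketched as follows. This is an assumption-laden illustration, not the pool used in the tests: the class name BitmapPool is hypothetical, it pools plain Bitmap instances, and it hides unused instances with visible = false so they can stay on the display list.

```actionscript
package {
    import flash.display.Bitmap;
    import flash.display.DisplayObjectContainer;

    // Hypothetical pool: instances are created and added to the container up front.
    public class BitmapPool {
        private var container:DisplayObjectContainer;
        private var free:Vector.<Bitmap> = new Vector.<Bitmap>();

        public function BitmapPool(container:DisplayObjectContainer, size:int) {
            this.container = container;
            for (var i:int = 0; i < size; i++) {
                free.push(createInstance());
            }
        }

        // Hands out a pre-built instance; allocates only if the size estimate was too low.
        public function acquire():Bitmap {
            var clip:Bitmap = free.length > 0 ? free.pop() : createInstance();
            clip.visible = true;
            return clip;
        }

        // Returns an instance for recycling; it is hidden, not removed from the container.
        public function release(clip:Bitmap):void {
            clip.visible = false;
            clip.bitmapData = null;
            free.push(clip);
        }

        private function createInstance():Bitmap {
            var clip:Bitmap = new Bitmap();
            clip.visible = false;
            container.addChild(clip);
            return clip;
        }
    }
}
```

Creating the pool with the expected maximum (e.g. new BitmapPool(stage, 100) during a splash screen) moves all allocation and addChild calls out of the game loop.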

Without object pooling, objects are created and attached to the stage on demand. In the following tests, I’ll be using object pooling as a default. I ran tests with object pooling turned off, but couldn’t find any remarkable performance drops. I guess that this test setup is probably not big enough to actually prove this concept. Also it completely lacks recycling objects.

I assume that it’s wise to use object pooling, and I’ll do so although I haven’t really proven its positive effects yet. But that’s ok for now. Let’s head to the tests!

Test #1 Blitting vs. Assigning

In the following test I’m going to compare blitting vs. assigning animation frames.

In all of these tests, Object Pooling is turned on by default, generating all objects at app start and instantly adding them to the stage (not possible with blitting, since it works with one large bitmap instead of addable objects).

Both on-device render modes were tested: GPU and CPU based rendering.

At first, the sprite sheet and all animation frames are not power of 2 based but have even dimensions:

Technique: Blitting
Sheet & frame dims: no POT

Max FPS: 23 (GPU) 32 (CPU)
Final FPS: 19 (GPU) 23 (CPU)

Technique: Assigning
Sheet & frame dims: no POT

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 31 (GPU) 35 (CPU)

GPU: Stable at 60 FPS until ~75 objects. Rapid FPS loss afterwards.
CPU: Stable at 60 FPS until ~25 objects. Slow FPS loss afterwards.

Conclusions

A rather unexpected result: while blitting manages to deliver a mere 32 FPS in CPU mode even with as little as 1 object on stage, the alternative assigning technique manages to deliver a full 60 FPS in both rendering modes.
The application performs worst when blitting in GPU mode and runs best when assigning in GPU mode, as long as we’re below ~75 objects.

Obviously, blitting does not provide optimal results for running Flash ported sprite sheet based animations on the iOS platform. We achieve significantly better results assigning the animation frames to bitmap instances on stage.

Judging by the final FPS, rendering in CPU mode seems to perform better than rendering in GPU mode, at least with up to 100 objects.

Test #2 POT dimensions

To rule out that sprite sheet dimensions have a negative effect on the results from Test #1, I’m going to use a sprite sheet whose animation frames have POT dimensions: every frame now has the smallest possible POT dimension of 128×128. Before, they were either 80×90 or 80×120.
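To make the cost of padding concrete, here is a small sketch that rounds a dimension up to the next power of 2 and computes by how much a single frame grows. The class and function names are hypothetical. The numbers are per frame only; how much the whole sheet grows also depends on how the frames are packed.

```actionscript
package {
    // Hypothetical helper for power-of-2 (POT) arithmetic.
    public class PotMath {
        // Smallest power of 2 that is greater than or equal to n (for n >= 1).
        public static function nextPowerOfTwo(n:int):int {
            var p:int = 1;
            while (p < n) {
                p <<= 1;
            }
            return p;
        }

        // Factor by which the pixel count grows when a w×h frame is padded to POT dimensions.
        public static function paddingFactor(w:int, h:int):Number {
            return (nextPowerOfTwo(w) * nextPowerOfTwo(h)) / (w * h);
        }
    }
}
```

For the frame sizes mentioned above, nextPowerOfTwo() returns 128 for 80, 90 and 120 alike, so an 80×90 frame (7,200 pixels) and an 80×120 frame (9,600 pixels) both become 128×128 (16,384 pixels), i.e. roughly 2.28 and 1.71 times the original pixel count.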

As a side effect, I had to increase the sprite sheet’s dimensions to 512×1024 pixels and thus inflated the amount of pixels in use by 100%. Let’s see how that affects performance:

Technique: Blitting
Sheet & frame dims: POT

Max FPS: 24 (GPU) 34 (CPU)
Final FPS: 16 (GPU) 19 (CPU)

Technique: Assigning
Sheet & frame dims: POT

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 18 (GPU) 27 (CPU)

GPU: Stable at 60 FPS until ~40 objects. Sudden drop from 40 FPS to 20 FPS at ~78 objects.
CPU: Stable at 60 FPS until ~20 objects. Slow FPS loss afterwards.

Conclusions

The performance is significantly worse than before – with either technique.

If POT based dimensions have a positive effect on processing animation frames, it is apparently overruled by the massive performance loss caused by processing the increased frame sizes.

So, from this point forth, we can safely ditch blitting in future tests and concentrate on the alternative technique: assigning animation frames.

Test #3 Collision detection

At this point, I’m going to bring in collision detection, continually testing all objects on stage against one another (no double testing), as their number grows.
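Testing all objects against one another without double testing boils down to visiting every unordered pair once. A minimal sketch, assuming the objects are kept in a Vector of display objects (the class and function names are hypothetical):

```actionscript
package {
    import flash.display.DisplayObject;

    // Hypothetical helper: brute-force pairwise hit testing.
    public class PairwiseHitTest {
        // Tests every unordered pair exactly once and returns the number of overlapping pairs.
        public static function countCollisions(objects:Vector.<DisplayObject>):int {
            var hits:int = 0;
            var n:int = objects.length;
            for (var i:int = 0; i < n - 1; i++) {
                // Starting at i + 1 skips self tests and pairs that were already checked.
                for (var j:int = i + 1; j < n; j++) {
                    if (objects[i].hitTestObject(objects[j])) {
                        hits++;
                    }
                }
            }
            return hits;
        }
    }
}
```

The number of tests per frame is n(n-1)/2, so it grows quadratically: 45 tests for 10 objects, 1,225 for 50 and 4,950 for all 100 objects.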

With collision detection involved, I kinda expect the GPU to fail this one pretty bad.

Just to get an impression of what impact larger sprite sheets and animation frames have in each render mode, I used the POT dimensioned sheet again in a second test run:

Sheet & frame dims: no POT
Collision detection: hitTestObject

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 17 (GPU) 14 (CPU)

GPU: Stable at 60 FPS until ~55 objects. Drop below 30 FPS at ~75 objects.
CPU: Stable at 60 FPS until ~20 objects. Drop below 30 FPS at ~55 objects.

Sheet & frame dims: POT
Collision detection: hitTestObject

Max FPS: 60 (GPU) 59 (CPU)
Final FPS: 17 (GPU) 12 (CPU)

GPU: Stable at 60 FPS until ~40 objects. Drop below 30 FPS at ~75 objects.
CPU: Drop below 30 FPS at ~45 objects.

Conclusions

As expected, the FPS go down massively, but what really surprises me is that the GPU manages to handle up to 55 objects with as much as 60 FPS before starting to give in. It even manages to deliver more FPS with 100 objects on stage – an object amount the CPU used to dominate.

What’s more: using larger images with twice the pixels on average reduces the number of objects that can be handled at 60 FPS by only about 25% (from ~55 to ~40).

What’s most interesting, or rather remarkable, is the fact that no matter the image size, the GPU delivers at least 30 FPS until it reaches approx. 75 objects. Then it drops drastically (see also Test #2).

Test #4 Up Scaling

Another tip I got from Marvin Blase aka @beautifycode is supposed to save precious app size and memory: identify fast moving objects in your game and create the sprite sheets for these objects at half the size they’re supposed to be displayed. Within the game, these animations are then scaled up to 200% using the instances’ scaleX and scaleY properties.
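In code, the technique amounts to very little. The following sketch assumes a frame that was exported at half its display size; the class name UpscaledBitmap and its create() function are hypothetical.

```actionscript
package {
    import flash.display.Bitmap;
    import flash.display.BitmapData;

    // Hypothetical helper: displays a half-size frame at 200%.
    public class UpscaledBitmap {
        public static function create(halfSizeFrame:BitmapData):Bitmap {
            var clip:Bitmap = new Bitmap(halfSizeFrame);
            // Scaling both axes by 2 restores the intended on-screen size.
            clip.scaleX = 2;
            clip.scaleY = 2;
            return clip;
        }
    }
}
```

Since both width and height are halved, such a frame holds only a quarter of the pixels of its full-size counterpart.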

The declared aim of the following test was to identify the effects (positive and negative) of moving scaled up sprites and detecting collisions among them.

In the first run, I used a scaled down version of the compact sprite sheet with no POT dimensions. Afterwards I used a 50% reduced version of the POT sprite sheet as a comparison and to top things off, I even turned collision detection back on in the last test run:

Sheet & frame dims: no POT
Collision detection: off

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 31 (GPU) 22 (CPU)

Sheet & frame dims: POT
Collision detection: off

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 17 (GPU) 12 (CPU)

Sheet & frame dims: no POT
Collision detection: hitTestObject

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 19 (GPU) 12 (CPU)

GPU: Stable at 60 FPS until ~55 objects. Drop below 30 FPS at ~80 objects.

Observations

While the graphics looked rather blocky in CPU mode, they were slightly blurred on the GPU, which made them look rather smooth. I can well believe that, in fast motion, this doesn’t look much different from the unscaled visuals.

Conclusions

While the GPU seems unaffected by scaling up the images, the CPU seems to suffer tremendously and loses a full 13 frames compared to Test #1.

As expected, the results are even worse with the larger frame images.

What’s interesting to see, though, is that with collision detection the GPU actually profits from this technique and gains an additional 2 FPS compared to Test #3, where we used the regularly sized images. Also, the app seemed to be capable of displaying 5 more objects before dropping below the commonly used frame rate of 30 FPS and achieved an actual 90 objects before falling below 25 FPS.

[UPDATE]

Test #5 No Sprite Wrappers

With the bitmaps wrapped into Sprite containers, the above setup sure had improvement potential as Damian correctly pointed out in the comments. So today I followed that exact same TODO that I found in my comments ;) and removed the Sprite wrapper around each animation and, thus, flattened the Display List by 100 objects.

I basically ran a mixture of Test #1 and Test #3 with this (only assigning bitmapDatas and using no POT sprite sheets), to see again what a difference 100 Sprites can make, and was surprised as I managed to squeeze out even more frames. But the CPU still didn’t hold a candle to the performance on the GPU, so, to save some time, I started neglecting it afterwards.

I started with collision detection turned off, to directly compare the results to Test #1, then turned it on and, eventually, even applied the scaling technique from Test #4. With the last test I started neglecting CPU comparisons since, so far, they always turned out worse:

Collision detection: off
Scaling technique: off

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 31 (GPU) 35 (CPU)

GPU: Stable at 60 FPS until ~75 objects.
CPU: Framerate dropping right away.

Collision detection: hitTestObject
Scaling technique: off

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 20 (GPU) 15 (CPU)

GPU: Stable at 60 FPS until ~50 objects. Drop below 30/25 FPS at ~80/~90 objects.

Collision detection: hitTestObject
Scaling technique: on

Max FPS: 60 (GPU)
Final FPS: 19 (GPU)

Stable at 60 FPS until ~50 objects. Drop below 30/25 FPS at ~78/~87 objects.

Conclusions

This is odd: 100 missing Sprite wrappers seem to have no impact at all as long as there is no collision detection in play. The results are exactly the same as the ones we received in Test #1.

On the other hand, when collision detection comes into play, the missing wrappers squeeze out one more frame on the CPU and 3 frames on the GPU, lifting the frame rate up to a magnificent 20. That’s huge!
The details show that we’re able to animate, move and hitTest around 80 objects on stage before dropping below the crucial frame rate of 30, which is 5 more than with the wrappers around the bitmaps.

90 objects at 25 FPS might actually be something we can work with in most games considering we’ll be having large background graphics and other elements that might lower the frame rate some more. Great!

Sadly, the scaling method does not seem to profit from the missing Sprite wrappers, for whatever reason.

Test #6 Rotation

Rotation always comes into play at some point in a game, so I wanted to check this as well. So, in the following test, I rotate each object by one degree per frame. Note that this test also runs without any Sprite wrappers which makes the numbers a little hard to compare against Tests #1 to #4 but the important comparison is against Test #5 anyhow, so I think that’s alright.
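The per-frame work for this test can be sketched as below, assuming the clips are plain Bitmap instances kept in a Vector. The class name Spinner and its step() function are hypothetical. Note that a Bitmap rotates around its registration point, which is its top left corner, unless it is offset inside a container.

```actionscript
package {
    import flash.display.Bitmap;

    // Hypothetical helper: call step() once per frame from an ENTER_FRAME handler.
    public class Spinner {
        public static function step(clips:Vector.<Bitmap>, smoothing:Boolean):void {
            for each (var clip:Bitmap in clips) {
                clip.rotation += 1; // one degree per frame
                clip.smoothing = smoothing; // toggled between test runs
            }
        }
    }
}
```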

First, I let all 100 objects rotate with collision detection turned off. Afterwards I turned smoothing on, which, by default, is set to false in my setup. In the last run, I turned collision detection on and smoothing off again to be able to compare the results to the first run:

Collision detection: off
Smoothing: false

Max FPS: 60 (GPU)
Final FPS: 31 (GPU)

Drop below 60 FPS at ~80 objects.

Collision detection: off
Smoothing: true

Max FPS: 60 (GPU)
Final FPS: 31 (GPU)

Drop below 60 FPS at ~80 objects.

Collision detection: hitTestObject
Smoothing: false

Max FPS: 60 (GPU)
Final FPS: 17 (GPU)

Stable at 60 FPS until ~50 objects. Drop below 30/25 FPS at ~78/~85 objects.

Observations

Setting smoothing to true appears to have no impact whatsoever on the result – not only in terms of FPS but also visually. The graphics look smoothed in either setup, something that apparently comes naturally when running in GPU render mode and is also responsible for the blurred graphics in Test #4.

Conclusions

While smoothing appears to have no effect on either the frame rate or the visual results, the GPU handles the rotation rather effortlessly. It loses 3 frames in comparison to the improved BitmapData assignment results from Test #5. What seems a little odd, though, is that it seems to be capable of handling more rotating clips at the same speed than clips that are not rotating. I guess this is due to the higher amount of overlapping pixels in the runs with rotation, which results in fewer pixels changing per frame. It’s the only explanation I can imagine at this point.

Test #7 Ad Hoc Version

As a final test, I decided to create an ad hoc version from the ones that delivered the best results, which were the ones from Test #5 using the assign technique without scaling. As before, I turned on collision detection in the second run:

Collision detection: off

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 31 (GPU) 36 (CPU)

GPU: Stable at 60 FPS until ~70 objects. Significant FPS drop (around 20) at ~85 objects.
CPU: Stable at 60 FPS until ~20 objects.

Collision detection: hitTestObject

Max FPS: 60 (GPU) 60 (CPU)
Final FPS: 24 (GPU) 17 (CPU)

GPU: Stable at 60 FPS until ~60 objects. Drop below 30/25 FPS at ~90/~98 objects.

Conclusions

Going from the “test” to the “ad hoc” version has blessed us with another 4 frames. But the performance gain seems to kick in mainly when there is more involved than the mere display of animations.

I tested blitting as well and got pretty much the same disappointing numbers as with the test version, which is why I didn’t mention it explicitly here.

However, we’ve now received even better results than before, running 100 objects all hit testing one another at a marvellous 24 FPS. I guess these numbers are reassuring enough to finally get me started on porting Save the Maidens.
[/UPDATE]

Final Conclusions

So, after all this testing, I think I can safely sum up the results into the following rules and guidance tips when developing games for iOS:

  • Forget about blitting – assign sliced BitmapDatas instead.
  • Forget about power of 2 dimensions – pack your sprite sheets tightly and save pixels (keep even dimensions).
  • GPU render mode works best for most setups.
  • Exception: no collision detection / no scaling involved: CPU mode may work better.
  • UPDATE: Keep the Display List flat: 100 Sprite wrappers made a difference of 3 FPS on the GPU.
  • UPDATE: Forget about smoothing Bitmaps – it makes no difference on the GPU.
  • UPDATE: Test with release (ad hoc) versions early. There’s hidden performance in there.
  • Try scaling down sprite sheets of fast objects by half and scale them up in code again (GPU only).
  • Object Pooling may have positive effects on the large scale or long run.

Well, at least these results apply to 1st generation iPads. Today, I received shipping information about my brand new iPhone 4S and, I guess, once I find the time, I’m gonna rerun some of these tests on it as well as on my very old iPhone 3G (no S ;)) and then update this post.

I hope these results help you plan your first or next Flash to iOS port a bit, or at least save you some time finding the right setup for your project.

If you have other or additional findings or found flaws in my setup, I’d be happy to read about them in the comments so we can all learn and improve.