Pure Wasm Life 2: Optimizing WebAssembly and Canvas
You know I couldn't leave it alone
Note: This post is part 2 of my WebAssembly Game of Life series; if you haven't already, you can read part 1 here.
So when I posted my last WebAssembly Game of Life post, I had a few improvements I wanted to make to it. Some I mentioned in that post (like bit-packing the board so it takes less memory), and some I mentioned to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the drawing code stays performant with more cells). This post will not address all of those, but I have done some significant work to make my implementation faster.
I'll get into the details in a bit, but first let's show off the new board! This version is drawing at one pixel per game cell (whereas in the old version each game cell was 3 pixels wide), so it's simulating and drawing nine times as many cells without a noticeable difference in performance:
My performance target for this was to still run at 60 frames per second on my 2019 MacBook Pro. The previous version averaged around 10-11ms per frame (well within the target 16.6ms for keeping that framerate), and I'm happy to report that this version takes about 11-12ms per frame on that same machine!
Let's break down what I changed to get here.
Overview
A note on performance testing: JavaScript is (in most browsers these days) a JIT-executed language with dynamic, optimizing recompilation. This can make performance testing of it very hard, and the performance testing I do for this post is not particularly rigorous.
With that said, all benchmarks were done on the same hardware, in the same power state, with the same browser, and the same applications open. While not an entirely controlled environment, this is good enough to get the overall picture of how much my various attempts at optimization impacted things.
To get an idea of where things started, I ran the previous post's code unmodified, but with the larger 800x600 board size I wanted to target for this post. I then undertook a few different passes at optimization: first minor WebAssembly changes to the game update code, then rewriting the drawing code with WebGL, and lastly some more drastic WebAssembly rewrites.
You can see the average time each of these versions took on my MacBook in the table below:
| Version | Time taken for game tick | Time taken to draw board | Computed framerate |
|---|---|---|---|
| Original | 46ms | 70ms | 8.6 fps |
| Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
| WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
| Major WASM Changes | 11.8ms | 0.4ms | 81.9* fps |

* This computed framerate is greater than my monitor can display, so the browser caps it to 60 fps.
Right off the bat: there were no simulation / game tick changes between the second and third rows, but we do see a change in the measured tick time - this indicates that my measurement method has at least a millisecond or two of inaccuracy, so we can conclude pretty quickly that my initial WebAssembly changes were not particularly useful as optimizations.
With that said, the WebGL drawing change was very significant, and my additional WebAssembly changes following it were definitely also a huge part of how I reached my target framerate, so let's talk about those in turn.
Drawing Game of Life with WebGL
Taking a look at my old drawing code, I do want to start out and say that I didn't do anything obviously wrong. Iterating through every cell on the board is an O(n) algorithm, but for small board sizes it worked great. However, major browsers have had support for WebGL since about mid-2013 (even earlier if you didn't care about Internet Explorer!), so it's safe to say that I could stand to use the GPU for this.
I'm not going to go through all the WebGL initialization code, but let's briefly look at what JavaScript code is run on every frame:
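It amounts to something like this (simplified, and with stand-in names for my exports and globals - wasmExports, getBoardPointer, boardWidth, and so on):

```js
// Simplified sketch - the export/variable names (wasmExports, getBoardPointer,
// boardWidth, boardHeight) are stand-ins rather than the exact ones used.
function drawBoard() {
  // Ask the Wasm module where the board lives in its linear memory
  const ptr = wasmExports.getBoardPointer();
  const cells = new Uint8Array(
    wasmExports.memory.buffer, ptr, boardWidth * boardHeight);

  // Upload the raw bytes straight into the texture: one byte per cell
  gl.texImage2D(
    gl.TEXTURE_2D, 0, gl.ALPHA, boardWidth, boardHeight, 0,
    gl.ALPHA, gl.UNSIGNED_BYTE, cells);

  // Draw a full-screen quad that samples that texture
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
}
```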
And that's it!
No loops, no iterating over the board - it's basically nothing. I just ask the WebAssembly part of my code for a pointer, construct an array view into the memory buffer, and pass it directly to WebGL as a texture.
This simplicity comes from the fact that I'm storing my board as one byte per cell, which means I can use the gl.ALPHA and gl.UNSIGNED_BYTE flags to tell WebGL to expect a texture of exactly that format, without any preliminary data transformation.
Then the WebGL fragment shader just needs to read that data from the texture's alpha channel like so:
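Here's roughly what that shader looks like (the uniform and varying names are approximate):

```glsl
precision mediump float;

uniform sampler2D u_board;   // the board texture uploaded each frame
uniform vec4 u_aliveColor;   // colors pulled from the page's CSS
uniform vec4 u_deadColor;
varying vec2 v_texCoord;

void main() {
  // Each cell byte is 0 or 1, but sampling normalizes bytes to 0.0-1.0,
  // so scale back up to get a clean 0.0 or 1.0
  float alive = texture2D(u_board, v_texCoord).a * 255.0;
  gl_FragColor = mix(u_deadColor, u_aliveColor, alive);
}
```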
There's a correction factor here (multiplying the value by 255): my board data only has 0s and 1s in each cell byte, but WebGL normalizes the byte range 0-255 down to 0.0-1.0 when sampling the texture, so a live cell comes back as 1/255 and needs to be scaled back up.
The shader also takes two uniform parameters for what colors to draw with, so I will show you where those get set:
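That code is roughly the following (the CSS custom property names, the parseColor helper, and the uniform location variables are stand-ins for the real ones):

```js
// Sketch - the property names, parseColor helper, and uniform locations
// are placeholders.
function updateColors() {
  const styles = getComputedStyle(document.body);
  const alive = parseColor(styles.getPropertyValue('--cell-alive-color'));
  const dead = parseColor(styles.getPropertyValue('--cell-dead-color'));

  // Push the parsed [r, g, b, a] values into the shader's uniforms
  gl.uniform4fv(aliveColorLocation, alive);
  gl.uniform4fv(deadColorLocation, dead);
}
```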
This is again pretty simple, just pulling some values out of the CSS properties, parsing them, and then updating WebGL's parameters. This gets run once on page initialization, and then once each time the "Appearance" button in the page navigation is clicked and the page switches between light and dark theme.
That's pretty much it for the drawing, so let's talk about how I made the actual game update tick faster now.
Optimizing the actual WebAssembly code
So I don't know that much about how WebAssembly execution is implemented in browsers, but I know enough about actual CPU architectures to make some reasonable inferences.
As I started thinking about more significant algorithm changes I could make, I suspected that the largest contributor to the time taken in my old algorithm was that accessing WebAssembly linear memory takes longer than accessing local or global variables. At an implementation level this could be because of how the Wasm memory relates to the CPU cache, but I didn't really care about that - I was just curious to see if reducing the number of memory operations my algorithm made could provide more of a speed-up.
I knew I was already pretty inefficient in this area - my previous algorithm made 10 memory accesses per cell (8 to check neighbor states, 1 to check the current cell's state, and 1 to store the updated cell's state), so I suspected I could get that much lower.
Reducing memory accesses
I started by dividing up my $getNewValueAtPosition function into a $getNeighborCount and a $getOutcomeFromCountAndCurrent, so that I could have different logic for counting the number of alive neighbors on border cells and internal cells while still reusing the logic for actually figuring out what happens to a cell.
Then the $getNeighborCount function became $getNeighborCountBoundary and $getNeighborCountCenter - because for optimizing memory access, I didn't want to have to worry about a multi-byte read landing partially outside the board. I decided the old algorithm would work well enough for the boundary cells for that reason.
So my updated $getNewValueAtPosition function looks like this:
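Or at least, roughly like this - the $boardHeight global and the exact instruction ordering here are approximations, but this is the shape of it:

```wat
;; Sketch: names like $boardHeight and the exact ordering are approximate.
(func $getNewValueAtPosition (param $x i32) (param $y i32) (result i32)
  (if (result i32 i32)
    ;; is the cell on any of the four boundaries?
    (i32.or
      (i32.or
        (i32.eqz (local.get $x))
        (i32.eqz (local.get $y)))
      (i32.or
        (i32.eq (local.get $x)
                (i32.sub (global.get $boardWidth) (i32.const 1)))
        (i32.eq (local.get $y)
                (i32.sub (global.get $boardHeight) (i32.const 1)))))
    (then (call $getNeighborCountBoundary (local.get $x) (local.get $y)))
    (else (call $getNeighborCountCenter (local.get $x) (local.get $y))))
  ;; both calls leave (count, currentState) on the stack
  (call $getOutcomeFromCountAndCurrent))
```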
It first checks to see if the cell is along any of the four boundaries, and if so calls $getNeighborCountBoundary. If not, it calls $getNeighborCountCenter. Either of these is expected to return two values, which are passed into $getOutcomeFromCountAndCurrent.
You might be wondering why a neighbor count function would return two values, and that's because it turns out it's more efficient to retrieve the current cell's state along with the neighbor states (as we'll see in a bit) - so the function names are admittedly a bit of a misnomer: they get the count of alive neighbors and whether the current cell is alive as well.
So the $getNeighborCountBoundary function is the exact same as what I used for every cell last time, but let's look at that new $getNeighborCountCenter function:
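Roughly, and assuming the board starts at offset 0 in memory with $boardWidth holding the row width in bytes, it's shaped like this:

```wat
;; Sketch - assumes the board starts at offset 0 and $boardWidth is the
;; row width in bytes (one byte per cell).
(func $getNeighborCountCenter (param $x i32) (param $y i32) (result i32 i32)
  (local $rowAbove i32)
  (local $rowCenter i32)
  (local $rowBelow i32)
  (local $base i32)
  ;; address of the cell one to the left: y * boardWidth + (x - 1)
  (local.set $base
    (i32.add
      (i32.mul (local.get $y) (global.get $boardWidth))
      (i32.sub (local.get $x) (i32.const 1))))
  ;; one 32-bit load per row; the three cells we care about land in the
  ;; low three bytes of each value
  (local.set $rowAbove
    (i32.load (i32.sub (local.get $base) (global.get $boardWidth))))
  (local.set $rowCenter (i32.load (local.get $base)))
  (local.set $rowBelow
    (i32.load (i32.add (local.get $base) (global.get $boardWidth))))
  ;; every live cell is a 0x01 byte, so popcnt of the masked rows counts
  ;; them directly (the center mask skips the middle byte - that's us)
  (i32.add
    (i32.add
      (i32.popcnt (i32.and (local.get $rowAbove) (i32.const 0x00FFFFFF)))
      (i32.popcnt (i32.and (local.get $rowCenter) (i32.const 0x00FF00FF))))
    (i32.popcnt (i32.and (local.get $rowBelow) (i32.const 0x00FFFFFF))))
  ;; second return value: the current cell, from the byte we already loaded
  (i32.and (i32.shr_u (local.get $rowCenter) (i32.const 8)) (i32.const 1)))
```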
The first major change to notice is that I'm using the regular i32.load instead of the i32.load8_u instruction I was using before. This instruction pulls back four bytes (32 bits) at a time, and since the cells are in consecutive bytes within each row, that means I just need one load instruction per row.
Each row is loaded after computing its memory offset using the $boardWidth variable, and once loaded, the cells I'm looking for should be in the three least significant bytes (as multi-byte reads in WASM are little-endian).
Following the load calls I can use some bitwise operations to actually isolate and count the live cell bits, and then I use another bitwise AND to take the already-retrieved value for the center row and isolate the current cell's state. So overall, that's three memory reads per cell (and there will be one more to write it back, so four total per cell).
One last micro-optimization
Lastly, let's take a quick look at that new $getOutcomeFromCountAndCurrent function. Of all the smaller changes I made, this is the one that actually seemed to make the most difference, and it relies on reducing the amount of branching there is in the cell determination.
First, here's what this block looked like in the last post for comparison:
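Reconstructing it roughly (with $count already holding the live-neighbor count, and the cell coordinates still in $x and $y), it was shaped like this:

```wat
;; Roughly the part 1 logic, condensed: three branches on the neighbor
;; count, plus a fresh memory read to fetch the current cell's state.
(if (i32.lt_u (local.get $count) (i32.const 2))
  (then (return (i32.const 0))))   ;; underpopulation: dies
(if (i32.gt_u (local.get $count) (i32.const 3))
  (then (return (i32.const 0))))   ;; overpopulation: dies
(if (i32.eq (local.get $count) (i32.const 3))
  (then (return (i32.const 1))))   ;; exactly three neighbors: alive
;; exactly two neighbors: the cell keeps its current state
(return
  (i32.load8_u
    (i32.add
      (i32.mul (local.get $y) (global.get $boardWidth))
      (local.get $x))))
```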
It's certainly not bad - I mean, there are three different if branches, and an extra memory access here to look up the current state even though we could have already gotten that before... but it's readable, right?
Now let's look at the new function which replaced this block:
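In rough form (parameter names approximate), the whole thing boils down to a single select:

```wat
(func $getOutcomeFromCountAndCurrent
      (param $count i32) (param $current i32) (result i32)
  ;; with exactly two neighbors the cell keeps its current state;
  ;; otherwise it's alive only with exactly three neighbors - no branches
  (select
    (local.get $current)
    (i32.eq (local.get $count) (i32.const 3))
    (i32.eq (local.get $count) (i32.const 2))))
```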
Yeah, okay - no matter how much I think my first implementation was fine, I can't really defend it when the alternative is half as many instructions, several of them are constant declarations, and it has literally no branching.
I wish I had known about the i32.popcnt and select instructions before; they're incredibly convenient for stuff like this. This change wasn't nearly as significant as the memory access reduction before it, but contributed about a 5ms reduction to my frame times overall.
What's next?
I will probably be continuing with this project again soon. Maybe not as the immediate next post I write, but I am planning on trying a few more significant optimizations.
The most notable one is that I'd still like to bit-pack the board, and update my main cell loop to update several cells at once. Off the top of my head, it seems like I could use this to extend my 3-read/1-write update strategy to do up to 16 cells at once (since I need one more cell to either side, and memory addresses must be byte-aligned).
I also want to experiment with tracking the active regions of the board, as an area where nothing changed in the last frame (and nothing changed around it) won't change in the current frame either. Since I'm starting with an entirely random board I don't know how much performance that will get me, but it's worth a try.
Anyways, that's all for now - I hope you enjoyed this continued exploration of by-hand WebAssembly programming.