Pure Wasm Life 2: Optimizing WebAssembly and Canvas
You know I couldn't leave it alone
Note: This post is part 2 of my WebAssembly Game of Life series; if you haven't already, you can read part 1 here.
So when I posted my last WebAssembly Game of Life post, I had a few improvements I wanted to make to it. Some I mentioned in that post (like bit-packing the board so it takes less memory), and some I mentioned to friends who read it (like switching from Canvas2D to WebGL or WebGPU so the drawing code stays performant with more cells). This post will not address all of those, but I have done some significant work to make my implementation faster.
I'll get into the details in a bit, but first let's show off the new board! This version is drawing at one pixel per game cell (whereas in the old version each game cell was 3 pixels wide), so it's simulating and drawing nine times as many cells without a noticeable difference in performance:
My performance target for this was to still run at 60 frames per second on my 2019 MacBook Pro. The previous version averaged around 10-11ms per frame (well within the target 16.6ms for keeping that framerate), and I'm happy to report that this version takes about 11-12ms per frame on that same machine!
Let's break down what I changed to get here.
Overview
A note on performance testing: JavaScript is (in most browsers these days) a JIT-executed language with dynamic, optimizing recompilation. This can make performance testing of it very hard, and the performance testing I do for this post is not particularly rigorous.
With that said, all benchmarks were done on the same hardware, in the same power state, with the same browser, and the same applications open. While not an entirely controlled environment, this is good enough to get the overall picture of how much my various attempts at optimization impacted things.
To get an idea of where things started, I ran the previous post's code unmodified, but with the larger 800x600 board size I wanted to target for this post. I then undertook a few different passes at optimization: first minor WebAssembly changes to the game update code, then rewriting the drawing code with WebGL, and lastly some more drastic WebAssembly rewrites.
You can see the average time each of these versions took on my MacBook in the table below:
| Version | Time taken for game tick | Time taken to draw board | Computed framerate |
|---|---|---|---|
| Original | 46ms | 70ms | 8.6 fps |
| Minor WASM Changes | 45.8ms | 70ms | 8.6 fps |
| WebGL Drawing | 44.8ms | 0.4ms | 22.1 fps |
| Major WASM Changes | 11.8ms | 0.4ms | 81.9* fps |

* This computed framerate is greater than my monitor can display, so the browser caps it to 60 fps.
Right off the bat: there were no simulation / game tick changes between the second and third rows, but we do see a change in the measured tick time - this indicates that my measurement method has at least a millisecond or two of inaccuracy, so we can conclude pretty quickly that my initial WebAssembly changes were not particularly useful as optimizations.
With that said, the WebGL drawing change was very significant, and my additional WebAssembly changes following it were definitely also a huge part of how I reached my target framerate, so let's talk about those in turn.
Drawing Game of Life with WebGL
Taking a look at my old drawing code, I do want to start out and say that I didn't do anything obviously wrong. Iterating through every cell on the board is an O(n) algorithm, but for small board sizes it worked great. However, major browsers have had support for WebGL since about mid-2013 (even earlier if you didn't care about Internet Explorer!), so it's safe to say that I could stand to use the GPU for this.
I'm not going to go through all the WebGL initialization code, but let's briefly look at what JavaScript code is run on every frame:
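It amounts to something like this (simplified, and with stand-in names for my exports and globals - wasmExports, getBoardPointer, boardWidth, and so on):

```js
// Simplified sketch - the export/variable names (wasmExports, getBoardPointer,
// boardWidth, boardHeight) are stand-ins rather than the exact ones used.
function drawBoard() {
  // Ask the Wasm module where the board lives in its linear memory
  const ptr = wasmExports.getBoardPointer();
  const cells = new Uint8Array(
    wasmExports.memory.buffer, ptr, boardWidth * boardHeight);

  // Upload the raw bytes straight into the texture: one byte per cell
  gl.texImage2D(
    gl.TEXTURE_2D, 0, gl.ALPHA, boardWidth, boardHeight, 0,
    gl.ALPHA, gl.UNSIGNED_BYTE, cells);

  // Draw a full-screen quad that samples that texture
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
}
```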
And that's it!
No loops, no iterating over the board - it's basically nothing. I just ask the WebAssembly part of my code for a pointer, construct an array view into the memory buffer, and pass it directly to WebGL as a texture.
This simplicity comes from the fact that I'm storing my board as one byte per cell, which means I can use the gl.ALPHA and gl.UNSIGNED_BYTE flags to tell WebGL to expect a texture of exactly that format, without any preliminary data transformation.
Then the WebGL fragment shader just needs to read that data from the texture's alpha channel like so:
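Here's roughly what that shader looks like (the uniform and varying names are approximate):

```glsl
precision mediump float;

uniform sampler2D u_board;   // the board texture uploaded each frame
uniform vec4 u_aliveColor;   // colors pulled from the page's CSS
uniform vec4 u_deadColor;
varying vec2 v_texCoord;

void main() {
  // Each cell byte is 0 or 1, but sampling normalizes bytes to 0.0-1.0,
  // so scale back up to get a clean 0.0 or 1.0
  float alive = texture2D(u_board, v_texCoord).a * 255.0;
  gl_FragColor = mix(u_deadColor, u_aliveColor, alive);
}
```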
There's a correction factor here (multiplying the value by 255): my board data only has 0s and 1s in each cell byte, but WebGL normalizes the byte range 0-255 down to 0.0-1.0 when sampling the texture, so a live cell comes back as 1/255 and needs to be scaled back up.
The shader also takes two uniform parameters for what colors to draw with, so I will show you where those get set:
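That code is roughly the following (the CSS custom property names, the parseColor helper, and the uniform location variables are stand-ins for the real ones):

```js
// Sketch - the property names, parseColor helper, and uniform locations
// are placeholders.
function updateColors() {
  const styles = getComputedStyle(document.body);
  const alive = parseColor(styles.getPropertyValue('--cell-alive-color'));
  const dead = parseColor(styles.getPropertyValue('--cell-dead-color'));

  // Push the parsed [r, g, b, a] values into the shader's uniforms
  gl.uniform4fv(aliveColorLocation, alive);
  gl.uniform4fv(deadColorLocation, dead);
}
```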
This is again pretty simple, just pulling some values out of the CSS properties, parsing them, and then updating WebGL's parameters. This gets run once on page initialization, and then once each time the "Appearance" button in the page navigation is clicked and the page switches between light and dark theme.
That's pretty much it for the drawing, so let's talk about how I made the actual game update tick faster now.
Optimizing the actual WebAssembly code
So I don't know that much about how WebAssembly execution is implemented in browsers, but I know enough about actual CPU architectures to make some reasonable inferences.
As I started thinking about more significant algorithm changes I could make, I suspected that the largest contributor to the time taken in my old algorithm was that accessing WebAssembly linear memory takes longer than accessing local or global variables. At an implementation level this could be because of how the Wasm memory relates to the CPU cache, but I didn't really care about that - I was just curious to see if reducing the number of memory operations my algorithm made could provide more of a speed-up.
I knew I was already pretty inefficient in this area - my previous algorithm made 10 memory accesses per cell (8 to check neighbor states, 1 to check the current cell's state, and 1 to store the updated cell's state), so I suspected I could get that much lower.
Reducing memory accesses
I started by dividing up my $getNewValueAtPosition function into a $getNeighborCount and a $getOutcomeFromCountAndCurrent, so that I could have different logic for counting the number of alive neighbors on border cells and internal cells while still reusing the logic for actually figuring out what happens to a cell.
Then the $getNeighborCount function became $getNeighborCountBoundary and $getNeighborCountCenter - because for optimizing memory access, I didn't want to have to worry about a multi-byte read landing partially outside the board. I decided the old algorithm would work well enough for the boundary cells for that reason.
So my updated $getNewValueAtPosition function looks like this:
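Or at least, roughly like this - the $boardHeight global and the exact instruction ordering here are approximations, but this is the shape of it:

```wat
;; Sketch: names like $boardHeight and the exact ordering are approximate.
(func $getNewValueAtPosition (param $x i32) (param $y i32) (result i32)
  (if (result i32 i32)
    ;; is the cell on any of the four boundaries?
    (i32.or
      (i32.or
        (i32.eqz (local.get $x))
        (i32.eqz (local.get $y)))
      (i32.or
        (i32.eq (local.get $x)
                (i32.sub (global.get $boardWidth) (i32.const 1)))
        (i32.eq (local.get $y)
                (i32.sub (global.get $boardHeight) (i32.const 1)))))
    (then (call $getNeighborCountBoundary (local.get $x) (local.get $y)))
    (else (call $getNeighborCountCenter (local.get $x) (local.get $y))))
  ;; both calls leave (count, currentState) on the stack
  (call $getOutcomeFromCountAndCurrent))
```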
It first checks to see if the cell is along any of the four boundaries, and if so calls $getNeighborCountBoundary. If not, it calls $getNeighborCountCenter. Either of these is expected to return two values, which are passed into $getOutcomeFromCountAndCurrent.
You might be wondering why a neighbor count function would return two values, and that's because it turns out it's more efficient to retrieve the current cell's state along with the neighbor states (as we'll see in a bit) - so the function names are admittedly a bit of a misnomer: they get the count of alive neighbors and whether the current cell is alive as well.
So the $getNeighborCountBoundary function is the exact same as what I used for every cell last time, but let's look at that new $getNeighborCountCenter function:
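Roughly, and assuming the board starts at offset 0 in memory with $boardWidth holding the row width in bytes, it's shaped like this:

```wat
;; Sketch - assumes the board starts at offset 0 and $boardWidth is the
;; row width in bytes (one byte per cell).
(func $getNeighborCountCenter (param $x i32) (param $y i32) (result i32 i32)
  (local $rowAbove i32)
  (local $rowCenter i32)
  (local $rowBelow i32)
  (local $base i32)
  ;; address of the cell one to the left: y * boardWidth + (x - 1)
  (local.set $base
    (i32.add
      (i32.mul (local.get $y) (global.get $boardWidth))
      (i32.sub (local.get $x) (i32.const 1))))
  ;; one 32-bit load per row; the three cells we care about land in the
  ;; low three bytes of each value
  (local.set $rowAbove
    (i32.load (i32.sub (local.get $base) (global.get $boardWidth))))
  (local.set $rowCenter (i32.load (local.get $base)))
  (local.set $rowBelow
    (i32.load (i32.add (local.get $base) (global.get $boardWidth))))
  ;; every live cell is a 0x01 byte, so popcnt of the masked rows counts
  ;; them directly (the center mask skips the middle byte - that's us)
  (i32.add
    (i32.add
      (i32.popcnt (i32.and (local.get $rowAbove) (i32.const 0x00FFFFFF)))
      (i32.popcnt (i32.and (local.get $rowCenter) (i32.const 0x00FF00FF))))
    (i32.popcnt (i32.and (local.get $rowBelow) (i32.const 0x00FFFFFF))))
  ;; second return value: the current cell, from the byte we already loaded
  (i32.and (i32.shr_u (local.get $rowCenter) (i32.const 8)) (i32.const 1)))
```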
The first major change to notice is that I'm using the regular i32.load instead of the i32.load8_u instruction I was using before. This instruction pulls back four bytes (32 bits) at a time, and since the cells are in consecutive bytes within each row, that means I just need one load instruction per row.
Each row is loaded after computing its memory offset using the $boardWidth variable, and once loaded, the cells I'm looking for should be in the three least significant bytes (as multi-byte reads in WASM are little-endian).
Following the load calls I can use some bitwise operations to actually isolate and count the live cell bits, and then I use another bitwise AND to take the already-retrieved value for the center row and isolate the current cell's state. So overall, that's three memory reads per cell (and there will be one more to write it back, so four total per cell).
One last micro-optimization
Lastly, let's take a quick look at that new $getOutcomeFromCountAndCurrent function. Of all the smaller changes I made, this is the one that actually seemed to make the most difference, and it relies on reducing the amount of branching there is in the cell determination.
First, here's what this block looked like in the last post for comparison:
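Reconstructing it roughly (with $count already holding the live-neighbor count, and the cell coordinates still in $x and $y), it was shaped like this:

```wat
;; Roughly the part 1 logic, condensed: three branches on the neighbor
;; count, plus a fresh memory read to fetch the current cell's state.
(if (i32.lt_u (local.get $count) (i32.const 2))
  (then (return (i32.const 0))))   ;; underpopulation: dies
(if (i32.gt_u (local.get $count) (i32.const 3))
  (then (return (i32.const 0))))   ;; overpopulation: dies
(if (i32.eq (local.get $count) (i32.const 3))
  (then (return (i32.const 1))))   ;; exactly three neighbors: alive
;; exactly two neighbors: the cell keeps its current state
(return
  (i32.load8_u
    (i32.add
      (i32.mul (local.get $y) (global.get $boardWidth))
      (local.get $x))))
```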
It's certainly not bad - I mean, there are three different if branches, and an extra memory access here to look up the current state even though we could have already gotten that before... but it's readable, right?
Now let's look at the new function which replaced this block:
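In rough form (parameter names approximate), the whole thing boils down to a single select:

```wat
(func $getOutcomeFromCountAndCurrent
      (param $count i32) (param $current i32) (result i32)
  ;; with exactly two neighbors the cell keeps its current state;
  ;; otherwise it's alive only with exactly three neighbors - no branches
  (select
    (local.get $current)
    (i32.eq (local.get $count) (i32.const 3))
    (i32.eq (local.get $count) (i32.const 2))))
```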
Yeah, okay - no matter how much I think my first implementation was fine, I can't really defend it when the alternative is half as many instructions, several of them are constant declarations, and it has literally no branching.
I wish I had known about the i32.popcnt and select instructions before; they're incredibly convenient for stuff like this. This change wasn't nearly as significant as the memory access reduction before it, but contributed about a 5ms reduction to my frame times overall.
What's next?
I will probably be continuing with this project again soon. Maybe not as the immediate next post I write, but I am planning on trying a few more significant optimizations.
The most notable one is that I'd still like to bit-pack the board, and update my main cell loop to update several cells at once. Off the top of my head, it seems like I could use this to extend my 3-read/1-write update strategy to do up to 16 cells at once (since I need one more cell to either side, and memory addresses must be byte-aligned).
I also want to experiment with tracking the active regions of the board, as an area where nothing changed in the last frame (and nothing changed around it) won't change in the current frame either. Since I'm starting with an entirely random board I don't know how much performance that will get me, but it's worth a try.
Anyways, that's all for now - I hope you enjoyed this continued exploration of by-hand WebAssembly programming.