Joachim Breitner

Winter is coming even more quickly

Published 2019-11-17 in sections English, Haskell.

TL;DR: I explain how I improved the performance of an interpreter for WebAssembly written in Haskell by plucking some low-hanging fruit.

Background

Motivated by my work at the DFINITY Foundation, I was looking into interpreters for WebAssembly written in Haskell, and found my colleague John Wiegley’s winter: A straight-forward port of the WebAssembly reference interpreter, written in Ocaml by Andreas Rossberg (another colleague of mine … I guess there is a common theme here.)

Obviously, an interpreter will never be as fast as a real compiler implementation (such as the one in v8, lucet or wasmtime). But for my purposes an interpreter is fine. Nevertheless, I don’t want it to be needlessly slow, and when I picked up wasm, it was clear that I had to work at least a little bit on performance.

None of this is to blame John, of course: I have no doubt that he could have sped it up at least as well! I assume it just wasn't a priority at that time, and winter is a nicely done piece of Haskell.

A graph

I like statistics and graphs! In this blog post, I am always running the following very simple C program:

unsigned int i;
void init () {
  i = 1024*1024;
  while (i--) {};
}

which I compile using clang version 9 and run with wasm-invoke from the winter package (compiled with GHC-8.6.5):

$ clang-9 --target=wasm32-unknown-unknown-wasm -fno-builtin -ffreestanding loop.c --compile
$ wasm-ld-9 loop.o --gc-sections --export init --no-entry -o loop.wasm
$ wasm-invoke -w loop.wasm -f init +RTS -t

The +RTS t tells the Haskell runtime to produce a line with some statistics, of which I collect

  • total number of byte allocated
  • maximum memory size used and
  • total time elapsed

To reduce the noise a little bit, I run the program five times, and take the minimum of all the runs.

I did this for all the commits I produce while trying to make the code faster. This yields the following graph:

The fruit of my labor

The fruit of my labor

Note that I had to draw this with a logarithmic y axis, because some of the improvements were so drastic that a linear graph would not be helpful.

But how did I get there? I will take you through a guided tour through all the commits that have improved performance, explain how I found an performance issue, what I did to improve things and why that improved things. To keep this post at a reasonable size, I split it into a series of blog post, faithful to the chronological order in which I performed them:

Here they are:

  1. Vectors
  2. SPECIALIZE
  3. Difference lists
  4. Export lists
  5. Eta-Expanding ReaderT
  6. Simpler code
  7. A Zipper
  8. Statistics (the making-of)

Was it worth it? I’d say so! Initially, running the program above would take 44 minutes (slow!) and continuously grows to require 6,7 GB of memory (Houston, we have a space leak!).

After all the steps above, memory consumption is constant and below 2MB, and the program finishes in 3,6s. Overall, this yields the following, very satisfying, statistics:

Improvement: Allocations: -99.60% Memory: -99.97% Time: -99.86% (Commits de9f4f8...5406efd)

By the way, the Ocaml reference interpreter takes 2,92s to run this (and it does a bit more, such as validation of the module). So there is still room for improvement…

Comments

Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.