Joachim Breitner

Faster Winter: Statistics (the making-of)

Published 2019-11-24 in sections English, Haskell.

(This is an appendix to the “faster winter” series, please see that post for background information.)

Did you like the graph and the stats that I produced? Just for completeness, I am including the various scripts I used. Nothing super exciting to see here, but maybe someone finds this useful.

This little shell one-liner collects the run-time statistics for each commit in the interesting range (line-wrapped for your convenience):

for h in $(git log 1cea7652f48fad348af914cb6a56b39f8dd99c6a^..5406efd9e057aebdcf94d14b1bc6b5469454faf3 --format=%H)
do
  echo -n "$h"
  git checkout -q "$h"
  cabal new-build -v0
  echo -n ":"
  rm -f stats/$h.txt
  for i in $(seq 1 5)
  do
    cabal -v0 new-run exe:wasm-invoke -- -w loop.wasm  -f canister_init +RTS -t >/dev/null 2>> stats/$h.txt
    echo -n .
  done
  echo
done

A small Perl script takes the minimum for each measurement across the five runs, and produces a CSV file:

#!/usr/bin/perl

use List::Util qw(min);

my @alloc;
my @in_use;
my @time;

while (<>) {
  m!<<ghc: (\d+) bytes, \d+ GCs, \d+/\d+ avg/max bytes residency \(\d+ samples\), (\d+)M in use, [\d.]+ INIT \(([\d.]+) elapsed\), [\d.]+ MUT \(([\d.]+) elapsed\), [\d.]+ GC \(([\d.]+) elapsed\) :ghc>>! or die $!;
  push @alloc, 0+$1;
  push @in_use, $2;
  push @time, $3+$4+$5;
}

printf "%d;%d;%f\n", min(@alloc), min(@in_use), min(@time);

To create a full CSV file for all the commits in the range that have stats files, I used this bash one-liner (again line-wrapped for your convenience):

echo 'commit;allocations;memory;time' > stats.csv
for h in $(git log 1cea7652f48fad348af914cb6a56b39f8dd99c6a^..5406efd9e057aebdcf94d14b1bc6b5469454faf3 --format=%H|tac)
do
  git log -n 1 --oneline $h
  test -f stats/$h.txt && echo "$(echo $h|cut -c-7);$(./tally.pl < stats/$h.txt)" | tee -a stats.csv
done

The stats can be turned into the graph using pgfplots by compiling this LaTeX file:

\documentclass[class=minimal]{standalone}
\usepackage{mathpazo}
\usepackage{pgfplots}
\definecolor{skyblue1}{rgb}{0.447,0.624,0.812}
\definecolor{scarletred1}{rgb}{0.937,0.161,0.161}
\pgfplotsset{width=12cm,compat=newest}

% From https://tex.stackexchange.com/a/63340/15107
\makeatletter
\pgfplotsset{
    /pgfplots/flexible xticklabels from table/.code n args={3}{%
        \pgfplotstableread[#3]{#1}\coordinate@table
        \pgfplotstablegetcolumn{#2}\of{\coordinate@table}\to\pgfplots@xticklabels
        \let\pgfplots@xticklabel=\pgfplots@user@ticklabel@list@x
    }
}
\makeatother

\begin{document}
\begin{tikzpicture}

\pgfplotsset{every axis/.style={ymin=0}}
\begin{semilogyaxis}[
  skyblue1,
  scale only axis,
  axis y line*=left,
  ylabel=Allocation (bytes),
  flexible xticklabels from table={stats.csv}{[index]0}{col sep=semicolon},
  xticklabel style={rotate=90, anchor=east, text height=1.5ex, font=\ttfamily, color=black},
  xtick=data,
  ]
\addplot[const plot mark mid, color=skyblue1]
  table [x expr=\coordindex+1, y index=1, col sep=semicolon] {stats.csv};
\end{semilogyaxis}

\begin{semilogyaxis}[
  green,
  scale only axis,
  axis y line*=right,
  ylabel=Memory (MB),
  x tick style={draw=none},
  xtick=\empty,
  ]
\addplot[const plot mark mid, color=green]
  table [x expr=\coordindex+1, y index=2, col sep=semicolon] {stats.csv};
\end{semilogyaxis}


\begin{semilogyaxis}[
  red,
  scale only axis,
  axis y line*=right,
  ylabel=Time (seconds),
  x tick style={draw=none},
  xtick=\empty,
  ]
\pgfplotsset{every outer y axis line/.style={xshift=2cm}, every tick/.style={xshift=2cm}, every y tick label/.style={xshift=2cm} }
\addplot[const plot mark mid, color=red]
  table [x expr=\coordindex+1, y index=3, col sep=semicolon] {stats.csv};
\end{semilogyaxis}
\end{tikzpicture}
\end{document}

And finally this Perl script allows me to paste any two lines from the CSV file and produces appropriate Markdown for the “improvement” lines in my posts:

#!/usr/bin/perl

my $first = 1;

my $commit;
my $alloc;
my $in_use;
my $time;

while (<>) {
  /(.*);(.*);(.*);(.*)/ or die;
  unless ($first) {
    printf "**Improvement**: Allocations: %+.2f%%  Memory: %+.2f%%  Time: %+.2f%% (Commit [%s...%s](http://github.com/dfinity/winter/compare/%s...%s))\n",
      (100 * ($2/$alloc - 1)),
      (100 * ($3/$in_use - 1)),
      (100 * ($4/$time - 1)),
      $commit,
      $1,
      $commit,
      $1;
  }
  $first = 0;
  $commit = $1;
  $alloc = $2;
  $in_use = $3;
  $time = $4;
}

Faster Winter 7: The Zipper

Published 2019-11-24 in sections English, Haskell.

(This is the seventh optimization presented in the “faster winter” series, please see that post for background information.)

The last bit of performance could be considered a domain-specific optimization, as one might describe it as “introducing a control stack to the interpreter”. But in a different light, one could consider it the introduction of a more appropriate data structure, by applying a “zipper”-like transformation to the existing data structure.

The problem is that the state of the system (datatype Code) is a stack of values and a stack of instructions. Simplifying this for this post, we might have

data Code = Code [Value] [Instr]
data Instr
  = Const Value | Add | Sub | Return
  | Labeled Int Code

The interpreter gets such a Code, but does not always just execute the top-most instruction: If that is a Labeled instruction, it has to execute the next instruction in the argument of that Labeled. (It is not very important at this point to understand what a Labeled is used for). So we might have a Code that looks like the following:

c1 = Code [] [Labeled 1 (Code [2] [Labeled 2 (Code [3,4] [Add]), Sub]), Return]

The next instruction to execute is actually the Add. But in order to find that, the function step has to look under the first Labeled, look under the second Labeled, then execute step (Code [3,4] [Add]) = Code [7] [], and finally wrap it again in the two Labeled, to produce:

c2 = Code [] [Labeled 1 (Code [2] [Labeled 2 (Code [7] []), Sub]), Return]

Then eval calls step again, and step has to look inside the Labeled again to find the next instruction to execute.

It would be easier if the next instruction to execute were presented to step more conveniently, right as a field of Code. But in order to do this, we have to move the Labeled entries “out of the way”. I do that by adding a new, first parameter to Code where I keep a stack of all the Labeled constructors that were in the way, in reverse order. So the c1 above would now be

data Code = Code Control [Value] [Instr]
data Control = Empty | Labeled Int Code

c1' = Code (Labeled 2 (Code (Labeled 1 (Code Empty [] [Return])) [2] [Sub])) [3,4] [Add]

Can you see how this relates to c1 above? The important piece here is that the interpreter now always finds the next instruction to execute at the head of the instruction list of the outermost Code, and as long as there is something to execute there, it doesn't have to touch the Control stack at all.
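
To make this more concrete, here is a minimal, self-contained sketch of such a step function (with Value shrunk to Int, Return left out, and made-up semantics for popping a Labeled; the real code in winter is more involved):

type Value = Int

data Instr = Const Value | Add | Sub
data Code = Code Control [Value] [Instr]
data Control = Empty | Labeled Int Code

step :: Code -> Maybe Code
-- common case: the next instruction is the head of the outermost
-- instruction list; the Control stack is not touched at all
step (Code ctrl vs (i : is)) = Just (Code ctrl vs' is)
  where
    vs' = case (i, vs) of
      (Const v, _)        -> v : vs
      (Add, a : b : rest) -> a + b : rest
      (Sub, a : b : rest) -> a - b : rest
      _                   -> error "stack underflow"
-- instruction list exhausted: pop one Labeled frame off the Control
-- stack and continue with the code that was waiting behind it
step (Code (Labeled _ (Code ctrl vs' is)) vs []) =
  Just (Code ctrl (vs ++ vs') is)
-- Control stack empty as well: evaluation is done
step (Code Empty _ []) = Nothing

eval :: Code -> [Value]
eval c@(Code _ vs _) = maybe vs eval (step c)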

This required touching some more lines of code than the previous optimizations, but doing so didn't require much creativity, as the old and new Code types are in clear correspondence, and that guided me in how to adjust the code. And it’s certainly worth it:

Improvement: Allocations: -46.17% Time: -36.88% (Commit e66f1e0...57f3e9d)

Faster Winter 6: Simpler Code

Published 2019-11-24 in sections English, Haskell.

(This is the sixth optimization presented in the “faster winter” series, please see that post for background information.)

The Wasm reference interpreter has a function step that performs one single step of evaluation, by taking the state before (in a type called code), and returning the state after the step. The function eval calls step, checks if the result is “done” (no instructions left to do), and if it is not done, recursively calls eval again. This way, evaluation progresses step by step until it is done.

The Haskell port follows the same pattern of a single-step step and a driver eval, but it chose to write the code in continuation-passing style: Instead of returning the result, the function step takes a function that it passes the result to. So the code looks like this (slightly simplified):

type CEvalT m a = ReaderT Config (ExceptT EvalError m) a

step :: PrimMonad m => Code -> (Code -> CEvalT m r) -> CEvalT m r
step c k =  … k new_c …

eval :: PrimMonad m => Code -> CEvalT m (Stack Value)
eval c =
    if is_done c
    then stack c
    else step c eval

There must have been a good reason to prefer this style over the plain style, but I was curious if it was really helpful. So I changed it to the following, more straight-forward code:

type CEvalT m a = ReaderT Config (ExceptT EvalError m) a

step :: PrimMonad m => Code -> CEvalT m Code
step c =  … new_c …

eval :: PrimMonad m => Code -> CEvalT m (Stack Value)
eval c =
    if is_done c
    then stack c
    else step c >>= eval

And indeed, the simpler code worked better:

Improvement: Allocations: -9.6% Time: -16.91% (Commit eb8add3...e66f1e0)

Again, avoiding function values (as we construct them as the continuation closure) might have helped here.

Or maybe the simpler code allowed GHC to notice that the Code value, which is simply a tuple with a different name, is constructed by step but immediately deconstructed by eval, and thus GHC could optimize that away (by “unboxing” the argument and/or result of step and simply passing the components of the tuple).
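
Very much simplified, such a worker/wrapper split could look like this (a hand-written sketch with stand-in types, not actual GHC output):

data Code = Code [Int] [Int]  -- stand-in for the real Code type

-- the wrapper re-boxes the result; once it is inlined into eval,
-- where the Code is immediately taken apart again, the constructor
-- can cancel out
step :: Code -> Code
step (Code vs is) = case stepW vs is of (vs', is') -> Code vs' is'

-- the worker operates on the components directly
stepW :: [Int] -> [Int] -> ([Int], [Int])
stepW vs (i : is) = (i : vs, is)
stepW vs []       = (vs, [])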

And finally, independent of all the performance questions, it also makes the code easier to understand.

Faster Winter 5: Eta-Expanding ReaderT

Published 2019-11-22 in sections English, Haskell.

(This is the fifth optimization presented in the “faster winter” series, please see that post for background information.)

Another good approach to performance tuning is to look at the code after GHC optimized it. So I planted a

{-# OPTIONS_GHC -ddump-simpl -dsuppress-coercions -dsuppress-unfoldings -dsuppress-module-prefixes #-}

at the top of Eval.hs, and looked through the code. This is not Haskell, but rather “GHC Core”, a simpler functional programming language that GHC uses internally. It is more verbose and less pretty, so it takes a bit of getting used to, but it’s nothing to be afraid of when you are a somewhat experienced Haskell developer.

There I found (much simplified) the following pattern:

step :: Instr -> Config -> IO Result
step e = case e of
  Add -> \c -> do stuff …
  Del -> \c -> do stuff …
  …

That’s bad! Yes, Haskell is a functional language, and passing around anonymous functions is very nice to write expressive code, and for most purposes it is not too slow … but in an inner loop, you really don’t want any such closures. So where did this come from? As expected, the Haskell source did not have those inner lambdas. Instead, it was using a very innocent-looking monad transformer:

step :: Instr -> ReaderT Config IO Result
step e = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

A ReaderT r m a is just a different way of writing r -> m a that allows us to use do-notation or the monad combinators without having to pass the r around explicitly, and as such it is indeed very convenient. But not as efficient as if we had written

step :: Instr -> Config -> IO Result
step e c = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

where the step function takes two arguments right away, and no anonymous functions are created.

Why doesn’t our amazing Haskell compiler figure out that this would be better? Because it is not better in all situations: If we store step e :: ReaderT Config IO Result somewhere and use it many times, with the same e but passing many different c :: Config, then we have to do the case e analysis only once. This can sometimes be better, so the compiler has to leave the code in that form, in case we did it intentionally.
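
Here is a toy example (made up, nothing to do with winter) of a situation where the un-eta-expanded shape is genuinely better: the case analysis on op happens once, and the returned closure is then reused a million times.

compiled :: Char -> (Int -> Int)
compiled op = case op of  -- analysed once …
  '+' -> \x -> x + 1
  '*' -> \x -> x * 2
  _   -> id

main :: IO ()
main = print (sum (map (compiled '+') [1 .. 1000000 :: Int]))
-- … but the closure (compiled '+') is applied a million times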

(Incidentally, the question of how to allow the compiler to eta-expand more functions seems to eternally haunt me, and its pursuit even led to a PhD thesis.)

So how can we fix it? One relatively crude way is to shove it into the compiler’s face that we really want step to be a function with two parameters, by wrapping the whole body in, well, a lambda. But we still want to use the Reader monad in the body of step.

So I came up with this:

step :: Instr -> ReaderT Config IO Result
step e = ReaderT $ \c -> ($ c) $ runReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

Now the \c -> is outside the case, the compiler adds it to the arguments of step and we get the code that we want (confirmed by a quick peek at the Core).

Improvement: Allocations: -23.20% Time: -23.00% (Commit f5a0dd2...894070f)

I used this pattern in more than once place, so I wanted to abstract it into a little helper definition. But that’s not so easy: If I just write

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT m = ReaderT $ \c -> ($ c) $ runReaderT m

step :: Instr -> ReaderT Config IO Result
step e = etaReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

then the whole thing doesn't work any more! Because now, the case e is again “outside” the \c ->.

I whined on twitter about this, and Sebastian Graf reminded me helpfully of GHC.Exts.oneShot, a little magic function that was added to GHC 5 years ago … by some forgetful person: me.

If we use this in the right place inside etaReaderT it tells GHC in a soothing voice “hey! it’s ok! you can move this lambda out of cases. believe me. it’s gonna be ok”. And with this, it works:

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT = ReaderT . oneShot . runReaderT

I wonder if this function would make a good addition to Control.Monad.Trans.Reader.

Incidentally, if you look at the code at the end of all my optimizations, there is no mention of etaReaderT any more: Subsequent optimizations simplified the code so much that eventually GHC was able to do this transformation without my help.

Faster Winter 4: Export lists

Published 2019-11-21 in sections English, Haskell.

(This is the fourth optimization presented in the “faster winter” series, please see that post for background information.)

This is a funny one: You can speed up your code by adding export lists to your modules!

Why is that so?

Without an export list, the compiler has to assume that every top-level function can possibly be called from the outside, even functions that you think of as “internal”. If you have a function that you do not export, like instr, step_work and step after my change, the compiler can see all the places the function is called. If the function is only called in one place, it may inline it (copy its definition into where it is called), and simplify the code around the edges. And even if it does not inline the function, it might learn something about how the functions are used, and optimize them based on that (e.g. based on Demand Analysis).
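
As a toy example (module and function names made up, not winter’s): with the export list below, step is known to be module-internal, so GHC sees every call site and is free to inline it or to specialize it to how it is actually used.

module Eval (eval) where  -- only eval is part of the interface

-- not exported: the compiler sees all call sites of step
step :: Int -> Int -> Int
step acc x = acc + x

eval :: [Int] -> Int
eval = foldl step 0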

Improvement: Allocations: -22.59% Memory: +0.00% Time: -11.79% (Commit bbe8af7...6f2ba09)

Besides being a performance win, an explicit export list is also very helpful to those who want to read or edit your code: they can refactor with greater ease while maintaining the external interface of the module.

Faster Winter 3: Difference Lists

Published 2019-11-20 in sections English, Haskell.

(This is the third optimization presented in the “faster winter” series, please see that post for background information.)

Today, we will tackle the big bad memory leak that eats up my laptop’s memory.

A quick look at the heap profile (+RTS -h) showed that the memory was filling up with lists. Not very helpful, as lists are everywhere. So I looked through the hot code of the interpreter, eval, step and instr in Wasm.Exec.Eval to see if anything fishy is going on. I found some uses of the list concatenation operator (++) – always a bad sign, as it has to traverse the list on its left completely!

And often the solution is pretty simple: Use difference lists! It’s even simpler than the name makes it sound. It does not require you to import anything new, and it works well everywhere you assemble a list in multiple stages, but use it, in its full form, only once at the end. The trick is easy:

  • In the types, replace [t] with [t] -> [t] (or introduce an alias type DList a = [a] -> [a])
  • Replace [] with id
  • Replace [x] with (x:)
  • Replace xs ++ ys with xs . ys, if ys is also a difference list
  • Replace xs ++ ys with xs ys, if ys is a normal list
  • To turn the difference list xs into a normal list, write xs []

That’s it, and suddenly you have a list-like data structure with constant-time concatenation!
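
Here is a small, self-contained illustration of this recipe (the DList alias and the helper names are mine, not winter’s):

type DList a = [a] -> [a]

fromList :: [a] -> DList a
fromList xs = (xs ++)

toList :: DList a -> [a]
toList xs = xs []

-- building a list in stages: every (.) is O(1), and the full list is
-- only materialized once, at the very end
example :: [Int]
example = toList ((1:) . fromList [2,3] . (4:))  -- [1,2,3,4]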

Improvement: Allocations: -9.67% Memory: -99.08% Time: -47.36% (Commit 2e284f8...f9bbe8e)

Memory consumption is now at 60MB. I found some more places where I could use difference lists, and then it was finally at essentially zero, which is what you would expect for the program at hand:

Improvement: Allocations: -0.21% Memory: -96.83% Time: -2.91% (Commit f9bbe8e...bbe8af7)

To be honest, I am not actually sure why this fixed a space leak: Difference lists are not more memory efficient than normal lists! But maybe somehow these lists were more strictly evaluated once they were difference lists? Anyways, I was happy with the result and did not investigate further.

Faster Winter 2: SPECIALIZE

Published 2019-11-18 in sections English, Haskell.

(This is the second optimization presented in the “faster winter” series, please see that post for background information.)

The evaluator of the WebAssembly interpreter is highly polymorphic, as you can tell from the type signature of its main function:

type CEvalT f m a = ReaderT (Config f m) (ExceptT EvalError m) a
eval :: (Regioned f, MonadRef m, Show1 f)
     => Code f m -> CEvalT f m (Stack Value)

This means the caller can choose whatever monad m this should run in. This is very flexible, but it is also really hard to make it run fast: Every line in do notation is a call to the monadic bind operator (>>=), and since this operator is provided by the caller, the code of eval has to make many calls to this unknown function.

GHC can do a much better job at optimizing this code if it knows which concrete monad is used, so that it knows the actual definition of that monad’s (>>=). Because then the compiler can inline it and make the whole monad overhead go away.
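
To see the mechanism in isolation, here is a toy example (made up, not from winter): sumM is polymorphic in m, so every (>>=) in its body is a call to an unknown function, and the pragma asks GHC to additionally compile a copy with m instantiated to IO, where IO’s bind can be inlined:

sumM :: Monad m => [m Int] -> m Int
sumM [] = return 0
sumM (a : as) = do
  x <- a          -- each line of do notation is a call to (>>=)
  xs <- sumM as
  return (x + xs)
{-# SPECIALIZE sumM :: [IO Int] -> IO Int #-}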

The wasm-invoke program actually uses the plain IO monad for m. We could now rewrite the whole module to not be polymorphic in m, but use IO instead. But that’s a lot of work and some user might genuinely want a different monad (e.g. ST s, or something else).

Luckily we can tell the compiler: “Let’s keep eval with this general type signature, but please also provide a version specialized for IO, and use that if possible”, by giving a SPECIALIZE pragma:

{-# SPECIALIZE eval ::
       Code Identity IO -> CEvalT Identity IO (Stack Value) #-}

This runs much faster now:

Improvement: Allocations: -48.65% Memory: +52.56% Time: -50.29% (Commit 4061fe6...2e284f8)

Again, as when we introduced mutable vectors, memory usage increases a bit, likely due to more stuff happening between garbage collections.

The code runs faster, but memory consumption is still absurd. So it is high time to hunt down the space leak, as the interpreter should not need any significant amounts of memory, and instead execute the program using a constant and low amount of memory. More about that in the next entry of this series.

Faster Winter 1: Vectors

Published 2019-11-18 in sections English, Haskell.

(This is the first optimization presented in the “faster winter” series, please see that post for background information.)

Immutable vectors

The first, very low-hanging fruit was to be found in the Wasm.Runtime.Memory module. The memory of the Wasm module was defined as follows:

import Data.Vector (Vector)

data MemoryInst m = MemoryInst
  { _miContent :: Mutable m (Vector Word8)
  , _miMax :: Maybe Size
  }

An immutable vector in a mutable box (Mutable m is something like IORef or STRef, whatever is appropriate for the monad m).

This means that every write to the memory has to copy the whole memory into a new vector. This is nice from a semantic point of view (e.g. you can take a snapshot of the memory at a specific point in time, evaluate the code some more, and the immutable vector is untouched), but obviously very inefficient.

Mutable vectors

So I changed it to a mutable vector (MVector) that allows in-place updates:

import Data.Vector (MVector)

data MemoryInst m = MemoryInst
  { _miContent :: Mutable m (MVector (PrimState m) Word8)
  , _miMax :: Maybe Size
  }

I still need to place that mutable vector in a mutable box, because an MVector still has a fixed size, but Wasm programs can resize their memory. This means we still copy the whole memory when the memory grows, but not during normal writes.
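
For illustration, here is a sketch of what growing the memory looks like with this representation (concretizing Mutable m to a MutVar and inventing the function name; this is not winter’s actual code):

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Primitive.MutVar (MutVar, readMutVar, writeMutVar)
import qualified Data.Vector.Mutable as MV
import Data.Word (Word8)

-- Normal writes go through MV.write and are in place; only growing
-- has to allocate a bigger vector and copy the old contents over.
growMemory :: PrimMonad m
           => MutVar (PrimState m) (MV.MVector (PrimState m) Word8)
           -> Int -> m ()
growMemory ref extraBytes = do
  v  <- readMutVar ref
  v' <- MV.grow v extraBytes  -- allocate and copy
  writeMutVar ref v'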

Improvement: Allocations: -97.71% Memory: +16.87% Time: -98.88% (Commit af6adfc...254d700)

This is much more … usable!

Oddly, memory consumption increases, and I am not entirely sure why: My guess is that because it runs faster, the garbage collector runs less often, so the program can allocate more stuff in between garbage collector runs.

Unboxed vectors

In a second step I also replaced the vector with an unboxed vector: Above we have the normal Data.Vector vector, which stores pointers (8 bytes!) to elements, even if the element is a single byte like here. By importing Data.Vector.Unboxed (no other change!) I get a nicely packed array of bytes.

Interestingly, while it improved maximum memory usage quite a bit, it looks like it made the program actually a bit slower – potentially because it now had to box and unbox these Word8 values.

Improvement: Memory: -27.04% Time: +15.82% (4061fe6)

There may be more improvements possible here (e.g. when writing a Word32, maybe that can be done in one go, instead of writing the four bytes separately), but I did not follow this clue any further for now.

Instead, the next thing that looked like it was worth investigating was overly polymorphic code.

Winter is coming even more quickly

Published 2019-11-17 in sections English, Haskell.

TL;DR: I explain how I improved the performance of an interpreter for WebAssembly written in Haskell by plucking some low-hanging fruit.

Background

Motivated by my work at the DFINITY Foundation, I was looking into interpreters for WebAssembly written in Haskell, and found my colleague John Wiegley’s winter: A straight-forward port of the WebAssembly reference interpreter, written in OCaml by Andreas Rossberg (another colleague of mine … I guess there is a common theme here.)

Obviously, an interpreter will never be as fast as a real compiler implementation (such as the one in v8, lucet or wasmtime). But for my purposes an interpreter is fine. Nevertheless, I don’t want it to be needlessly slow, and when I picked up winter, it was clear that I had to work at least a little bit on performance.

None of this is to blame John, of course: I have no doubt that he could have sped it up at least as well! I assume it just wasn't a priority at that time, and winter is a nicely done piece of Haskell.

A graph

I like statistics and graphs! In this blog post, I am always running the following very simple C program:

unsigned int i;
void init () {
  i = 1024*1024;
  while (i--) {};
}

which I compile using clang version 9 and run with wasm-invoke from the winter package (compiled with GHC-8.6.5):

$ clang-9 --target=wasm32-unknown-unknown-wasm -fno-builtin -ffreestanding loop.c --compile
$ wasm-ld-9 loop.o --gc-sections --export init --no-entry -o loop.wasm
$ wasm-invoke -w loop.wasm -f init +RTS -t

The +RTS -t flag tells the Haskell runtime to produce a line with some statistics, of which I collect

  • total number of bytes allocated
  • maximum memory size used and
  • total time elapsed

To reduce the noise a little bit, I run the program five times, and take the minimum of all the runs.

I did this for all the commits I produced while trying to make the code faster. This yields the following graph:

The fruit of my labor

Note that I had to draw this with a logarithmic y axis, because some of the improvements were so drastic that a linear graph would not be helpful.

But how did I get there? I will take you on a guided tour through all the commits that improved performance, explain how I found each performance issue, what I did to improve things, and why that improved things. To keep this post at a reasonable size, I split it into a series of blog posts, faithful to the chronological order in which I performed them.

Here they are:

  1. Vectors
  2. SPECIALIZE
  3. Difference lists
  4. Export lists
  5. Eta-Expanding ReaderT
  6. Simpler code
  7. A Zipper
  8. Statistics (the making-of)

Was it worth it? I’d say so! Initially, running the program above would take 44 minutes (slow!) and would continuously grow to require 6.7 GB of memory (Houston, we have a space leak!).

After all the steps above, memory consumption is constant and below 2MB, and the program finishes in 3.6s. Overall, this yields the following, very satisfying, statistics:

Improvement: Allocations: -99.60% Memory: -99.97% Time: -99.86% (Commits de9f4f8...5406efd)

By the way, the OCaml reference interpreter takes 2.92s to run this (and it does a bit more, such as validation of the module). So there is still room for improvement…

ICFP 2019

Published 2019-08-24 in sections English, Haskell.

ICFP 2019 in Berlin ended yesterday, and it was – as always – a great pleasure. This year was particularly noteworthy for the quite affordable conference hotel and the absolutely amazing food during the coffee breaks.

Since I am no longer a proper academic, I unsurprisingly did not have real research to present. Luckily I found ways to not just be a passive participant this year:

  • At FARM, I presented Kaleidogen, a small game (or toy, some would say) of mine. The room was packed with people, so thanks for all your interest! If you missed it, you can soon see the recording or read the demo abstract.

  • At PLMW, the mentoring workshop for young researchers, I ran the “Social event” together with Niki Vazou. Like last year, we randomly grouped the students and held a little competition where they had to match program listings to languages and algorithms. This was great fun, and we even managed to solve the sudden problem of two ties in an ad-hoc extra quiz.

  • During his “State of GHC” speech, Simon Peyton Jones asked me to speak about the GHC Proposal Process for a few slides.

  • And since that is not enough stage time, I secured two spots in local stand-up comedy open mics on Monday and Friday, and even dragged sizable crowds of ICFP participants to these venues. One was a boat, and the other one a pretty dodgy bar in Neukölln, so that alone was a memorable experience. And the host was visibly surprised when his joke “I couldn’t be a software developer – I can’t commit” was met by such a roaring response…

Anyways, ICFP is over, back to disappearing in the churn of everyday work, and I hope to see you all next year.