Joachim Breitner

Blog

A Telegram bot in Haskell on Amazon Lambda

Published 2020-04-06 in sections English, Haskell.

I just had a weekend full of very successful serious geekery. On a whim I thought: “Wouldn’t it be nice if people could interact with my game Kaleidogen also via a Telegram bot?” This led me to learn how to write a Telegram bot in Haskell and how to deploy such a Haskell program to Amazon Lambda. In particular the latter bit might be interesting to some of my readers, so here is how I went about it.

Kaleidogen

Kaleidogen is a little contemplative game (or toy) where, starting from just unicolored disks, you combine abstract circular patterns to breed more interesting patterns. See my FARM 2019 talk for more details, or check out the source repository. BTW, I am looking for help turning it into an Android app!

KaleidogenBot in action

Amazon Lambda

Amazon Lambda is the “Function as a service” offering of Amazon Web Services. The idea is that you don’t rent a server that you pay for constantly and whose whole system you have to manage; instead, you just upload the code that responds to outside requests, and AWS takes care of the rest: starting and stopping instances, providing a secure base system, etc. When nobody is using the service, no costs occur.

This sounds ideal for hosting a toy Telegram bot: Most of the time nobody will be using it, and I really don't want to have to babysit yet another service on my server. On Amazon Lambda, I can probably just forget about it.

But Haskell is not one of the officially supported languages on Amazon Lambda. So to run Haskell on Lambda, one has to solve two problems:

  • how to invoke the Haskell code on the server, and
  • how to build Haskell so that it runs on the Amazon Linux distribution.

A Haskell runtime for Lambda

For the first we need a custom runtime. While this sounds complicated, it is actually a pretty simple concept: A runtime is an executable called bootstrap that queries the Lambda Runtime Interface for the next request to handle. The Lambda documentation is phrased as if this runtime has to be a dispatcher that calls the separate function’s handler. But it could just do all the things directly.
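In pseudo-code, the heart of such a runtime is a simple loop. Here is a minimal sketch; the endpoint paths are those of the Lambda Runtime API, but httpGet, httpPost and the Event and Response types are hypothetical placeholders, not real library functions:

import Control.Monad (forever)
import System.Environment (getEnv)

runtimeLoop :: (Event -> IO Response) -> IO ()
runtimeLoop handle = do
  api <- getEnv "AWS_LAMBDA_RUNTIME_API"
  forever $ do
    -- long-poll the Runtime API for the next invocation
    (reqId, event) <- httpGet (api ++ "/2018-06-01/runtime/invocation/next")
    response <- handle event
    -- post the result back, addressed by the request id
    httpPost (api ++ "/2018-06-01/runtime/invocation/" ++ reqId ++ "/response") response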

I found the Haskell package aws-lambda-haskell-runtime which provides precisely that: A function

runLambda :: (LambdaOptions -> IO (Either String LambdaResult)) -> IO ()

that talks to the Lambda Runtime API and invokes its argument on each message. The package also provides Template Haskell magic to collect “handlers” of any JSON-able type and to generate a dispatcher, like you might expect from other, more dynamic languages. But that was too much magic for me, so I ignored that and just wrote the handler manually:

main :: IO ()
main = runLambda run
  where
    run :: LambdaOptions -> IO (Either String LambdaResult)
    run opts = do
      result <- handler (decodeObj (eventObject opts)) (decodeObj (contextObject opts))
      either (pure . Left . encodeObj) (pure . Right . LambdaResult . encodeObj) result

data Event = Event
  { path :: T.Text
  , body :: Maybe T.Text
  } deriving (Generic, FromJSON)

data Response = Response
  { statusCode :: Int
  , headers :: Value
  , body :: T.Text
  , isBase64Encoded :: Bool
  } deriving (Generic, ToJSON)

handler :: Event -> Context -> IO (Either String Response)
handler Event{body, path} context = …

I expose my Lambda function to the world via Amazon’s API Gateway, configured to just proxy the HTTP requests. This means that my code receives a JSON data structure describing the HTTP request (here called Event, listing only the fields I care about), and it will respond with a Response, again as JSON.

The handler can then simply pattern-match on the path to decide what to do. For example, this code handles URLs like /img/CAFFEEFACE.png and responds with an image:

handler :: Event -> Context -> IO (Either String Response)
handler Event{body, path} context
    | Just bytes <- isImgPath path >>= T.decodeHex = do
        let pngData = genPurePNG bytes
        pure $ Right Response
            { statusCode = 200
            , headers = object [ "Content-Type" .= ("image/png" :: String) ]
            , isBase64Encoded = True
            , body = T.decodeUtf8 $ LBS.toStrict $ Base64.encode pngData
            }
    …

isImgPath :: T.Text -> Maybe T.Text
isImgPath = T.stripPrefix "/img/" >=> T.stripSuffix ".png"

If this program were to grow, one should probably use something more structured for routing; maybe servant, or bridging towards wai apps (almost like wai-lambda, but that still assumes an existing runtime, instead of simply being the runtime). But for my purposes, no extra layers of indirection or abstraction are needed!

Deploying Haskell to Lambda

Building Haskell locally and deploying to different machines is notoriously tricky; you often end up depending on a shared library that is not available on the other platform. The aws-lambda-haskell-runtime package, and similar projects like serverless-haskell, solve this using stack and Docker – two technologies that are probably great, but I never warmed up to them.

So instead of adding layers and complexity, can I solve this by making things simpler? If I compile my bootstrap into a static Linux binary, it should run on any Linux, including Amazon Linux.

Unfortunately, building Haskell programs statically is also notoriously tricky. But it is made much simpler by the work of Niklas Hambüchen and others in the context of the Nix package manager, coordinated in the static-haskell-nix project. The promise here is that once you have set up building your project with Nix, then getting a static version is just one flag away. The support is not completely upstreamed into nixpkgs proper yet, but their repository has a nix file that contains a nixpkgs set with their patches:

let pkgs = (import (sources.nixpkgs-static + "/survey/default.nix") {}).pkgs; in

This, plus a fairly standard nix setup to build the package, yields what I was hoping for:

$ nix-build -A kaleidogen
/nix/store/ppwyq4d964ahd6k56wsklh93vzw07ln0-kaleidogen-0.1.0.0
$ file result/bin/kaleidogen-amazon-lambda
result/bin/kaleidogen-amazon-lambda: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
$ ls -sh result/bin/kaleidogen-amazon-lambda
6,7M result/bin/kaleidogen-amazon-lambda

If we put this file, named bootstrap, into a zip file and upload it to Amazon Lambda, then it just works! Creating the zip file is easily scripted using nix:

  function-zip = pkgs.runCommandNoCC "kaleidogen-lambda" {
    buildInputs = [ pkgs.zip ];
  } ''
    mkdir -p $out
    cp ${kaleidogen}/bin/kaleidogen-amazon-lambda bootstrap
    zip $out/function.zip bootstrap
  '';

So to upload this, I use this one-liner (line-wrapped for your convenience):

nix-build -A function-zip &&
aws lambda update-function-code --function-name kaleidogen \
  --zip-file fileb://result/function.zip

Thanks to how Nix pins all dependencies, I am fairly confident that I can return to this project in 4 months and still be able to build it.

Of course, I want continuous integration and deployment. So I build the project with GitHub Actions, using a cachix nix cache to significantly speed up the build, and auto-deploy to Lambda using aws-lambda-deploy; see my workflow file for details.

The Telegram part

The above allows me to run basically any stateless service, and a Telegram bot is nothing else: When configured to act as a WebHook, Telegram will send a request with a message to our Lambda function, where we can react to it.

The telegram-api package provides bindings for the Telegram Bot API (although I had to use the repository version, as the version on Hackage has some bitrot). Slightly simplified, I can write a handler for an Update:

handleUpdate :: Update -> TelegramClient ()
handleUpdate Update{ message = Just m } = do
  let c = ChatId (chat_id (chat m))
  liftIO $ printf "message from %s: %s\n" (maybe "?" user_first_name (from m)) (maybe "" T.unpack (text m))
  if "/start" `T.isPrefixOf` fromMaybe "" (text m)
  then do
    rm <- sendMessageM $ sendMessageRequest c "Hi! I am @KaleidogenBot. …"
    return ()
  else do
    m1 <- sendMessageM $ sendMessageRequest c "One moment…"
    withPNGFile  $ \pngFN -> do
      m2 <- uploadPhotoM $ uploadPhotoRequest c
        (FileUpload (Just "image/png") (FileUploadFile pngFN))
      return ()
handleUpdate u =
  liftIO $ putStrLn $ "Unhandled message: " ++ show u

and call this from the handler that I wrote above:

    …
    | path == "/telegram" =
      case eitherDecode (LBS.fromStrict (T.encodeUtf8 (fromMaybe "" body))) of
        Left err -> …
        Right update -> do
          runTelegramClient token manager $ handleUpdate update
          pure $ Right Response
            { statusCode = 200
            , headers = object [ "Content-Type" .= ("text/plain" :: String) ]
            , isBase64Encoded = False
            , body = "Done"
            }
    …

Note that the Lambda code receives the request as a JSON data structure with a body that contains the original HTTP request body – which, in this case, is itself JSON, so we have to decode that.

All that is left to do is to tell Telegram where this code lives:

curl --request POST \
  --url https://api.telegram.org/bot<token>/setWebhook \
  --header 'content-type: application/json' \
  --data '{"url": "https://api.kaleidogen.nomeata.de/telegram"}'

As a little add-on, I also created a Telegram game for Kaleidogen. A Telegram game is nothing but a webpage that runs inside Telegram, so it wasn’t much work to wrap the web version of Kaleidogen that way, but the resulting Telegram game (which you can access via https://core.telegram.org/bots/games) still looks pretty neat.

No /dev/dri/renderD128

I am mostly happy with this setup: My game is now available to more people in more ways. I don’t have to maintain any infrastructure. When nobody is using this bot, no resources are wasted, and the costs of the service are negligible – this is unlikely to go beyond the free tier, and even if it did, the cost per generated image is roughly USD 0.000021.

There is one slight disappointment, though. What I find most interesting about Kaleidogen from a technical point of view is that when you play it in the browser, the images are not generated by my code. Instead, my code creates a WebGL shader program on the fly, and that program generates the image on your graphics card.

I even managed to make the GL rendering code work headlessly, i.e. from a command line program, using EGL and libgbm and a helper written in C. But it needs access to a graphics card via /dev/dri/renderD128. Amazon does not provide that to Lambda code, and neither do the other big Function-as-a-service providers. So I had to swallow my pride and reimplement the rendering in pure Haskell.

So if you think the bot is kinda slow, then that’s why. Despite properly optimizing the pure implementation (the inner loop does not do allocations and deals only with unboxed Double# values), the GL shader version is still three times as fast. Maybe in a few years GPU access will be so ubiquitous that it’s even on Amazon Lambda; then I can easily use that.
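For illustration, the inner loop is written in roughly this style (a made-up example, not the actual Kaleidogen rendering code): as long as everything stays in unboxed Int# and Double# values, GHC compiles such a loop to tight, allocation-free machine code.

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- made-up example of an allocation-free loop over unboxed values
sumSquares :: Int -> Double
sumSquares (I# n) = D# (go 0# 0.0##)
  where
    go i acc
      | isTrue# (i >=# n) = acc
      | otherwise =
          let d = int2Double# i
          in  go (i +# 1#) (acc +## (d *## d))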

30 years of Haskell

Published 2020-04-01 in sections English, Haskell.

Vitaly Bragilevsky, in a mail to the GHC Steering Committee, reminded me that the first version of the Haskell programming language was released exactly 30 years ago. On April 1st. So that raises the question: Was Haskell just an April Fools’ joke that was never retracted?

The cover of the 1.0 Haskell report

My own first exposure to Haskell was in April 2005; the oldest piece of Haskell I could find on my machine is this part of a university assignment from April:

> pascal 1 = [1]
> pascal (n+1) = zipWith (+) (x ++ [0]) (0 : x) where x = pascal n

This means that I now have witnessed half of Haskell's existence. I have never regretted getting into Haskell, and every time I come back from having worked in other languages (which all have their merits too), I greatly enjoy the beauty and elegance of expressing my ideas in a lazy and strictly typed language with a concise syntax.

I am looking forward to witnessing (and, to a very small degree, shaping) the next 15 years of Haskell.

Animations in Kaleidogen

Published 2020-03-31 in sections English, Haskell.

A while ago I wrote a little game (or toy) called Kaleidogen. It is a relatively contemplative game where, starting from just unicolored disks, you combine abstract circular patterns to breed more interesting patterns. See my FARM 2019 talk for more details, or check out the source repository.

It has mostly been quiet with this game, but I finally got around to adding a little bit of animation: When you have bred one of these patterns, you can animate its genesis, from nothing to a complex pattern, as you can see in this screencast:

Kaleidogen, animated

By the way: I am looking for collaborators who help me to get this into the Play Store properly, so let me know if you want to play around with Haskell, Android, Nix, OpenGL and cross-compilation.

git post-squash

Published 2020-02-03 in sections English, Digital World.

I wrote a little git tool that helps in an environment where PRs are merged using squash merges, but you still want to deal with feature branches properly.

What is a squash merge?

One popular workflow involving Git and GitHub is a squash-merge based workflow: You develop your feature on a feature branch (say featureA), adding commits as you go, possibly merging from master a few times:

M1 ─ M2 ─────────── M3 ────── M4            (master)
       ╲              ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5              (featureA)

When the feature is ready, you merge master into featureA a last time (e.g. to check on your CI infrastructure that this merge does not break the build):

M1 ─ M2 ─────────── M3 ────── M4            (master)
       ╲              ╲         ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5 ─ A6         (featureA)

and now you do a squash merge (or use GitHub’s green “Squash merge” button, or mergify.io’s squash merge action). The result is a new commit M5 on master that contains all the changes from featureA:

M1 ─ M2 ─────────── M3 ────── M4 ─ M5       (master)
       ╲              ╲         ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5 ─ A6         (featureA)

Note that there is no line from A6 to M5. This means that the git history of master is clean, and does not contain the usually boring and unhelpful history of how featureA came to be; no “fix typo” commits, no “merge master into featureA” commit.

But the downside is that, as far as git is concerned, this commit is totally unrelated to the featureA branch. This is not a problem as long as featureA lives on its own. But it becomes a problem if there are feature branches building off featureA:

What is the problem with squash merge?

Consider the situation above, but add featureB to the mix, a feature branch that was created off featureA:

M1 ─ M2 ─────────── M3 ────── M4 ── M5      (master)
       ╲              ╲         ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5 ─ A6         (featureA)
               ╲         ╲
                B1 ────── B2 ─ B3           (featureB)

We now want to bring the latest changes from featureA and master into featureB. Merging featureA into featureB is straightforward:

M1 ─ M2 ─────────── M3 ────── M4 ── M5      (master)
       ╲              ╲         ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5 ─ A6         (featureA)
               ╲         ╲         ╲
                B1 ────── B2 ─ B3 ─ B4      (featureB)

But if we run git merge master now, we are likely running into very unfortunate git conflicts. Because to git, M5 is unrelated to featureA, it does not know that all those changes had already been merged into featureB when we created the merge commit B4!

But we know that M5 contains nothing that isn’t already in featureB, because M5 was a squash commit of A6.

The manual way of resolving this is to run

git merge -s ours master

which tells git: Pretend that we merged master into this, but don’t actually touch any of the files, everything on the current branch is already in the form we want. This way, we get

M1 ─ M2 ─────────── M3 ────── M4 ── M5      (master)
       ╲              ╲         ╲     ╲
        A1 ─ A2 ─ A3 ─ A4 ─ A5 ─ A6    ╲    (featureA)
               ╲         ╲         ╲    ╲
                B1 ────── B2 ─ B3 ─ B4 ─ B5 (featureB)

Note: git merge -s ours is not the same as git merge -X ours! See the manpage for git merge for details.

How does git-post-squash help?

While the manual way works, one needs to be careful: If master has progressed further, or if featureA was not fully up-to-date before the squash merge, using git merge -s ours will easily and silently undo changes that were already committed to master.

So instead run

git post-squash master

which will do git merge -s ours, but it will

  • find the right commit on master to use (it may not be the latest) and
  • double-check that nothing is lost.

It does so by picking the latest commit on master that has the same tree as some commit on the current branch that is not yet on master.

In the example above, it would pick M5 because it has the same tree as A6, which is a commit that exists on featureB, but not on master.
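Restated in Haskell, the selection logic looks like this (just an illustration with hypothetical Commit and tree types, not the tool’s actual implementation):

import Data.List (find)
import qualified Data.Set as Set

type Tree = String                            -- a git tree hash
data Commit = Commit { sha :: String, tree :: Tree }

-- Pick the latest commit on master whose tree equals the tree of
-- some commit that is on the current branch but not yet on master.
findSquashCommit
  :: [Commit]       -- commits on master, newest first
  -> [Commit]       -- commits on the current branch, not on master
  -> Maybe Commit
findSquashCommit masterCommits branchOnly =
  let branchTrees = Set.fromList (map tree branchOnly)
  in  find (\m -> tree m `Set.member` branchTrees) masterCommits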

Convinced? Go and get it on https://github.com/nomeata/git-post-squash.

Faster Winter: Statistics (the making-of)

Published 2019-11-24 in sections English, Haskell.

(This is an appendix to the “faster winter” series, please see that post for background information.)

Did you like the graph and the stats that I produced? Just for completeness, I am including the various scripts I used. Nothing super exciting to see here, but maybe someone finds this useful.

This little shell one-liner collects the run-time statistics for each commit in the interesting range (line-wrapped for your convenience):

for h in $(git log 1cea7652f48fad348af914cb6a56b39f8dd99c6a^..5406efd9e057aebdcf94d14b1bc6b5469454faf3 --format=%H)
do
  echo -n "$h"
  git checkout -q "$h"
  cabal new-build -v0
  echo -n ":"
  rm -f stats/$h.txt
  for i in $(seq 1 5)
  do
    cabal -v0 new-run exe:wasm-invoke -- -w loop.wasm  -f canister_init +RTS -t >/dev/null 2>> stats/$h.txt
    echo -n .
  done
  echo
done

A small Perl script takes the minimum for each measurement across the five runs, and produces a CSV file:

#!/usr/bin/perl

use List::Util qw(min);

my @alloc;
my @in_use;
my @time;

while (<>) {
  m!<<ghc: (\d+) bytes, \d+ GCs, \d+/\d+ avg/max bytes residency \(\d+ samples\), (\d+)M in use, [\d.]+ INIT \(([\d.]+) elapsed\), [\d.]+ MUT \(([\d.]+) elapsed\), [\d.]+ GC \(([\d.]+) elapsed\) :ghc>>! or die $!;
  push @alloc, 0+$1;
  push @in_use, $2;
  push @time, $3+$4+$5;
}

printf "%d;%d;%f\n", min(@alloc), min(@in_use), min(@time);

To create a full file for all the commits in the range that have stats files, I used this bash one-liner (again line-wrapped for your convenience):

echo 'commit;allocations;memory;time' > stats.csv
for h in $(git log 1cea7652f48fad348af914cb6a56b39f8dd99c6a^..5406efd9e057aebdcf94d14b1bc6b5469454faf3 --format=%H|tac)
do
  git log -n 1 --oneline $h
  test -f stats/$h.txt && echo "$(echo $h|cut -c-7);$(./tally.pl < stats/$h.txt)" | tee -a stats.csv
done

The stats can be turned into the graph using pgfplots by compiling this LaTeX file:

\documentclass[class=minimal]{standalone}
\usepackage{mathpazo}
\usepackage{pgfplots}
\definecolor{skyblue1}{rgb}{0.447,0.624,0.812}
\definecolor{scarletred1}{rgb}{0.937,0.161,0.161}
\pgfplotsset{width=12cm,compat=newest}

% From https://tex.stackexchange.com/a/63340/15107
\makeatletter
\pgfplotsset{
    /pgfplots/flexible xticklabels from table/.code n args={3}{%
        \pgfplotstableread[#3]{#1}\coordinate@table
        \pgfplotstablegetcolumn{#2}\of{\coordinate@table}\to\pgfplots@xticklabels
        \let\pgfplots@xticklabel=\pgfplots@user@ticklabel@list@x
    }
}
\makeatother

\begin{document}
\begin{tikzpicture}

\pgfplotsset{every axis/.style={ymin=0}}
\begin{semilogyaxis}[
  skyblue1,
  scale only axis,
  axis y line*=left,
  ylabel=Allocation (bytes),
  flexible xticklabels from table={stats.csv}{[index]0}{col sep=semicolon},
  xticklabel style={rotate=90, anchor=east, text height=1.5ex, font=\ttfamily, color=black},
  xtick=data,
  ]
\addplot[const plot mark mid, color=skyblue1]
  table [x expr=\coordindex+1, y index=1, col sep=semicolon] {stats.csv};
\end{semilogyaxis}

\begin{semilogyaxis}[
  green,
  scale only axis,
  axis y line*=right,
  ylabel=Memory (MB),
  x tick style={draw=none},
  xtick=\empty,
  ]
\addplot[const plot mark mid, color=green]
  table [x expr=\coordindex+1, y index=2, col sep=semicolon] {stats.csv};
\end{semilogyaxis}


\begin{semilogyaxis}[
  red,
  scale only axis,
  axis y line*=right,
  ylabel=Time (seconds),
  x tick style={draw=none},
  xtick=\empty,
  ]
\pgfplotsset{every outer y axis line/.style={xshift=2cm}, every tick/.style={xshift=2cm}, every y tick label/.style={xshift=2cm} }
\addplot[const plot mark mid, color=red]
  table [x expr=\coordindex+1, y index=3, col sep=semicolon] {stats.csv};
\end{semilogyaxis}
\end{tikzpicture}
\end{document}

And finally this Perl script allows me to paste any two lines from the CSV file and produces appropriate Markdown for the “improvement” lines in my posts:

#!/usr/bin/perl

my $first = 1;

my $commit;
my $alloc;
my $in_use;
my $time;

while (<>) {
  /(.*);(.*);(.*);(.*)/ or die;
  unless ($first) {
    printf "**Improvement**: Allocations: %+.2f%%  Memory: %+.2f%%  Time: %+.2f%% (Commit [%s...%s](http://github.com/dfinity/winter/compare/%s...%s))\n",
      (100 * ($2/$alloc - 1)),
      (100 * ($3/$in_use - 1)),
      (100 * ($4/$time - 1)),
      $commit,
      $1,
      $commit,
      $1;
  }
  $first = 0;
  $commit = $1;
  $alloc = $2;
  $in_use = $3;
  $time = $4;
}

Faster Winter 7: The Zipper

Published 2019-11-24 in sections English, Haskell.

(This is the seventh optimization presented in the “faster winter” series, please see that post for background information.)

The last bit of performance could be considered a domain-specific optimization, as one might describe it as “introducing a control stack to the interpreter”. But in a different light, one could consider it the introduction of a more appropriate data structure, by applying a “zipper”-like transformation to the existing data structure.

The problem is that the state of the system (datatype Code) is a stack of values and a stack of instructions. Simplifying this for this post, we might have

data Code = Code [Value] [Instr]
data Instr
  = Const Value | Add | Sub | Return
  | Labeled Int Code

The interpreter gets such a Code, but does not always just execute the top-most instruction: If that is a Labeled instruction, it has to execute the next instruction in the argument of that Labeled. (It is not very important at this point to understand what a Labeled is used for). So we might have a Code that looks like the following:

c1 = Code [] [Labeled 1 (Code [2] [Labeled 2 (Code [3,4] [Add]), Sub]), Return]

The next instruction to execute is actually the Add. But in order to find that, the function step has to look under the first Labeled, look under the second Labeled, then execute step (Code [3,4] [Add]) = Code [7] [], and finally wrap it again in the two Labeled, to produce:

c2 = Code [] [Labeled 1 (Code [2] [Labeled 2 (Code [7] []), Sub]), Return]

Then eval calls step again, and step has to look inside the Labeled again to find the next instruction to execute.

It would be easier if the next instruction to execute were presented to step more conveniently, right as a field of Code. But in order to do this, we have to move the Labeled entries “out of the way”. I do that by adding a new, first parameter to Code where I keep a stack of all the Labeled constructors that were in the way, in reverse order. So the c1 above would now be

data Code = Code Control [Value] [Instr]
data Control = Empty | Labeled Int Code

c1' = Code (Labeled 2 (Code (Labeled 1 (Code Empty [] [Return])) [2] [Sub])) [3,4] [Add]

Can you see how this relates to c1 above? The important piece here is that the interpreter now always finds the next instruction to execute at the head of the instruction list of the outermost Code, and as long as there is something to execute there, it doesn't have to touch the Control stack at all.
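To make this concrete, here is roughly what step could look like on the new representation – a sketch in the simplified terms of this post (with Value = Int, ignoring Return and glossing over how values flow when a block ends), not the actual winter code:

type Value = Int   -- simplification for this post

step :: Code -> Code
-- the next instruction is right there at the head, no searching:
step (Code ctrl (x:y:vs) (Add : is)) = Code ctrl (x + y : vs) is
step (Code ctrl (x:y:vs) (Sub : is)) = Code ctrl (x - y : vs) is
step (Code ctrl vs (Const v : is))   = Code ctrl (v : vs) is
-- only when the instruction list is exhausted do we pop the
-- control stack and resume the suspended outer code:
step (Code (Labeled _ (Code ctrl vs' is)) vs []) = Code ctrl (vs ++ vs') is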

This required touching more lines of code than the previous optimizations, but doing so didn't require much creativity, as the old and new Code types are in clear correspondence, and that guided me in how to adjust the code. And it’s certainly worth it:

Improvement: Allocations: -46.17% Time: -36.88% (Commit e66f1e0...57f3e9d)

Faster Winter 6: Simpler Code

Published 2019-11-24 in sections English, Haskell.

(This is the sixth optimization presented in the “faster winter” series, please see that post for background information.)

The Wasm reference interpreter has a function step that performs one single step of evaluation, by taking the state before (in a type called code), and returning the state after the step. The function eval calls step, checks if the result is “done” (no instructions left to do), and if it is not done, recursively calls eval again. This way, evaluation progresses step by step until it is done.

The Haskell port follows the same pattern of a single-step step and a driver eval, but it chose to write the code in continuation-passing style: Instead of returning the new state, the function step takes a function that it passes the result to. So the code looks like this (slightly simplified):

type CEvalT m a = ReaderT Config (ExceptT EvalError m) a

step :: PrimMonad m => Code -> (Code -> CEvalT m r) -> CEvalT m r
step c k = … k new_c …

eval :: PrimMonad m => Code -> CEvalT m (Stack Value)
eval c =
    if is_done c
    then pure (stack c)
    else step c eval

There must have been a good reason to prefer this style over the plain style, but I was curious if it was really helpful. So I changed it to the following, more straight-forward code:

type CEvalT m a = ReaderT Config (ExceptT EvalError m) a

step :: PrimMonad m => Code -> CEvalT m Code
step c = … new_c …

eval :: PrimMonad m => Code -> CEvalT m (Stack Value)
eval c =
    if is_done c
    then pure (stack c)
    else step c >>= eval

And indeed, the simpler code worked better:

Improvement: Allocations: -9.6% Time: -16.91% (Commit eb8add3...e66f1e0)

Again, avoiding function values (as we construct them as the continuation closures) might have helped here.

Or maybe the simpler code allowed GHC to notice that the Code value, which is simply a tuple with a different name, is constructed by step but immediately deconstructed by eval, and thus GHC could optimize that away (by “unboxing” the argument and/or result of step and simply passing the components of the tuple).

And finally, independent of all the performance questions, it also makes the code easier to understand.

Faster Winter 5: Eta-Expanding ReaderT

Published 2019-11-22 in sections English, Haskell.

(This is the fifth optimization presented in the “faster winter” series, please see that post for background information.)

Another good approach to performance tuning is to look at the code after GHC optimized it. So I planted a

{-# OPTIONS_GHC -ddump-simpl -dsuppress-coercions -dsuppress-unfoldings -dsuppress-module-prefixes #-}

at the top of Eval.hs, and looked through the code. This is not Haskell, but rather “GHC Core”, a simpler functional programming language that GHC uses internally. It is more verbose and less pretty, so it takes a bit of getting used to, but it’s nothing to be afraid of when you are a somewhat experienced Haskell developer.

There I found (much simplified) the following pattern:

step :: Instr -> Config -> IO Result
step e = case e of
  Add -> \c -> do stuff …
  Del -> \c -> do stuff …
  …

That’s bad! Yes, Haskell is a functional language, and passing around anonymous functions is very nice to write expressive code, and for most purposes it is not too slow … but in an inner loop, you really don’t want any such closures. So where did this come from? As expected, the Haskell source did not have those inner lambdas. Instead, it was using a very innocent-looking monad transformer:

step :: Instr -> ReaderT Config IO Result
step e = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

A ReaderT r m a is just a different way of writing r -> m a that allows us to use do-notation or the monad combinators without having to pass the r around explicitly, and as such it is indeed very convenient. But it is not as efficient as if we had written

step :: Instr -> Config -> IO Result
step e c = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

where the step function takes two arguments right away, and no anonymous functions are created.

Why doesn’t our amazing Haskell compiler figure out that this would be better? Because it is not better in all situations: If we store step e :: ReaderT Config IO Result somewhere and use it many times, with the same e but passing many different c :: Config, then we have to do the case e analysis only once. This can sometimes be better, so the compiler has to leave it in that form, in case we did it intentionally.
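For example, in hypothetical caller code like this (not from the winter code base), the case e analysis happens only once, and the resulting function is reused for every Config:

-- scrutinizes e once when f is first forced, then reuses f
runMany :: Instr -> [Config] -> IO [Result]
runMany e configs = mapM (runReaderT f) configs
  where f = step e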

(Incidentally, the question of how to allow the compiler to eta-expand more functions seems to eternally haunt me, and its pursuit even led to a PhD thesis.)

So how can we fix it? One relatively crude way is to shove it into the compiler’s face that we really want step to be a function with two parameters, by wrapping the whole body in, well, a lambda. But we still want to use the Reader monad in the body of step.

So I came up with this:

step :: Instr -> ReaderT Config IO Result
step e = ReaderT $ \c -> ($ c) $ runReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

Now the \c -> is outside the case, the compiler adds it to the arguments of step and we get the code that we want (confirmed by a quick peek at the Core).

Improvement: Allocations: -23.20% Time: -23.00% (Commit f5a0dd2...894070f)

I used this pattern in more than once place, so I wanted to abstract it into a little helper definition. But that’s not so easy: If I just write

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT m = ReaderT $ \c -> ($ c) $ runReaderT m

step :: Instr -> ReaderT Config IO Result
step e = etaReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

then the whole thing doesn't work any more! Because now, the case e is again “outside” the \c ->.

I whined on Twitter about this, and Sebastian Graf helpfully reminded me of GHC.Exts.oneShot, a little magic function that was added to GHC 5 years ago … by some forgetful person: me.

If we use this in the right place inside etaReaderT it tells GHC in a soothing voice “hey! it’s ok! you can move this lambda out of cases. believe me. it’s gonna be ok”. And with this, it works:

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT = ReaderT . oneShot . runReaderT

I wonder if this function would make a good addition to Control.Monad.Trans.Reader.

Incidentally, if you look at the code at the end of all my optimizations, there is no mention of etaReaderT any more: Subsequent optimizations simplified the code so much that eventually GHC was able to do this transformation without my help.

Faster Winter 4: Export lists

Published 2019-11-21 in sections English, Haskell.

(This is the forth optimization presented in the “faster winter” series, please see that post for background information.)

This is a funny one: You can speed up your code by adding export lists to your modules!

Why is that so?

Without an export list, the compiler has to assume that every top-level function can possibly be called from the outside, even functions that you think of as “internal”. If you have a function that you do not export, like instr, step_work and step after my change, the compiler can see all the places where the function is called. If the function is only called in one place, it may inline it (copy its definition into the place where it is called) and simplify the code around the edges. And even if it does not inline the function, it might learn something about how the functions are used, and optimize them based on that (e.g. based on Demand Analysis).
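The change itself is tiny. Sketched (the real export list of the module is a bit longer):

-- Everything not exported here, including instr, step_work and
-- step, is now internal, and GHC is free to inline and specialize it.
module Wasm.Exec.Eval
  ( eval
  ) where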

Improvement: Allocations: -22.59% Memory: +0.00% Time: -11.79% (Commit bbe8af7...6f2ba09)

Besides being a performance win, an explicit export list is also very helpful to those who want to read or edit your code: they can refactor with greater ease while maintaining the external interface of the module.

Faster Winter 3: Difference Lists

Published 2019-11-20 in sections English, Haskell.

(This is the third optimization presented in the “faster winter” series, please see that post for background information.)

Today, we will tackle the big bad memory leak that eats up my laptop’s memory.

A quick look at the heap profile (+RTS -h) showed that the memory was filling up with lists. Not very helpful, as lists are everywhere. So I looked through the hot code of the interpreter, eval, step and instr in Wasm.Exec.Eval to see if anything fishy is going on. I found some uses of the list concatenation operator (++) – always a bad sign, as it has to traverse the list on its left completely!

And often the solution is pretty simple: Use difference lists! It’s even simpler than the name makes it sound. It does not require you to import anything new, and it works well wherever you assemble a list in multiple stages, but use it, in its full form, only once at the end. The trick is easy:

  • In the types, replace [t] with [t] -> [t] (or introduce an alias type DList a = [a] -> [a])
  • Replace [] with id
  • Replace [x] with (x:)
  • Replace xs ++ ys with xs . ys, if ys is also a difference list
  • Replace xs ++ ys with xs ys, if ys is a normal list
  • To turn the difference list xs into a normal list, write xs []

That’s it, and suddenly you have a list-like data structure with constant-time concatenation!
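Here is the recipe applied to a tiny, made-up example:

type DList a = [a] -> [a]

part1, part2, part3 :: DList Int
part1 = (1:) . (2:)   -- replaces [1, 2]
part2 = id            -- replaces []
part3 = (3:)          -- replaces [3]

-- concatenation is now function composition, constant time each:
assembled :: [Int]
assembled = (part1 . part2 . part3) []   -- evaluates to [1,2,3]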

Improvement: Allocations: -9.67% Memory: -99.08% Time: -47.36% (Commit 2e284f8...f9bbe8e)

Memory consumption is now at 60MB. I found some more places where I could use difference lists, and then it was finally at essentially zero, which is what you would expect for the program at hand:

Improvement: Allocations: -0.21% Memory: -96.83% Time: -2.91% (Commit f9bbe8e...bbe8af7)

To be honest, I am not actually sure why this fixed a space leak: Difference lists are not more memory efficient than normal lists! But maybe somehow these lists were more strictly evaluated once they were difference lists? Anyways, I was happy with the result and did not investigate further.