Joachim Breitner

Blog

Thoughts on bootstrapping GHC

Published 2018-12-13 in sections English, Haskell.

I am returning from the reproducible builds summit 2018 in Paris. The latest hottest thing within the reproducible-builds project seems to be bootstrapping: How can we build a whole operating system from nothing but source code, using very little, or even no, binary seeds or auto-generated files? This is actually a concern that is somewhat orthogonal to reproducibility: Bootstrappable builds help me trust programs that I built, while reproducible builds help me trust programs that others built.

And while they are making good progress bootstrapping a full system from just a C compiler written in Scheme and a Scheme interpreter written in C that can build each other (Janneke’s mes project), with plans to build all of that on top of stage0, which starts from a mere 280 bytes of binary, the situation looks pretty bad when it comes to Haskell.

Unreachable GHC

The problem is that contemporary Haskell has only one viable implementation, GHC. And GHC, written in contemporary Haskell, needs GHC to be built. So essentially everybody out there either just downloads a binary distribution of GHC, or builds GHC from source using a possibly older (but not much older) version of GHC that they already have. Even distributions like Debian do nothing different: When they build the GHC package, the builders use, well, the GHC package.

There are other Haskell implementations out there. But if they are mature and actively developed, then they are implemented in Haskell themselves, often even using advanced features that only GHC provides. And even those are insufficient to build GHC itself, let alone the old and abandoned Haskell implementations.

In all these cases, at some point an untrusted binary is used. This is very unsatisfying. What can we do? I don’t have the answers, but please allow me to outline some avenues of attack.

Retracing history

Obviously, even GHC has not existed since the beginning of time, and the first versions surely were built using something other than GHC. The oldest version of GHC for which we can find a release on the GHC web page is version 0.29 from July 1996. But the installation instructions say:

GHC 0.26 doesn't build with HBC. (It could, but we haven't put in the effort to maintain it.)

GHC 0.26 is best built with itself, GHC 0.26. We heartily recommend it. GHC 0.26 can certainly be built with GHC 0.23 or 0.24, and with some earlier versions, with some effort.

GHC has never been built with compilers other than GHC and HBC.

So it seems that besides GHC, only HBC was ever used to compile GHC. HBC is a Haskell compiler of which we can find the sources of only one random version, thanks to archive.org. Parts of it are written in C, so I looked into this: Compile HBC, use it to compile GHC-0.29, and then, step by step, build every (major) version of GHC until today.

The problem is that it is non-trivial to build software from the 90s using today’s compilers. I briefly looked at the HBC code base, and had to change some files from using varargs.h to stdarg.h, and this is surely just one of many similar stumbling blocks one would hit trying to build these tools. Oh, and even the HBC sources state:

# To get everything done: make universe
# It is impossible to make from scratch.
# You must have a running lmlc, to
# recompile it (of course).

So I learned that actually most of it is written in LML, and the LML compiler is written in LML. So this is a dead end. (Thanks to Lennart for clearing up a misunderstanding on my side here.)

Going back, but doing it differently

Another approach is to go back in time to some old version of GHC, but maybe not all the way to the beginning, and then try to use another, officially unsupported, Haskell compiler to build GHC. This is what rekado tried to do in 2017: He used the most contemporary implementation of Haskell in C, the Hugs interpreter. Using this, he compiled nhc98 (yet another abandoned Haskell implementation), with the hope of building GHC with nhc98. He made impressive progress back then, but ran into a problem where the runtime crashed. Maybe someone is interested in picking up from there?

Removing, simplifying, extending, in the present.

Both approaches so far focus on building an old version of GHC. This adds complexity: other tools (the shell, make, yacc etc.) may behave differently now, in a way that causes hard-to-debug problems. So maybe it is more fun and more rewarding to focus on today’s GHC? (At this point I am starting to hypothesize.)

I said before that no other existing Haskell implementation can compile today’s GHC code base, because of features like mutually recursive modules, the foreign function interface etc. Also, other existing Haskell implementations often come with a different, smaller set of standard libraries, but GHC assumes base, so we would have to build that as well...

But we don’t need to build it all. Surely there is much code in base that is not used by GHC. Also, there is much code in GHC that is not needed to build GHC itself. By removing all of that, we reduce the amount of Haskell code that we need to feed to the other implementation.

The remaining code might use some features that are not supported by our bootstrapping implementation. Mutually recursive modules could be manually merged. GADTs that are only used for additional type safety could be replaced by normal data types, which might make some pattern matches incomplete. Syntactic sugar can be desugared. By simplifying the code base in that way, one might be able to produce a fork of GHC that is within reach of the likes of Hugs or nhc98.
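To make one of these transformations concrete, here is a hypothetical sketch (the types are invented for illustration and not taken from GHC’s code base): a GADT that only adds type safety can be replaced by a plain data type, at the price of some pattern matches becoming incomplete.

{-# LANGUAGE GADTs #-}

-- Before: a GADT whose type index only adds safety.
data Expr t where
  IntLit  :: Int  -> Expr Int
  BoolLit :: Bool -> Expr Bool
  If      :: Expr Bool -> Expr a -> Expr a -> Expr a

-- After: a plain ADT that a simpler compiler could handle. Code that
-- previously relied on the type index now has to cope with cases that
-- the GADT ruled out statically.
data SimpleExpr
  = SIntLit Int
  | SBoolLit Bool
  | SIf SimpleExpr SimpleExpr SimpleExpr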

And if there are features that are hard to remove, maybe we can extend the bootstrapping compiler or interpreter to support them? For example, it was mostly trivial to extend Hugs with support for the # symbol in names – and we can be pragmatic and just allow it always, since we don’t need a standards conforming implementation, but merely one that works on the GHC code base. But how much would we have to implement? Probably this will be more fun in Haskell than in C, so maybe extending nhc98 would be more viable?

Help from beyond Haskell?

Or maybe it is time to create a new Haskell compiler from scratch, written in something other than Haskell? Maybe some other language that is reasonably pleasant to write a compiler in (OCaml? Scala?), but that has the bootstrappability story already sorted out somehow.

But in the end, all variants come down to the same problem: Writing a Haskell compiler for full, contemporary Haskell as used by GHC is hard and really a lot of work – if it were not, there would at least be other implementations out there. And as long as nobody comes along and does that work, I fear that we will continue to be unable to build our nice Haskell ecosystem from scratch. Which I find somewhat dissatisfying.

Sliding Right into Information Theory

Published 2018-12-03 in sections English, Math.

It's hardly news any more, but it seems I have not blogged about my involvement last year with an interesting cryptanalysis project, which resulted in the publication Sliding right into disaster: Left-to-right sliding windows leak by Daniel J. Bernstein, me, Daniel Genkin, Leon Groot Bruinderink, Nadia Heninger, Tanja Lange, Christine van Vredendaal and Yuval Yarom, which was published at CHES 2017 and on ePrint (ePrint is the cryptographer’s version of arXiv).

This project nicely touched upon many fields of computer science: First we need systems expertise to mount a side-channel attack that uses cache timing differences to observe which line of a square-and-multiply algorithm the target process is executing. Then algorithm analysis is required to learn partial information about the bits of the private key from these observations. This part includes nice PL-y concepts like rewrite rules (see Section 3.2). Once we know enough about the secret keys, we can use fancy cryptography to recover the whole secret key (Section 3.4). And finally, some theoretical questions arise, such as: “How much information do we need for the attack to succeed?” and “Do we obtain this much information?”, and we need some nice math and information theory to answer these.

Initially, I focused on the PL-related concepts. We programming language people are yak-shavers, and in particular “rewrite rules” just demands the creation of a DSL to express them, and an interpreter to execute them, doesn’t it? But it turned out that these rules are actually not necessary, as the key recovery can use the side-channel observation directly, as we found out later (see Section 4 of the paper). But now I was already hooked, and turned towards the theoretical questions mentioned above.

Shannon vs. Rényi

It felt good to shake the dust off some of the probability theory that I learned for my maths degree, and I also learned some new stuff. For example, it was intuitively clear that whether the attack succeeds depends on the amount of information obtained by the side-channel attack, and based on prior work, the expectation was that if we know more than half the bits, then the attack would succeed. Note that for this purpose, two known “half bits” are as good as knowing one full bit; for example, knowing that the secret key is either 01 or 11 (one bit known for sure) is just as good as knowing that the key is either 00 or 11.

Clearly, this is related to entropy somehow – but how? Trying to prove that the attack works if the entropy rate of the leak is > 0.5 just did not work, against all intuition. But when we started with a formula that describes when the attack succeeds, and then simplified it, we found a condition that looked suspiciously like what we wanted, namely H > 0.5, only that H was not the conventional entropy (also known as the Shannon entropy, H = −∑ p ⋅ log p), but rather something else: H = −log ∑ p², which turned out to be called the collision entropy or Rényi entropy (of order 2).
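For concreteness, here is a small Haskell sketch of the two quantities for a finite distribution; the numbers in the comments are rounded.

-- Shannon entropy vs. collision (Rényi order-2) entropy, both in bits.
shannonEntropy, collisionEntropy :: [Double] -> Double
shannonEntropy   ps = negate (sum [ p * logBase 2 p | p <- ps, p > 0 ])
collisionEntropy ps = negate (logBase 2 (sum [ p * p | p <- ps ]))

-- For a fair coin both are 1 bit; for a biased coin they differ:
--   shannonEntropy   [0.9, 0.1] ≈ 0.47
--   collisionEntropy [0.9, 0.1] ≈ 0.29
-- The collision entropy is never larger than the Shannon entropy.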

This resulted in Theorem 3 in the paper, and neatly answers the question of when the Heninger and Shacham key recovery algorithm, extended to partial information, can be expected to succeed, in a much more general setting than just this particular side-channel attack.

Markov chains and an information theoretical spin-off

The other theoretical question is now: Why does this particular side-channel attack succeed, i.e. why is the entropy rate H > 0.5? As so often, Markov chains are an immensely powerful tool to answer that question. After some transformations, I managed to model the state of the square-and-multiply algorithm, together with the side-channel leak, as a Markov chain with a hidden state. Now I just had to calculate its Rényi entropy rate, right? I wrote some Haskell code to do this transformation, and also came up with an ad-hoc, intuitive way of calculating the rate. So when it was time to write up the paper, I was searching for a reference that describes the algorithm that I was using…

Only I could find none! I contacted researchers who have published on Markov chains and entropies, but they just referred me in circles, until one of them, Maciej Skórski, responded. Our conversation, highly condensed, went like this: “Nice idea, but it can’t be right, it would solve problem X” – “Hmm, but it feels so right. Here is a proof sketch.” – “Oh, indeed, cool. I can even generalize this! Let’s write a paper”. Which we did! Analytic Formulas for Renyi Entropy of Hidden Markov Models (preprint only, it is still under submission).

More details

Because I joined the sliding-right project late, not all my contributions made it into the actual paper, and therefore I published an “inofficial appendix” separately on ePrint. It contains

  1. an alternative way to find the definitively knowable bits of the secret exponent, which is complete and can (in rare corner cases) find more bits than the rewrite rules in Section 3.1
  2. an algorithm to calculate the collision entropy H, including how to model a side-channel attack like this one as a Markov chain, and how to calculate the entropy of such a Markov chain, and
  3. the proof of Theorem 3.

I also published the Haskell code that I wrote for this project, including the Markov chain collision entropy stuff. It is not written with public consumption in mind, but feel free to ask if you have questions about it.

Note that all errors, typos and irrelevancies in that document and the code are purely mine, and not those of any of the other authors of the sliding-right paper. I’d like to thank my coauthors for the opportunity to join this project.

Swing Dancer Profile

Published 2018-08-07 in sections English, Reisen.

During my two years in Philadelphia (I was a post-doc with Stephanie Weirich at the University of Pennsylvania) I danced a lot of Swing, in particular at the weekly social “Jazz Attack”.

They run a blog that features dancers, and this week – just in time for my last dance – they featured me with a short interview.

The merits of a yellow-red phase

Published 2018-07-30 in sections English, Reisen.

In yesterday’s post on Continuity Lines in North America I mentioned in passing that I am a big fan of German1 traffic lights, which have a brief yellow-red phase between the red and the green phase (3 to 5 seconds, depending on road speed). A German reader of my blog was surprised by this, said that he considers it pointless, and would like to hear my reasons. So here we go…

Life without Yellow-Red

Lights that switch directly from red to green cause more stress. Picture yourself as the first car waiting at a traffic light, with a bunch of cars behind you. Can you relax, maybe observe the cars passing in front of you, switch the radio station, or simply look somewhere else for a moment? Well, you can, but you risk missing the moment the light switches from red to green. When you look at the traffic light again and see it bright green, you have no idea how long it has been green. Hyper-aware of all the cars behind you waiting to get going, you’ll rush to get started, and if you don’t do that really fast, surely one of the people waiting behind you will have honked.

So at the next intersection, you’d better not take your eyes off the light – or, alternatively, you develop a screw-you-I-don’t-care attitude towards the other cars, which might allow you to relax in this situation, but is in general not desirable.

Maybe this is less of a problem on the West Coast, where supposedly everybody is nice, relaxed, and patient, and you can take all the time you want to get rolling. But in the stereotypical East Coast traffic full of angry, impatient and antisocial drivers, you really don’t get away with that.

Life with Yellow-Red

The yellow-red phase solves this problem elegantly: As long as the light is red, you don’t have to watch the light constantly and with maximum attention. You can relax: it suffices to check it often enough to catch the yellow-red phase, once every two or three seconds. When you see it, you get ready to start; and when it actually turns green, you start on time.

I would expect that it is not only less stressful, but also more efficient: Because every car in the lane has the heads-up warning “green in a moment”, the cars can start rolling in quicker succession. Without the warning, every car has to account much more for the possibly slower reaction of the car in front.

PS: A friend of mine wonders whether the German Yellow-Red is needed because cars with manual transmissions are much more common in Germany than in the US, and you need more time to, well, get into gear with these cars.


  1. Also Great Britain, Croatia, Latvia, Norway, Austria, Poland, Russia, Saudi Arabia, Sweden, Switzerland, Hungary and others.

Continuity Lines

Published 2018-07-29 in sections English, Reisen.

I am currently on a road trip from Philly going north along the Atlantic Coast into Canada, aiming for Nova Scotia. When I passed from the US into Canada, I had an unexpected emotional response to the highways there: I felt more at home!

And I believe it has to do with the pavement markings on US vs. Canadian freeways.

Consider this image, taken from the Manual on Uniform Traffic Control Devices of the United States, an official document describing (among other things) how the pavement of a freeway ought to be marked:

Parallel Deceleration Lane for Exit Ramp

This is a typical exit ramp from the freeway. On-ramps and cloverleaf ramps look similar. Note that the right-most lane goes somewhere other than the left and the middle lanes, yet the lanes look completely identical. In particular, the lines between the lanes are shaped the same!

Now, for comparison, the corresponding image in a Canadian manual, namely the Manual of Standard Traffic Signs & Pavement Markings for British Columbia:

Canadian off-ramps

Here, there are different lines between the different lanes: normal lane lines left, but a so-called continuity line, with a distinctly different pattern, between the normal lanes and the exit lane. It’s like in Germany!

With this in mind, I understand one reason1 why driving in the US2 is noticeably more stressful: There is just always anxiety about whether you are accidentally in an exit lane!

Update (2018-07-30): As Paul Johnson pointed out (see comment below), I was looking at an old version of the MUTCD. The current version, from 2009, has these lines:

Current US off-ramps

They have published a (very long) document describing the changes in the new version of the manual, and Section 306 describes the rationale:

[..] the FHWA adopts language to clarify that dotted lane lines, rather than broken lane lines, are to be used for non-continuing lanes, including acceleration lanes, deceleration lanes, and auxiliary lanes. [..] a number of States and other jurisdictions currently follow this practice, which is also the standard practice in Europe and most other developed countries. The FHWA believes that the existing use of a normal broken lane line for these non-continuing lanes does not adequately inform road users of the lack of lane continuity ahead and that the standardized use of dotted lane lines for non-continuing lanes as adopted in this final rule will better serve this important purpose in enhancing safety and uniformity.

So all is well! But it means that either Pennsylvania was slower than allowed in implementing these changes (the deadline was December 2016), or it was something else altogether that made me feel more at home on the Canadian freeway.


  1. I say “one reason”, not “the reason”, because there are many more – “Rechtsfahrgebot”, no red-and-yellow-phase in the traffic light, Pennsylvanian road quality…

  2. Supposedly, Pennsylvania is particularly bad with roads in general, but also with this particular problem, and California has exit lanes clearly separated. But, of course, because this is the US, they do not use the same pattern as the others (Canada, Europe), but spaced dots…

Build tool semantic aware build systems

Published 2018-07-28 in sections English, Digital World.

I just had a lovely visit with Ben Gamari and Laura Dietz, and at the breakfast table we had an idea that I want to record here.

The problem

Laura and Ben talked about the struggles they had using build systems like make or Nix in data science applications. A build system like Nix is designed around the idea that builds are relatively cheap, and that any change in a dependency ought to trigger a rebuild, just to be sure that all build outputs are up-to-date. This is a perfectly valid assumption when building software, where build times are usually minutes, or maybe hours. But when some of the data processing takes days, then you really, really want to avoid any unnecessary builds.

One way of avoiding unnecessary rebuilds that is supported by build systems like shake and (I presume, maybe with some extra work) Nix is output hashing: If one build stage has its input changed and needs to be re-run, but its output is actually unchanged, then later build stages do not have to be re-run. Great!

But even with this feature in place, one common problem remains: Build tools! Your multi-gigabyte data is going to be processed by some code you write. What if that code changes? Clearly, if you change the algorithm, or fix a bug in your code, you want the output to be recomputed. But if you just change some strings in the --help output, or update some libraries, or do something else that does not change the program in a way that is significant for the task at hand, wouldn’t you prefer not to pay for that with a futile multi-day data processing step?

Existing approaches

There are various ways you can tackle this these days:

  • You bite the bullet and add the build tool as a dependency of the processing step. You get the assurance that your data always reflects the output of the latest version of your tool, but you get lots of rebuilds.

  • You don’t track the tool as part of your build system. It is now completely up to you to decide when the tool has changed in significant ways. When you think it has, you throw away all build artifacts and start from scratch. Very crude, and easy to get wrong.

  • You keep track of a “build tool version”, e.g. a text file with a number, that you depend on in lieu of the build tool itself. When you make a change that you think is significant, you bump that version number. This is similar to the previous approach, but more precise: Only the build outputs that used this particular tool will be invalidated, and it also scales better to working in a team. But of course, you can still easily get it wrong (see the sketch below).
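A minimal Shake sketch of this “version file” approach; the file names and the tool invocation here are made up for illustration.

import Development.Shake

main :: IO ()
main = shakeArgs shakeOptions $ do
  want ["processed.dat"]
  "processed.dat" %> \out -> do
    -- depend on the version marker instead of the tool binary itself
    need ["input.dat", "tool-version.txt"]
    cmd_ "my-tool" "input.dat" out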

Build tool semantics

Why are all these approaches so unsatisfying? Because none of them allows you to express the intent, which is to say “this build step depends on the semantics (i.e. behaviour) of the build tool”. If we could somehow specify that, then all would be great: Build tool changes, but its semantics is unaffected? No rebuild. Build tool changes, semantics change? Rebuild.

This ideal is not reachable (blah blah halting problem blah blah) – but I believe we can approximate it. At least if good practices were used and the tool has a test suite!

Assume for now that the tool is a simple batch-processing tool, i.e. it takes some input and produces some output. Assume further that there is a test suite with a small but representative set of example inputs, or maybe some randomly generated inputs (using a fixed seed). If the test suite is good, then the (hash of the) output on the test suite examples is an approximation of the program’s semantics.

And the build system can use this “semantics hash”. If the build tool changes, the build system can re-run the test suite and compare the output with that of the previous run. If they differ, then the tool seems to have changed in significant ways, and it needs to re-run the data processing. But if the test suite outputs are unchanged, then it can assume that the behaviour of the tool has not changed, and re-use the existing data outputs.
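A minimal sketch of such a “semantics hash” in Haskell; the tool path and the test inputs are hypothetical, and a real build system would of course store and cache this value.

import Data.Hashable (hash)
import System.Process (readProcess)

-- Run the tool on each test input (fed via stdin) and hash the concatenated
-- outputs. The expensive processing step then depends on this hash instead
-- of on the tool binary.
semanticsHash :: FilePath -> [String] -> IO Int
semanticsHash tool testInputs = do
  outputs <- mapM (readProcess tool []) testInputs
  pure (hash outputs)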

That is the core idea! A few additional thoughts, though:

  • What if the test suite changes? If the above idea is implemented naively, then adding a test case to the test suite will change the semantics hash, and re-build everything. That would be horrible in terms of incentives! So instead, the build system should keep the old version of the build tool around, re-calculate its semantics hash based on the new test suite, and then compare that. This way, extending the test suite does not cause recompilation.

Hash-and-store-based build systems like Nix should have no problem keeping the previous version of the build tool around as long as there is output that depends on it.

  • A consequence of this approach is that if you find and fix a bug in your tool that is not covered by the existing test suite, then you absolutely have to add a test for that to your test suite, otherwise the bug fix will not actually make it to your data output. This might be the biggest downside of the approach, but at least it sets incentives right in that it makes you maintain a good regression test suite.

  • What if things go wrong, the test suite is incomplete and the build system re-uses build output because it wrongly assumes that two versions of the build tool have the same semantics?

If done right, this can be detected and fixed: The build system needs to record which tool version (and not just which “semantics hash”) was used to produce a particular output. So once the test suite uncovers the difference, the build system will no longer consider the two versions equivalent and – after the fact – invalidate the re-use of the previous build output, and re-build what needs to be rebuilt.

I am curious to hear whether anybody has played with these or similar ideas before. How did it go? Would it be possible to implement this in Nix? In Shake? Other systems? Tell me your thoughts!

WebGL, Fragment Shader, GHCJS and reflex-dom

Published 2018-07-22 in sections English, Haskell.

What a potpourri of topics... too long to read? Click here!

On the side and very slowly I am working on a little game that involves breeding spherical patterns… more on that later (maybe). I want to implement it in Haskell, but have it run in the browser, so I reached for GHCJS, the Haskell-to-Javascript compiler.

WebGL for 2D images

A crucial question was: How do I draw a generative pattern onto an HTML canvas element? My first attempt was to calculate the pixel data into a bit array and use putImageData() to push it onto the canvas, but it was prohibitively slow. I might have done something stupid along the way, and some optimization might have helped, but I figured that I should not calculate the colors of each pixel myself, but leave this to those who are best at it: the browser and (ideally) the graphics card.

So I took this as an opportunity to learn about WebGL, in particular fragment shaders. The term shader is misleading, and should mentally be replaced with “program”, because it is no longer (just) about shading. WebGL is intended for 3D graphics: one sends a bunch of coordinates for triangles, a vertex shader and a fragment shader to the browser. The vertex shader places the vertices, while the fragment shader colors each pixel on the visible triangles. This is a gross oversimplification, but that is fine: We only really care about the last step, and if our coordinates always just define a rectangle that fills the whole canvas, and the vertex shader does not do anything interesting, then what remains is an HTML canvas that takes a program (written in the GL shader language), which is run for each pixel and calculates the color to be shown at that pixel.

Perfect! Just what I need. Dynamically creating a program that renders the pattern I want to show is squarely within Haskell’s strengths.

A reflex-dom widget

As my game UI grows, I will at some point no longer want to deal with raw DOM access, events etc., and the abstraction that makes creating such user interfaces painless is Functional Reactive Programming (FRP). One of the main mature implementations is Ryan Trinkle's reflex-dom, and I want to use this project to get more hands-on experience with it.

Based on my description above, once I hide all the details of the WebGL canvas setup, what I really have is a widget that takes a text string (representing the fragment shader), and creates a DOM element for it. This would suggest a function with this type signature

fragmentShaderCanvas ::
    MonadWidget t m =>
    Dynamic t Text ->
    m ()

where the input text is dynamic, meaning it can change over time (and the canvas will be updated accordingly). In fact, I also want to specify attributes for the canvas (especially width and height), and if the supplied fragment shader source is invalid and does not compile, I want to get my hands on the error messages, as provided by the browser. So I ended up with this:

fragmentShaderCanvas ::
    MonadWidget t m =>
    Map Text Text ->
    Dynamic t Text ->
    m (Dynamic t (Maybe Text))

which very pleasingly hides all the complexity of setting up the WebGL context from the user. This is abstraction at excellence!

I published this widget in the reflex-dom-fragment-shader-canvas package on Hackage (https://hackage.haskell.org/package/reflex-dom-fragment-shader-canvas).
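Here is a minimal usage sketch; it assumes the package exports the widget from a module named Reflex.Dom.FragmentShaderCanvas, and the attribute map and shader source are made up for illustration.

{-# LANGUAGE OverloadedStrings #-}
import Reflex.Dom
import Reflex.Dom.FragmentShaderCanvas (fragmentShaderCanvas)
import qualified Data.Map as Map

main :: IO ()
main = mainWidget $ do
  let shader = "void main() { gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); }"
  errors <- fragmentShaderCanvas
              (Map.fromList [("width", "300"), ("height", "300")])
              (constDyn shader)
  -- display compile errors (if any) below the canvas
  el "pre" $ dynText (maybe "" id <$> errors)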

A Demo

And because reflex-dom makes it so nice, I created a little demo program; it is essentially a fragment shader playground!

On https://nomeata.github.io/reflex-dom-fragment-shader-canvas/ you will find a text area where you can edit the fragment shader code. All your changes are immediately reflected in the canvas on the right, and in the list of warnings and errors below the text area. The code for this demo is pretty short.

A few things could be improved, of course: For example, the canvas element should have its resolution automatically adjusted to its actual size on screen, but it is somewhat tricky to find out when and if a DOM element has changed size. Also, the WebGL setup should be rewritten to be more defensive, and fail more gracefully if things go wrong.

BTW, if you need a proper shader playground, check out Shadertoy.

Development and automatic deployment

The reflex authors all use Nix as their development environment, and if you want to use reflex-dom, then using Nix is certainly the path of least resistance. But I would like to point out that it is not a necessity, and you can stay squarely in cabal land if you want:

  • You don’t actually need ghcjs to develop your web application: reflex-dom builds on jsaddle which has a mode where you build your program using normal GHC, and it runs a web server that your browser connects to. It works better with Chrome than with Firefox at the moment, but is totally adequate to develop a program.

  • If you do want to install ghcjs, then it is actually relatively easy: The README on the ghc-8.2 branch of GHCJS tells you how to build and install GHCJS with cabal new-build.

  • cabal itself supports ghcjs just like ghc! Just pass --ghcjs -w ghcjs to it.

  • Because few people use ghcjs and reflex with cabal, some important packages (ghcjs-base, reflex, reflex-dom) are not on Hackage, or only in old versions. You can point cabal to local checkouts using a cabal.project file, or even directly to the git repositories. But it is simpler to just use a Hackage overlay that I created with these three packages, until they are uploaded to Hackage.

  • If the application you create is a pure client-based program and could therefore be hosted on any static web host, wouldn’t it be nice if you could just have it appear somewhere in the internet whenever you push to your project? Even that is possible, as I describe in an example repository!

It uses Travis CI to build GHCJS and the dependencies, caches them, builds your program and – if successful – uploads the result to GitHub Pages. In fact, the demo linked above is produced using that. Just push, and 5 minutes later the changes are available online!

I know about rumors that Herbert’s excellent multi-GHC PPA repository might provide .deb packages with GHCJS prebuilt soon. Once that happens, and maybe ghcjs-base and reflex get uploaded to Hackage, then the power of reflex-based web development will be conveniently available to all Haskell developers (even those who shunned Nix so far), and I am looking forward to many cool projects coming out of that.

The diameter of German+English

Published 2018-05-23 in sections English, Digital World.

Languages never map directly onto each other. The English word fresh can mean frisch or frech, but frisch can also be cool. Jumping from one word to another like this yields entertaining sequences that take you to completely different things. Here is one I came up with:

frech – fresh – frisch – cool – abweisend – dismissive – wegwerfend – trashing – verhauend – banging – Geklopfe – knocking – …

And I could go on … but how far? So here is a little experiment I ran:

  1. I obtained a German-English dictionary. Conveniently, after registration, you can get dict.cc’s translation file, which is simply a text file with three columns: German, English, Word form.

  2. I wrote a program that takes these words and first canonicalizes them a bit: removing attributes like [ugs.], [regional], {f}, the to in front of verbs, and other embellishments.

  3. I created the undirected, bipartite graph of all these words. This is a pretty big graph – ~750k words in each language, a million edges. A path in this graph is precisely a sequence like the one above.

  4. In this graph, I tried to find a diameter: the longest shortest path, i.e. the largest number of translation steps needed to connect two words that are connected at all (a naive sketch of this step follows below).
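Here is such a sketch in Haskell – not the program I actually ran, and far too slow for the full graph, but it shows the idea: build the adjacency map (step 3), run a breadth-first search from every word, and take the largest distance found (step 4).

import qualified Data.Map.Strict as M

type Graph = M.Map String [String]

-- Step 3: an undirected graph from (German, English) translation pairs.
mkGraph :: [(String, String)] -> Graph
mkGraph pairs = M.fromListWith (++)
  (concat [ [(de, [en]), (en, [de])] | (de, en) <- pairs ])

-- Breadth-first search: distance from a start word to every reachable word.
bfs :: Graph -> String -> M.Map String Int
bfs g start = go (M.singleton start 0) [start]
  where
    go dist []     = dist
    go dist (v:vs) =
      let next  = [ w | w <- M.findWithDefault [] v g, not (w `M.member` dist) ]
          dist' = foldr (\w -> M.insert w (dist M.! v + 1)) dist next
      in  go dist' (vs ++ next)

-- Step 4: the longest shortest path found from any start word.
diameter :: Graph -> Int
diameter g = maximum [ maximum (M.elems (bfs g v)) | v <- M.keys g ]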

Because the graph is big (and my code is maybe not fully optimized), it ran for a few hours, but here it is: The English expression be annoyed by sb. and the German noun Icterus are related by 55 translations. Here is the full list:

  • be annoyed by sb.
  • durch jdn. verärgert sein
  • be vexed with sb.
  • auf jdn. böse sein
  • be angry with sb.
  • jdm. böse sein
  • have a grudge against sb.
  • jdm. grollen
  • bear sb. a grudge
  • jdm. etw. nachtragen
  • hold sth. against sb.
  • jdm. etw. anlasten
  • charge sb. with sth.
  • jdn. mit etw. [Dat.] betrauen
  • entrust sb. with sth.
  • jdm. etw. anvertrauen
  • entrust sth. to sb.
  • jdm. etw. befehlen
  • tell sb. to do sth.
  • jdn. etw. heißen
  • call sb. names
  • jdn. beschimpfen
  • abuse sb.
  • jdn. traktieren
  • pester sb.
  • jdn. belästigen
  • accost sb.
  • jdn. ansprechen
  • address oneself to sb.
  • sich an jdn. wenden
  • approach
  • erreichen
  • hit
  • Treffer
  • direct hit
  • Volltreffer
  • bullseye
  • Hahnenfuß-ähnlicher Wassernabel
  • pennywort
  • Mauer-Zimbelkraut
  • Aaron's beard
  • Großkelchiges Johanniskraut
  • Jerusalem star
  • Austernpflanze
  • goatsbeard
  • Geißbart
  • goatee
  • Ziegenbart
  • buckhorn plantain
  • Breitwegerich / Breit-Wegerich
  • birdseed
  • Acker-Senf / Ackersenf
  • yellows
  • Gelbsucht
  • icterus
  • Icterus

Pretty neat!

So what next?

I could try to obtain an even longer chain by forgetting whether a word is English or German (and lower-casing everything), thus allowing wild jumps like hat – hut – hütte – lodge.

Or write a tool where you can enter two arbitrary words and it finds such a path between them, if there exists one. Unfortunately, it seems that the terms of the dict.cc data dump would not allow me to create such a tool as a web site (but maybe I can ask).

Or I could throw in additional languages!

What would you do?

Update (2018-06-17):

I ran the code again, this time lower-casing all words, and allowing false-friend translations, as suggested above. The resulting graph has – surprisingly – precisely the same diameter (55), but with a partly different list:

  • peyote
  • peyote
  • mescal
  • meskal
  • mezcal
  • blaue agave
  • maguey
  • amerikanische agave
  • american agave
  • jahrhundertpflanze
  • century plant
  • fächerlilie
  • tumbleweed
  • weißer fuchsschwanz
  • common tumbleweed
  • acker-fuchsschwanz / ackerfuchsschwanz
  • rough pigweed
  • grünähriger fuchsschwanz
  • love-lies-bleeding
  • stiefmütterchen
  • pansy
  • schwuchtel
  • fruit
  • ertrag
  • gain
  • vorgehen
  • approach
  • sich an jdn. wenden
  • address oneself to sb.
  • jdn. ansprechen
  • accost sb.
  • jdn. belästigen
  • pester sb.
  • jdn. traktieren
  • abuse sb.
  • jdn. beschimpfen
  • call sb. names
  • jdn. etw. heißen
  • tell sb. to do sth.
  • jdm. etw. befehlen
  • entrust sth. to sb.
  • jdm. etw. anvertrauen
  • entrust sb. with sth.
  • jdn. mit etw. [dat.] betrauen
  • charge sb. with sth.
  • jdm. etw. zur last legen
  • hold sth. against sb.
  • jdm. etw. nachtragen
  • bear sb. a grudge
  • jdm. grollen
  • have a grudge against sb.
  • jdm. böse sein
  • be angry with sb.
  • auf jdn. böse sein
  • be mad at sb.
  • auf jdn. einen (dicken) hals haben

Note that there is not actually a false friend in this list – it seems that adding the edges just changed the order of edges in the graph representation, and my code happened to find a different diameter.

Proof reuse in Coq using existential variables

Published 2018-05-18 in sections English, Coq.

This is another technical post that is of interest only to Coq users.

TL;DR: Using existential variables for hypotheses allows you to easily refactor a complicated proof into an induction schema and the actual proofs.

Setup

As a running example, I will use a small theory of “bags”, which you can think of as lists represented as trees, to allow an O(1) append operation:

Require Import Coq.Arith.Arith.
Require Import Psatz.
Require FunInd.

(* The data type *)
Inductive Bag a : Type :=
  | Empty : Bag a
  | Unit  : a -> Bag a
  | Two   : Bag a -> Bag a -> Bag a.

Arguments Empty {_}.
Arguments Unit {_}.
Arguments Two {_}.

Fixpoint length {a} (b : Bag a) : nat :=
  match b with
  | Empty     => 0
  | Unit _    => 1
  | Two b1 b2 => length b1 + length b2
  end.

(* A smart constructor that ensures that a [Two] never
   has [Empty] as subtrees. *)
Definition two {a} (b1 b2 : Bag a) : Bag a := match b1 with
  | Empty => b2
  | _ => match b2 with | Empty => b1
                       | _ => Two b1 b2 end end.

Lemma length_two {a} (b1 b2 : Bag a) :
  length (two b1 b2) = length b1 + length b2.
Proof. destruct b1, b2; simpl; lia. Qed.

(* A first non-trivial function *)
Function take {a : Type} (n : nat) (b : Bag a) : Bag a :=
  if n =? 0
  then Empty
  else match b with
       | Empty     => b
       | Unit x    => b
       | Two b1 b2 => two (take n b1) (take (n - length b1) b2)
       end.

The theorem

The theorem that I will be looking at in this post describes how length and take interact:

Theorem length_take''':
  forall {a} n (b : Bag a),
  length (take n b) = min n (length b).

Before I dive into it, let me point out that this example itself is too simple to warrant the techniques that I will present in this post. I have to rely on your imagination to scale this up and appreciate the effect on significantly bigger proofs.

Naive induction

How would we go about proving this lemma? Surely, induction is the way to go! And indeed, this is provable using induction (on the Bag) just fine:

Proof.
  intros.
  revert n.
  induction b; intros n.
  * simpl.
    destruct (Nat.eqb_spec n 0).
    + subst. rewrite Nat.min_0_l. reflexivity.
    + rewrite Nat.min_0_r. reflexivity.
  * simpl.
    destruct (Nat.eqb_spec n 0).
    + subst. rewrite Nat.min_0_l. reflexivity.
    + simpl. lia.
  * simpl.
    destruct (Nat.eqb_spec n 0).
    + subst. rewrite Nat.min_0_l. reflexivity.
    + simpl. rewrite length_two, IHb1, IHb2. lia.
Qed.

But there is a problem: A proof by induction on the Bag argument immediately creates three subgoals, one for each constructor. But that is not how take is defined: it first checks the value of n, independently of the constructor. This means that we have to do the case split and the proof for the case n = 0 three times, although they are identical. It’s a one-line proof here, but imagine something bigger...

Proof by fixpoint

Can we refactor the proof to handle the case n = 0 first? Yes, but not with a simple invocation of the induction tactic. We could do well-founded induction on the length of the argument, or we can do the proof using the more primitive fix tactic. The latter is a bit hairy, you won’t know if your proof is accepted until you do Qed (or check with Guarded), but when it works it can yield some nice proofs.

Proof.
  intros a.
  fix IH 2.
  intros.
  rewrite take_equation.
  destruct (Nat.eqb_spec n 0).
  + subst n. rewrite Nat.min_0_l. reflexivity.
  + destruct b.
    * rewrite Nat.min_0_r. reflexivity.
    * simpl. lia.
    * simpl. rewrite length_two, !IH. lia.
Qed.

Nice: we eliminated the duplication of proofs!

A functional induction lemma

Again, imagine that we jumped through more hoops here ... maybe some well-founded recursion with a tricky size measure and complex proofs that the measure decreases ... or maybe you need to carry around an invariant about your arguments and you have to work hard to satisfy the assumption of the induction hypothesis.

As long as you do only one proof about take, that is fine. As soon as you do a second proof, you will notice that you have to repeat all of that, and it can easily make up most of your proof...

Wouldn’t it be nice if you could do the common parts of the proofs only once, obtain a generic proof scheme that you can use for (most) proofs about take, and then just fill in the blanks?

Incidentally, the Function command provides precisely that:

take_ind
     : forall (a : Type) (P : nat -> Bag a -> Bag a -> Prop),
       (forall (n : nat) (b : Bag a), (n =? 0) = true -> P n b Empty) ->
       (forall (n : nat) (b : Bag a), (n =? 0) = false -> b = Empty -> P n Empty b) ->
       (forall (n : nat) (b : Bag a), (n =? 0) = false -> forall x : a, b = Unit x -> P n (Unit x) b) ->
       (forall (n : nat) (b : Bag a),
        (n =? 0) = false ->
        forall b1 b2 : Bag a,
        b = Two b1 b2 ->
        P n b1 (take n b1) ->
        P (n - length b1) b2 (take (n - length b1) b2) ->
        P n (Two b1 b2) (two (take n b1) (take (n - length b1) b2))) ->
       forall (n : nat) (b : Bag a), P n b (take n b)

which is great if you can use Function (although not perfect – we’d rather see n = 0 instead of (n =? 0) = true), but often Function is not powerful enough to define the function you care about.

Extracting the scheme from a proof

We could define our own take_ind' by hand, but that is a lot of work, we may not get it right easily, and when we change our functions, there is now this big proof statement to update.

Instead, let us use existentials, which are variables where Coq infers their type from how we use them, so we don’t have to declare them. Unfortunately, Coq does not support writing just

Lemma take_ind':
  forall (a : Type) (P : nat -> Bag a -> Bag a -> Prop),
  forall (IH1 : ?) (IH2 : ?) (IH3 : ?) (IH4 : ?),
  forall n b, P n b (take n b).

where we just leave out the type of the assumptions (Isabelle allows this), but we can fake it using a generic technique.

We begin by stating an auxiliary lemma using a sigma type to say “there exist some assumptions that are sufficient to show the conclusion”:

Lemma take_ind_aux:
  forall a (P : _ -> _ -> _ -> Prop),
  { Hs : Prop |
    Hs -> forall n (b : Bag a), P n b (take n b)
  }.

We use the eexists tactic (existential exists, see https://coq.inria.fr/refman/proof-engine/tactics.html#coq:tacv.eexists) to construct the sigma type without committing to the type of Hs yet.

Proof.
  intros a P.
  eexists.
  intros Hs.

This gives us an assumption Hs : ?Hs – note the existential type. We need four of those, which we can achieve by writing

  pose proof Hs as H1. eapply proj1 in H1. eapply proj2 in Hs.
  pose proof Hs as H2. eapply proj1 in H2. eapply proj2 in Hs.
  pose proof Hs as H3. eapply proj1 in H3. eapply proj2 in Hs.
  rename Hs into H4.

Now we have this goal state:

1 subgoal
a : Type
P : nat -> Bag a -> Bag a -> Prop
H4 : ?Goal2
H1 : ?Goal
H2 : ?Goal0
H3 : ?Goal1
______________________________________(1/1)
forall (n : nat) (b : Bag a), P n b (take n b)

At this point, we start reproducing the proof of length_take: The same approach to induction, the same case splits:

  fix IH 2.
  intros.
  rewrite take_equation.
  destruct (Nat.eqb_spec n 0).
  + subst n.
    revert b.
    refine H1.
  + rename n0 into Hnot_null.
    destruct b.
    * revert n Hnot_null.
      refine H2.
    * rename a0 into x.
      revert x n Hnot_null.
      refine H3.
    * assert (IHb1 : P n b1 (take n b1)) by apply IH.
      assert (IHb2 : P (n - length b1) b2 (take (n - length b1) b2)) by apply IH.
      revert n b1 b2 Hnot_null IHb1 IHb2.
      refine H4.
Defined. (* Important *)

Inside each case, we move all relevant hypotheses into the goal using revert and refine with the corresponding assumption, thus instantiating it. In the recursive case (Two), we assert that P holds for the subterms, by induction.

It is important to end this proof with Defined, and not Qed, as we will see later.

In a next step, we can remove the sigma type:

Definition take_ind' a P := proj2_sig (take_ind_aux a P).

The type of take_ind' is as follows:

take_ind'
     : forall (a : Type) (P : nat -> Bag a -> Bag a -> Prop),
       proj1_sig (take_ind_aux a P) ->
       forall n b, P n b (take n b)

This looks almost like an induction lemma. The assumptions of this lemma have the not very helpful type proj1_sig (take_ind_aux a P), but we can already use this to prove length_take:

Theorem length_take:
  forall {a} n (b : Bag a),
  length (take n b) = min n (length b).
Proof.
  intros a.
  intros.
  apply take_ind' with (P := fun n b r => length r = min n (length b)).
  repeat apply conj; intros.
  * rewrite Nat.min_0_l. reflexivity.
  * rewrite Nat.min_0_r. reflexivity.
  * simpl. lia.
  * simpl. rewrite length_two, IHb1, IHb2. lia.
Qed.

In this case I have to explicitly state P where I invoke take_ind', because Coq cannot figure out this instantiation on its own (it requires higher-order unification, which is undecidable and unpredictable). In other cases I had more luck.

After I apply take_ind', I have this proof goal:

______________________________________(1/1)
proj1_sig (take_ind_aux a (fun n b r => length r = min n (length b)))

which is the type that Coq inferred for Hs above. We know that this is a conjunction of a bunch of assumptions, and we can split it as such, using repeat apply conj. At this point, Coq needs to look inside take_ind_aux; this would fail if we used Qed to conclude the proof of take_ind_aux.

This gives me four goals, one for each case of take, and the remaining proofs really only deal with the specifics of length_take – no more worrying about getting the induction right and doing the case split the right way.

Also note that, very conveniently, Coq uses the same name for the induction hypotheses IHb1 and IHb2 that we used in take_ind_aux!

Making it prettier

It may be a bit confusing to have this proj1_sig in the type, especially when working in a team where others will use your induction lemma without knowing its internals. But we can resolve that, and also turn the conjunctions into normal arrows, using a bit of tactic support. This is completely generic, so if you follow this procedure, you can just copy most of that:

Lemma uncurry_and: forall {A B C}, (A /\ B -> C) -> (A -> B -> C).
Proof. intros. intuition. Qed.
Lemma under_imp:   forall {A B C}, (B -> C) -> (A -> B) -> (A -> C).
Proof. intros. intuition. Qed.
Ltac iterate n f x := lazymatch n with
  | 0 => x
  | S ?n => iterate n f uconstr:(f x)
end.
Ltac uncurryN n x :=
  let n' := eval compute in n in
  lazymatch n' with
  | 0 => x
  | S ?n => let uc := iterate n uconstr:(under_imp) uconstr:(uncurry_and) in
            let x' := uncurryN n x in
            uconstr:(uc x')
end.

With this in place, we can define our final proof scheme lemma:

Definition take_ind'' a P
  := ltac:(let x := uncurryN 3 (proj2_sig (take_ind_aux a P)) in exact x).
Opaque take_ind''.

The type of take_ind'' is now exactly what we’d wish for: All assumptions spelled out, and the n =? 0 already taken care of (compare this to the take_ind provided by the Function command above):

take_ind''
     : forall (a : Type) (P : nat -> Bag a -> Bag a -> Prop),
       (forall b : Bag a, P 0 b Empty) ->
       (forall n : nat, n <> 0 -> P n Empty Empty) ->
       (forall (x : a) (n : nat), n <> 0 -> P n (Unit x) (Unit x)) ->
       (forall (n : nat) (b1 b2 : Bag a),
        n <> 0 ->
        P n b1 (take n b1) ->
        P (n - length b1) b2 (take (n - length b1) b2) ->
        P n (Two b1 b2) (two (take n b1) (take (n - length b1) b2))) ->
       forall (n : nat) (b : Bag a), P n b (take n b)

At this point we can mark take_ind'' as Opaque, to hide how we obtained this lemma.

Our proof does not change a lot; we merely no longer have to use repeat apply conj:

Theorem length_take''':
  forall {a} n (b : Bag a),
  length (take n b) = min n (length b).
Proof.
  intros a.
  intros.
  apply take_ind'' with (P := fun n b r => length r = min n (length b)); intros.
  * rewrite Nat.min_0_l. reflexivity.
  * rewrite Nat.min_0_r. reflexivity.
  * simpl. lia.
  * simpl. rewrite length_two, IHb1, IHb2. lia.
Qed.

Is it worth it?

It was in my case: Applying this trick in our ongoing work of verifying parts of the Haskell compiler GHC separated a somewhat complicated proof into a re-usable proof scheme (go_ind), making the actual proofs (go_all_WellScopedFloats, go_res_WellScoped) much neater and to the point. It saved “only” 60 lines (if I don’t count the 20 “generic” lines above), but the pay-off will increase as I do more proofs about this function.

Avoid the dilemma of the trailing comma

Published 2018-04-30 in sections English, Haskell.

The Haskell syntax uses comma-separated lists in various places and, in contrast to other programming languages, does not allow a trailing comma. If everything goes on one line you write

  (foo, bar, baz)

and everything is nice.

Lining up

But if you want to have one entry on each line, then the obvious plan

  (foo,
   bar,
   baz
  )

is aesthetically unpleasing, and moreover, extending the list by one to

  (foo,
   bar,
   baz,
   quux
  )

modifies two lines, which produces less pretty diffs.

Because it is much more common to append to lists rather than to prepend, Haskellers have developed the idiom of leading comma:

  ( foo
  , bar
  , baz
  , quux
  )

which looks strange until you are used to it, but solves the problem of appending to a list. And we see this idiom in many places:

  • In Cabal files:

      build-depends: base >= 4.3 && < 5
                   , array
                   , deepseq >= 1.2 && < 1.5
  • In module headers:

    {-# LANGUAGE DefaultSignatures
               , EmptyCase
               , ExistentialQuantification
               , FlexibleContexts
               , FlexibleInstances
               , GADTs
               , InstanceSigs
               , KindSignatures
               , RankNTypes
               , ScopedTypeVariables
               , TemplateHaskell
               , TypeFamilies
               , TypeInType
               , TypeOperators
               , UndecidableInstances #-}

Think outside the list!

I started to avoid this pattern where possible. And it is possible everywhere: instead of having a declaration with a list, you can just have multiple declarations. I.e.:

  • In Cabal files:

      build-depends: base >= 4.3 && < 5
      build-depends: array
      build-depends: deepseq >= 1.2 && < 1.5
  • In module headers:

    {-# LANGUAGE DefaultSignatures #-}
    {-# LANGUAGE EmptyCase #-}
    {-# LANGUAGE ExistentialQuantification #-}
    {-# LANGUAGE FlexibleContexts #-}
    {-# LANGUAGE FlexibleInstances #-}
    {-# LANGUAGE GADTs #-}
    {-# LANGUAGE InstanceSigs #-}
    {-# LANGUAGE KindSignatures #-}
    {-# LANGUAGE RankNTypes #-}
    {-# LANGUAGE ScopedTypeVariables #-}
    {-# LANGUAGE TemplateHaskell #-}
    {-# LANGUAGE TypeFamilies #-}
    {-# LANGUAGE TypeInType #-}
    {-# LANGUAGE TypeOperators #-}
    {-# LANGUAGE UndecidableInstances #-}

It is a bit heavier, but it has a number of advantages:

  1. Both appending and prepending works without touching other lines.
  2. It is visually more homogeneous, making it – despite the extra words – easier to spot mistakes.
  3. You can easily sort the declarations alphabetically with your editor.
  4. Especially in Cabal files: If you have a syntax error in your dependency specification (which I always have, writing << instead of < due to my Debian background), cabal will actually give you a helpful error location – it only ever tells you which build-depends stanza was wrong, so if you have only one, then that’s not helpful.

What when it does not work?

Unfortunately, not every list in Haskell can have that treatment, and that’s why the recent GHC proposal on ExtraCommas wants to lift the restriction. In particular, it wants to allow trailing commas in subexport lists:

module Foo
    ( Foo(
        A,
        B,
      ),
      fromFoo,
    )

(Weirdly, export lists already allow trailing commas.) An alternative here might be to write

module Foo
    ( Foo(A),
      Foo(B),
      fromFoo,
    )

and teach the compiler to not warn about the duplicate export of the Foo type.

For plain lists, this idiom can be useful:

list :: [Int]
list = let (>>) = (++) in do
   [ 1 ]
   [ 2 ]
   [ 3 ]
   [ 4 ]

It requires RebindableSyntax, so I do not recommend it for regular code, but it can be useful in a module that is dedicated to holding some generated data or configuration. And of course it works with any binary operator, not just (++).