Joachim Breitner

Thoughts on bootstrapping GHC

Published 2018-12-13 in sections English, Haskell.

I am returning from the Reproducible Builds Summit 2018 in Paris. The latest hottest thing within the reproducible-builds project seems to be bootstrapping: How can we build a whole operating system from just and only source code, using very little, or even no, binary seeds or auto-generated files? This is actually a concern that is somewhat orthogonal to reproducibility: Bootstrappable builds help me in trusting programs that I built, while reproducible builds help me in trusting programs that others built.

And while they are making good progress bootstrapping a full system from just a C compiler written in Scheme, and a Scheme interpreter written in C, that can build each other (Janneke’s mes project), and there are plans to build that on top of stage0, which starts with just 280 bytes of binary, the situation looks pretty bad when it comes to Haskell.

Unreachable GHC

The problem is that contemporary Haskell has only one viable implementation, GHC. And GHC, written in contemporary Haskell, needs GHC to be built. So essentially everybody out there either just downloads a binary distribution of GHC, or builds GHC from source using a possibly older (but not much older) version of GHC that they already have. Even distributions like Debian do nothing different: When they build the GHC package, the builders use, well, the GHC package.

There are other Haskell implementations out there. But if they are mature and actively developed, then they are implemented in Haskell themselves, often even using advanced features that only GHC provides. And even those are insufficient to build GHC itself, let alone the old and abandoned Haskell implementations.

In all these cases, at some point an untrusted binary is used. This is very unsatisfying. What can we do? I don’t have the answers, but please allow me to outline some avenues of attack.

Retracing history

Obviously, even GHC has not existed since the beginning of time, and the first versions surely were built using something other than GHC. The oldest version of GHC for which we can find a release on the GHC web page is version 0.29 from July 1996. But its installation instructions state:

GHC 0.26 doesn't build with HBC. (It could, but we haven't put in the effort to maintain it.)

GHC 0.26 is best built with itself, GHC 0.26. We heartily recommend it. GHC 0.26 can certainly be built with GHC 0.23 or 0.24, and with some earlier versions, with some effort.

GHC has never been built with compilers other than GHC and HBC.

So it seems that besides GHC, only HBC was ever used to compile GHC. HBC is a Haskell compiler of which we can find the sources of one random version, only thanks to archive.org. Parts of it are written in C, so I looked into this: Compile HBC, use it to compile GHC 0.29, and then build every (major) version of GHC step by step until today.

The problem is that it is non-trivial to build software from the 90s using today’s compilers. I briefly looked at the HBC code base, and had to change some files from using varargs.h to stdarg.h, and this is surely just one of many similar stumbling blocks one would hit trying to build these tools. Oh, and even the hbc sources state:

# To get everything done: make universe
# It is impossible to make from scratch.
# You must have a running lmlc, to
# recompile it (of course).

So I learned that actually, most of it is written in LML, and the LML compiler is written in LML. So this is a dead end. (Thanks to Lennart for clearing up a misunderstanding on my side here.)

Going back, but doing it differently

Another approach is to go back in time, to some old version of GHC, but maybe not all the way to the beginning, and then try to use another, officially unsupported, Haskell compiler to build GHC. This is what rekado tried to do in 2017: He used the most contemporary implementation of Haskell in C, the Hugs interpreter. Using this, he compiled nhc98 (yet another abandoned Haskell implementation), with the hope of building GHC with nhc98. He made impressive progress back then, but ran into a problem where the runtime crashed. Maybe someone is interested in picking it up from there?

Removing, simplifying, extending, in the present

Both approaches so far focus on building an old version of GHC. This adds complexity: other tools (the shell, make, yacc etc.) may behave differently now in a way that causes hard-to-debug problems. So maybe it is more fun and more rewarding to focus on today’s GHC? (At this point I am starting to hypothesize.)

I said before that no other existing Haskell implementation can compile today’s GHC code base, because of features like mutually recursive modules, the foreign function interface etc. And also other existing Haskell implementations often come with a different, smaller set of standard libraries, but GHC assumes base, so we would have to build that as well...

But we don’t need to build it all. Surely there is much code in base that is not used by GHC. Also, there is much code in GHC that we do not need to build GHC itself. So by removing all that, we reduce the amount of Haskell code that we need to feed to the other implementation.

The remaining code might use some features that are not supported by our bootstrapping implementation. Mutually recursive modules could be manually merged. GADTs that are only used for additional type safety could be replaced by normal data types, which might make some pattern matches incomplete. Syntactic sugar can be desugared. By simplifying the code base in that way, one might be able to produce a fork of GHC that is within reach of the likes of Hugs or nhc98.
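To make the GADT point concrete, here is a minimal sketch (with invented types, not taken from the GHC code base) of what such a manual downgrade could look like: the indexed Expr needs GADT support, while the plain Expr' compiles with any Haskell 98 implementation, at the price of a pattern match that can now fail at runtime.

```haskell
{-# LANGUAGE GADTs #-}

-- A GADT used only for extra type safety: the index tracks
-- whether an expression is integer- or boolean-valued.
data Expr a where
  IntLit  :: Int  -> Expr Int
  BoolLit :: Bool -> Expr Bool
  Add     :: Expr Int -> Expr Int -> Expr Int

-- The same thing as a plain Haskell 98 data type. The index is
-- gone, so ill-typed cases are no longer rejected by the type
-- checker and must instead fail at runtime.
data Expr' = IntLit' Int | BoolLit' Bool | Add' Expr' Expr'

evalInt :: Expr' -> Int
evalInt (IntLit' n)  = n
evalInt (Add' x y)   = evalInt x + evalInt y
evalInt (BoolLit' _) = error "evalInt: boolean expression"
```

The more such features can be eliminated this way, the smaller the language subset the bootstrapping compiler has to support.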

And if there are features that are hard to remove, maybe we can extend the bootstrapping compiler or interpreter to support them? For example, it was mostly trivial to extend Hugs with support for the # symbol in names – and we can be pragmatic and just allow it always, since we don’t need a standards conforming implementation, but merely one that works on the GHC code base. But how much would we have to implement? Probably this will be more fun in Haskell than in C, so maybe extending nhc98 would be more viable?

Help from beyond Haskell?

Or maybe it is time to create a new Haskell compiler from scratch, written in something other than Haskell? Maybe some other language that is reasonably pleasant to write a compiler in (OCaml? Scala?), but that has the bootstrappability story already sorted out somehow.

But in the end, all variants come down to the same problem: Writing a Haskell compiler for full, contemporary Haskell as used by GHC is hard and really a lot of work – if it were not, there would at least be alternative implementations written in Haskell out there. And as long as nobody comes along and does that work, I fear that we will continue to be unable to build our nice Haskell ecosystem from scratch. Which I find somewhat dissatisfying.

Comments

I read your blog post and afterwards looked at what CFEngine is doing these days. Whenever I looked at CFEngine I didn't like that it's a good idea with a very bad programming language.

If there just were a minimal, core Haskell language with a tiny compiler that one could afford to run on all nodes, even maybe embedded systems...

People also complain that xmonad requires you to install GHC. If there just were a smaller Haskell with a smaller compiler that one wouldn't hesitate to install on their small laptop.

Newcomers to Haskell are often overwhelmed by the many features of the language. If there just were a smaller core that one could call beginner-Haskell and tell people that this is all they need to actually be productive.

Are you aware of an existing definition of such a sub-set of Haskell that would be a usable programming language, easier to implement as a compiler, and resulting in a smaller compiler?

#1 Thomas Koch am 2018-12-13

I don't know any of them in detail, but presumably dialects like Elm, Frege, PureScript might qualify.

But it would not help here, I think: Language features are there because they make programming more pleasant, and compiler writers, too, want to use them. It does not seem viable to expect the GHC authors to restrict themselves to just some minimal core Haskell.

#2 Joachim Breitner am 2018-12-14

The problem is that contemporary Haskell has only one viable implementation, GHC.

I disagree: Hugs is viable.

I downloaded it a few months ago, from source, and compiled it. (There were a few difficulties because I'm on a Windows platform and there were MinGW mis-matches relative to 2006 that Hugs is documented for; but I got there.)

Hugs (in HugsMode) not only complies with Haskell 2010, but has significant extra features like MPTCs, FunDeps, and Overlaps. For several of those extensions, Hugs is better designed and has a more strongly principled implementation than GHC – despite most of the bogusness in GHC having been pointed out many years ago.

As to a records system: H98/GHC's is a constant source of embarrassment, and regularly gets criticised as a reason for not adopting Haskell tout court. Hugs has a records system (TRex) that is at least not embarrassing.

What other definition of "viable" can there be but 'implements the standard'? And given the minimal progress towards Haskell 2020, and that the only topics being talked about are already implemented in Hugs (in fact, were suggested by the Hugs team originally), I'm expecting Hugs will comply with H2020.

So I'd say: the problem with "contemporary Haskell" is people kow-towing to GHC HQ. For some purposes, I find Hugs more viable than GHC. I am annoyed and frustrated about the amount of nit-picking and type-theoretic purity that goes on with GHC whilst no-one is doing anything about the gaping holes in it.

WRT bootstrapping from scratch. I recently came across (sorry I can't remember where) somebody bootstrapping GHC, starting from a very old version and using that to build later releases. How had they started? With Hugs of course: it's the only viable Haskell for bootstrapping.

#3 Anthony Clayden am 2018-12-14

Why is it relevant to mention "make good progress bootstrapping a full system from just a C compiler written in Scheme, and a Scheme interpreter written in C, that can build each other"?

Even supposing that were possible in Haskell: you (presumably) didn't write the Scheme interpreter source nor the C compiler source, so why are they more trustworthy/why is the process more valid than just downloading some random compiler? And presumably they in the deep past relied on some sort of Assembler and operating environment and ... Do we have to go back to somebody punching toggle switches to put the bootstrap OS loader into the bottom 16 words of core?

Specifically, why is this not valid:

  • Compile to (or via) LLVM so that there's architecture-independent object code. (An LLVM interpreter is no less valid than the Scheme interpreter?)
  • Grab some random Haskell compiler, at least modern enough to do this:
  • Compile your preferred version of GHC -- call the result 'mongrel'.
  • Use mongrel to compile your preferred Prelude, FFI, libraries, other infrastructure.
  • Then use mongrel+environment to compile preferred GHC -- call the result 'half-mongrel'.
  • Rinse and repeat until you get to a fixpoint. At each cycle you're diluting the genes from your original grab. That is, compiling using (say) one sixteenth-mongrel produces the same LLVM code as one 32nd-mongrel.

By then haven't you removed any influence from the 'untrustworthy' compiler?

OK there's no guarantee you'll reach a fixpoint. OTOH I'm not seeing why you wouldn't do so eventually.
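The iteration described above can be modelled abstractly. A toy sketch, assuming the build step is a pure, deterministic function from compiler to compiler (here abstracted to any type with equality):

```haskell
-- Iterate a deterministic build step until two successive
-- generations agree, i.e. the compiler reproduces itself exactly.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint build c
  | c' == c   = c              -- fixpoint: nothing left of the seed to dilute
  | otherwise = fixpoint build c'
  where
    c' = build c
```

With a stand-in build step like \n -> min 10 (n + 1), the iteration settles at 10; with a real compiler there is, as noted, no guarantee of termination.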

#4 Anthony Clayden am 2018-12-15

Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.