Joachim Breitner

Build tool semantic aware build systems

Published 2018-07-28 in sections English, Digital World.

I just had a lovely visit with Ben Gamari and Laura Dietz, and at the breakfast table we had an idea that I want to record here.

The problem

Laura and Ben talked about the struggles they had using build systems like make or Nix in data science applications. A build system like Nix is designed around the idea that builds are relatively cheap, and that any change in a dependency ought to trigger a rebuild, just to be sure that all build outputs are up-to-date. This is a perfectly valid assumption when building software, where build times are usually minutes, or maybe hours. But when some of the data processing takes days, then you really, really want to avoid any unnecessary builds.

One way of avoiding unnecessary rebuilds, supported by build systems like shake and (I presume, maybe with some extra work) Nix, is output hashing: if one build stage has its input changed and needs to be re-run, but its output is actually unchanged, then later build stages do not have to be re-run. Great!
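This early-cutoff behaviour can be sketched in a few lines. The following is a hypothetical minimal build cache, not Shake’s or Nix’s actual mechanism: a stage’s downstream consumers only run when the hash of its output actually changed.

```python
import hashlib

def h(data: bytes) -> str:
    """Content hash used to compare build outputs."""
    return hashlib.sha256(data).hexdigest()

# Cache mapping a stage name to the hash of its last output.
output_cache: dict[str, str] = {}

def run_stage(name, build, downstream) -> None:
    """Re-run `build`; only trigger `downstream` if the output changed."""
    out = build()
    new_hash = h(out)
    if output_cache.get(name) != new_hash:
        output_cache[name] = new_hash
        downstream(out)  # output changed: later stages must re-run
    # otherwise: early cutoff, downstream stages are skipped
```

Even if `build` is forced to re-run because its input changed, a byte-identical output stops the rebuild from propagating further.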

But even with this feature in place, one common problem remains: build tools! Your multi-gigabyte data is going to be processed by some code you write. What if the code changes? Clearly, if you change the algorithm, or fix a bug in your code, you want the processing to be re-run. But if you just change some strings in the --help flag, or update some libraries, or do something else that does not change the program in a way that is significant for the task at hand, wouldn’t you prefer not to pay for that with a futile multi-day data processing step?

Existing approaches

There are various ways you can tackle this these days:

  • You bite the bullet, add the build tool as a dependency of the processing step. You get the assurance that your data always reflects the output of the latest version of your tool, but you get lots of rebuilds.

  • You don’t track the tool as part of your build system. It is now completely up to you to decide when the tool has changed in significant ways. When you think it has, you throw away all build artifacts and start from scratch. Very crude, and easy to get wrong.

  • You keep track of a “build tool version”, e.g. a text file with a number, that you depend on in lieu of the build tool itself. When you make a change that you think is significant, you bump that version number. This is similar to the previous approach, but more precise: only the build outputs that used this particular tool will be invalidated, and it also scales better to working in a team. But of course, you can still easily get it wrong.
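The version-file approach amounts to a staleness check like the following sketch, where the file names and the `.toolversion` stamp convention are hypothetical:

```python
from pathlib import Path

def stale(output: Path, version_file: Path) -> bool:
    """An output is stale if the tool version recorded when it was
    built differs from the current contents of the version file."""
    stamp = output.with_suffix(".toolversion")
    current = version_file.read_text().strip()
    return not stamp.exists() or stamp.read_text().strip() != current
```

Bumping the number in the version file invalidates exactly the outputs whose stamps still carry the old number, and nothing else.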

Build tool semantics

Why are all these approaches so unsatisfying? Because none of them lets you express the intent, which is to say “this build step depends on the semantics (i.e. behaviour) of the build tool”. If we could somehow specify that, then all would be great: the build tool changes, but its semantics is unaffected? No rebuild. The build tool changes, and its semantics changes? Rebuild.

This ideal is not reachable (blah blah halting problem blah blah) – but I believe we can approximate it. At least if good practices were used and the tool has a test suite!

Assume for now that the tool is a simple batch-processing tool, i.e. it takes some input and produces some output. Assume further that there is a test suite with a small but representative set of example inputs, or maybe some randomly generated inputs (using a fixed seed). If the test suite is good, then the (hash of the) output on the test suite examples is an approximation of the program’s semantics.

And the build system can use this “semantics hash”. If the build tool changes, the build system can re-run the test suite and compare the output with the previous run. If it changed, then the tool seems to have changed in significant ways, and the build system needs to re-run the data processing. But if the test suite outputs are unchanged, then it can assume that the behaviour of the tool has not changed, and re-use the existing data outputs.
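The core idea fits in a few lines. This is a minimal sketch, assuming the tool can be modelled as a function from input bytes to output bytes:

```python
import hashlib

def semantics_hash(tool, test_inputs) -> str:
    """Approximate a tool's semantics by hashing its outputs
    on a fixed, representative set of test inputs."""
    acc = hashlib.sha256()
    for inp in test_inputs:
        acc.update(tool(inp))
    return acc.hexdigest()
```

Two versions of the tool that differ only in --help strings or library versions produce the same semantics hash; a change that is visible on the test inputs produces a different one, and triggers the rebuild.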

That is the core idea! A few additional thoughts, though:

  • What if the test suite changes? If the above idea is implemented naively, then adding a test case to the test suite will change the semantics hash, and re-build everything. That would be horrible in terms of incentives! So instead, the build system should keep the old version of the build tool around, re-calculate its semantics hash based on the new test suite, and then compare that. This way, extending the test suite does not cause recompilation.

    Hash-and-store-based build systems like Nix should have no problem keeping the previous version of the build tool around as long as there is output that depends on it.
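The comparison this bullet describes can be sketched as follows, re-using the semantics-hash idea from above (again a hypothetical sketch, with the tools modelled as functions on bytes):

```python
import hashlib

def semantics_hash(tool, test_inputs):
    acc = hashlib.sha256()
    for inp in test_inputs:
        acc.update(tool(inp))
    return acc.hexdigest()

def needs_rebuild(old_tool, new_tool, new_tests) -> bool:
    """When the test suite itself changed, evaluate the *old* tool on
    the *new* tests too, so that extending the suite alone never
    invalidates anything: only a behavioural difference between the
    two tool versions does."""
    return semantics_hash(old_tool, new_tests) != \
           semantics_hash(new_tool, new_tests)
```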

  • A consequence of this approach is that if you find and fix a bug in your tool that is not covered by the existing test suite, then you absolutely have to add a test for that to your test suite, otherwise the bug fix will not actually make it to your data output. This might be the biggest downside of the approach, but at least it sets incentives right in that it makes you maintain a good regression test suite.

  • What if things go wrong, the test suite is incomplete and the build system re-uses build output because it wrongly assumes that two versions of the build tool have the same semantics?

    If done right, this can be detected and fixed: the build system needs to record which tool version (and not just which “semantics hash”) was used to produce a particular output. So once the test suite uncovers the difference, the build system will no longer consider the two versions equivalent and will, after the fact, invalidate the re-use of the previous build output and rebuild what needs to be rebuilt.
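This after-the-fact invalidation can be sketched as follows. The record of exact tool versions, together with the old tool binaries that a store-based build system keeps around, lets equivalence be re-checked under the current test suite (a hypothetical sketch, with tools again modelled as functions on bytes):

```python
import hashlib

def semantics_hash(tool, tests):
    acc = hashlib.sha256()
    for t in tests:
        acc.update(tool(t))
    return acc.hexdigest()

# The build system records the exact tool version that produced each
# output, not just the semantics hash it had at the time.
built_with = {}   # output name -> tool version
tools = {}        # tool version -> callable (kept around, as in Nix)

def reusable(output, current_version, tests):
    """Re-check equivalence under the *current* test suite. If the
    suite has since grown a test that distinguishes the versions,
    the old output is invalidated after the fact."""
    old_version = built_with[output]
    if old_version == current_version:
        return True
    return semantics_hash(tools[old_version], tests) == \
           semantics_hash(tools[current_version], tests)
```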

I am curious to hear whether anybody has played with these or similar ideas before. How did it go? Would it be possible to implement this in Nix? In Shake? In other systems? Tell me your thoughts!


My immediate thought upon seeing your blog post (as well as “yay another smart person thinking about build systems”) was to wonder if you’d seen the recent “Build Systems à la Carte” paper by Andrey Mokhov, Neil Mitchell and Simon Peyton Jones? It’s definitely worth a look!

#1 Jonathan Dowland am 2018-07-30

I’ve been thinking about these issues for a while too (a while = “the last 10 years, but not my #1 job”) and the idea of linking to the test suite is interesting. Here is what I have done in the past in this domain:

For jug, I ended up with the “invalidate” command that says “consider all the computations done with this step to be invalid”. This works recursively so that transitive dependencies will also be recomputed. It’s error-prone, but I worried about the incentives of automatic detection and it’s hard to do it reliably in Python anyway.

For ngless, we explicitly set a version and hash it, but for example, both ngless “0.9.0” and “0.9.1” are identified as “0.9”. Furthermore, if we run a script that asks for “ngless 0.8”, we run it in that mode so that it computes as if it had run with ngless 0.8.0 or 0.8.1.

This depends on the tool making a guarantee to its users that there are no semantic changes from 0.9.0 -> 0.9.1, but it is easier on the users.
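The versioning scheme Luis describes could be sketched like this (a hypothetical helper, not ngless’s actual code): only the major.minor prefix enters the hash, so patch releases are identified.

```python
def semantic_version(full_version: str) -> str:
    """Identify e.g. "0.9.0" and "0.9.1" as the same "0.9": the tool
    promises no semantic changes within a patch series, so only the
    major.minor prefix participates in the build hash."""
    major, minor = full_version.split(".")[:2]
    return f"{major}.{minor}"
```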

Our test suite does have to be updated for every version bump as a few tests include the version in the output! We don’t consider that in our internal hashing, but an automated system might get tripped up about this.

#2 Luis Pedro Coelho am 2018-07-30

Depending on a version string is of course good, but it requires you to predict in advance when things change. Depending on the test suite output removes the “predict” requirement, and if the build system allows you to update the test suite independently of the question of which build artifacts can be re-used, that removes the “in advance” requirement. These two ideas are somewhat orthogonal.

#3 Joachim am 2018-08-04

By the way, this might be a bit of prior art for the build system discussion we had last week:

#4 Ben Gamari am 2018-07-30

Have something to say? You can post a comment by sending an e-mail to me at <>, and I will include it here.