Joachim Breitner

rec-def: Behind the scenes

Published 2022-09-10 in sections English, Haskell.

A week ago I wrote about the rec-def Haskell library, which allows you to write a larger class of recursive definitions, such as in this small example:

let s1 = RS.insert 23 s2
    s2 = RS.insert 42 s1
in RS.get s1

This will not loop (as it would if you’d just used Data.Set), but rather correctly return the set S.fromList [23,42]. See the previous blog post for more examples and discussion of the user-facing side of this.
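
For reference, here is the example as a complete little program, assuming the RS prefix refers to the library’s Data.Recursive.Set module, as in the previous post:

import qualified Data.Recursive.Set as RS

main :: IO ()
main = print $
    let s1 = RS.insert 23 s2
        s2 = RS.insert 42 s1
    in RS.get s1
-- prints: fromList [23,42]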

For quick reference, these are the types of the functions involved here:

insert :: Ord a => a -> RSet a -> RSet a
get :: RSet a -> Set a

The type of s1 and s2 above is not Set Int, but rather RSet Int, and in this post I’ll explain how RSet works internally.

Propagators, in general

The conceptual model behind a recursive equation like the one above is:

  • There are multiple cells that can hold values of an underlying type (here Set).
  • These cells have relations that explain how the values in the cells should relate to each other.
  • After registering all the relations, some form of solving happens.
  • If the solving succeeds, we can read off the values from the cells, and they should satisfy the registered relations.

This is sometimes called a propagator network, and it is a quite general model: it can support different kinds of relations (e.g. equalities, inequalities, functions), there can be various solving strategies (iterative fixed-points, algebraic solutions, unification, etc.), and information can flow along the edges (and hyper-edges), possibly in multiple directions.

For our purposes, we only care about propagator networks where all relations are functional: each relation has a single output cell whose value is declared to be a function of multiple (possibly zero) input cells, without affecting those input cells. Furthermore, every cell is the output of exactly one such relation.

IO-infested propagator interfaces

This suggests that an implementation of such a propagator network could provide an interface with the following three operations:

  • Functions to declare cells
  • Functions to declare relations
  • Functions to read values off cells

This is clearly an imperative interface, so we’ll need a monad, and we’ll simply use IO. So concretely, for our small example above, we might expect

data Cell a
newCell :: IO (Cell a)
-- declareInsert x c1 c2 declares that the value of c1 ought to be S.insert x c2
declareInsert :: Ord a => a -> Cell a -> Cell a -> IO ()
getCell :: Cell a -> IO (Set a)

There is no need for an explicit “solve” function: solving can happen when declareInsert or getCell is called – as a user, I do not care about that.
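
To see the interface in action: the recursive example from the top, spelled out against it directly, could look like this (a sketch that uses only the operations declared above):

example :: IO (Set Int)
example = do
  c1 <- newCell
  c2 <- newCell
  declareInsert 23 c1 c2   -- c1 ought to be S.insert 23 c2
  declareInsert 42 c2 c1   -- c2 ought to be S.insert 42 c1
  getCell c1               -- should yield S.fromList [23,42]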

You might be curious about the implementation of newCell, declareInsert and getCell, but I have to disappoint you: This is not the topic of this article. Instead, I want to discuss how to turn this IO-infested interface into the pure interface seen above.

Pure, but too strict

Obviously, we have to get rid of the IO somehow, and for that we have to use unsafePerformIO :: IO a -> a. This dangerous function creates a pure-looking value that, when used the first time, will run the IO action and turn into that action’s result.

So maybe we can simply write the following:

type RSet a = Cell a
insert x c2 = unsafePerformIO $ do
  c1 <- newCell
  declareInsert x c1 c2
  return c1
get c = unsafePerformIO $
  getCell c

Indeed, the types line up, but if we try to use that code, nothing will happen. Our insert is too strict to be used recursively: It requires the value of c2 (as it is passed to declareInsert, which we assume to be strict in its arguments) before it can return c1, so the recursive example at the top of this post will not make any progress.
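
Concretely, writing the example from the top against this too-strict insert gets stuck (a sketch; GHC typically detects the cycle and reports <<loop>>, or the program simply does not terminate):

stuck :: Set Int
stuck =
  let s1 = insert 23 s2
      s2 = insert 42 s1
  in get s1   -- no progress: evaluating s1 needs s2, which needs s1 again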

Pure, lazy, but forgetful

To work around this, maybe it suffices if we do not run declareInsert right away, but just remember that we have to do it eventually? So let’s introduce a new data type for RSet a that contains not just the cell (Cell a), but also an action that we still have to run before getting a value:

data RSet a = RSet (Cell a) (IO ())
insert x r2 = unsafePerformIO $ do
  c1 <- newCell
  let todo = do
        let (RSet c2 _) = r2
        declareInsert x c1 c2
  return (RSet c1 todo)
get (RSet c todo) = unsafePerformIO $ do
  todo
  getCell c

This is better: insert is now lazy in its arguments (for this it is crucial to pattern-match on RSet only inside the todo code, not in the pattern of insert!). This means that our recursive code above does not get stuck right away.

Pure, lazy, but runs in circles

But it is still pretty bad: Note that we do not run get s2 in the example above, so that cell’s todo, which would run declareInsert 42, will never run. This cannot work! We have to (eventually) run the declaration code from all involved cells before we can use getCell!

We can try to run the todo action of all the dependencies as part of a cell’s todo action:

data RSet a = RSet (Cell a) (IO ())
insert x r2 = unsafePerformIO $ do
  c1 <- newCell
  let todo = do
        let (RSet c2 todo2) = r2
        declareInsert x c1 c2
        todo2
  return (RSet c1 todo)
get (RSet c todo) = unsafePerformIO $ do
  todo
  getCell c

Now we certainly won’t forget to run the second cell’s todo action, so that is good. But that cell’s todo action will run the first cell’s todo action, which will again run the second cell’s, and so on.

Pure, lazy, terminating, but not thread safe

This is silly: We only need to (and should!) run that code once! So let’s keep track of whether we ran it already:

data RSet a = RSet (Cell a) (IO ()) (IORef Bool)
insert x r2 = unsafePerformIO $ do
  c1 <- newCell
  done <- newIORef False
  let todo = do
        is_done <- readIORef done
        unless is_done $ do
          writeIORef done True
          let (RSet c2 todo2 _) = r2
          declareInsert x c1 c2
          todo2
  return (RSet c1 todo done)
get (RSet c todo _) = unsafePerformIO $ do
  todo
  getCell c

Ah, much better: It works! Our call to get s1 will trigger the first cell’s todo action, which will mark it as done before calling the second cell’s todo action. When that in turn invokes the first cell’s todo action, it is already marked done and we break the cycle, and by the time we reach getCell, all relations have been correctly registered.

In a single-threaded world, this would be all good and fine, but we have to worry about multiple threads running get concurrently, on the same or on different cells.

In fact, because we use unsafePerformIO, we have to worry about this even when the program is not using threads.

And the above code has problems: Imagine a second call to get s1 while the first one has already marked the cell as done, but has not yet finished processing all the dependencies. That second call will reach getCell before all relations are registered, which is bad.

Recursive do-once IO actions

Making this thread-safe seems to be possible, but it would clutter both the code and this blog post. So let’s hide that problem behind a nice and clean interface. Maybe there will be a separate blog post about its implementation (let me know if you are curious), or you can inspect the code of the System.IO.RecThunk module yourself. The interface is simply

data Thunk
thunk :: IO [Thunk] -> IO Thunk
force :: Thunk -> IO ()

and the idea is that thunk act will defer the action act until the thunk is passed to force for the first time, and force will not return until the action has been performed (possibly waiting if another thread is performing it at the moment), and also until the actions of all the thunks returned by act have been performed, recursively, without running into cycles.
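
As a small usage sketch, assuming exactly the semantics just described (and using RecursiveDo to tie the knot between the two thunks): two mutually dependent thunks run their actions exactly once when either of them is forced.

{-# LANGUAGE RecursiveDo #-}

example :: IO ()
example = mdo
  t1 <- thunk $ do putStrLn "registering relation 1"; return [t2]
  t2 <- thunk $ do putStrLn "registering relation 2"; return [t1]
  force t1   -- runs both actions, each exactly once, despite the cycle
  force t2   -- nothing left to do; returns right away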

We can use this in our definition of RSet and get to the final, working solution:

data RSet a = RSet (Cell a) Thunk
insert x r2 = unsafePerformIO $ do
  c1 <- newCell
  t1 <- thunk $ do
    let (RSet c2 t2) = r2
    declareInsert x c1 c2
    return [t2]
  return (RSet c1 t1)
get (RSet c t) = unsafePerformIO $ do
  force t
  getCell c

This snippet captures the essential ideas behind rec-def:

  • Use laziness to allow recursive definitions to describe the propagator graph naturally
  • Use a form of “explicit thunk” to register the propagator graph relations at the right time (not too early/strict, not too late)
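
For completeness, with the insert and get defined in this section the example from the top can be written just as before, and should evaluate to the expected set (a sketch, assuming a sensible propagator implementation underneath):

example :: Set Int
example =
  let s1 = insert 23 s2
      s2 = insert 42 s1
  in get s1   -- S.fromList [23,42]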

And that’s all?

The actual implementation in rec-def has a few more moving parts.

In particular, it tries to support different value types (not just sets), possibly with different implementations, and even mixing them (e.g. in member :: Ord a => a -> RSet a -> RBool), so the generic code is in Data.Propagator.Purify, and supports various propagators underneath. The type RSet is then just a newtype around that, defined in Data.Recursive.Internal to maintain the safety of the abstraction.

I went back and forth on a few variants of the design here, including one where there was a generic R type constructor (R (Set a), R Bool etc.), but in the end the monomorphic interface seemed simpler.

Does it really work?

The big remaining question is certainly: Is this really safe and pure? Does it still behave like Haskell?

The answer to these questions certainly depends on the underlying propagator implementation. But it also depends on what we actually mean by “safe and pure”. For example, do we expect the Static Argument Transformation to be semantics-preserving? Or is it allowed to turn undefined values into defined ones (as it does here)?

I am still unsure myself, so I’ll defer this discussion to a separate blog post, after I have hopefully had good discussions about this here at ICFP 2022 in Ljubljana. If you are around and want to discuss, please hit me up!
