During the last week, I created ghc-heap-view, a library to investigate the actual memory representation of Haskell values. It is inspired by vacuum and the GHCi debugger, but goes beyond them by allowing the user to look inside thunks and functions and see what other values they refer to. Let me demonstrate it by running the included demo:
Here are a four different lists, where the first three are already evaluated. The first one, l, was defined as a top level constant as
> l = [1,2,3]
and is now found at 0x00000000006d1750/2 (where the /2 is the pointer tag information) and fully evaluated:
The output of ghc-heap-view is not really pretty yet; even the indentation in this blog post was added manually by me, so this really needs a pretty printer providing a nicer, possibly more compact representation, including something like what vacuum provides. Maybe vacuum can be ported to use this library, and also include the thunk’s and function’s references in the output. Maybe also the GHCi debugger can be extended to show more information about unevaluated expressions using this. Internally, the library is not very polished yet either. It only handles those closures types that I have seen so far, and is likely to break horribly if run in a threaded or debugging enabled runtime.
How it works
Obviously, this is not standard Haskell 98 code, but rather deep trickery involving the GHC API and some C code. Initially I tried to use the API that vacuum and the GHCi debugger rely on, which is an operation
which takes any Haskell value and returns the address to its info table, the pointers and the non-pointer-data in the closure. Unfortunately, it was not complete in that it was meant only for data closures and will for other closure types, e.g. thunks, return no data and no pointers (as can be seen in the code). So I implemented my own version of this operation:
foreign import prim "slurpClosurezh" slurpClosure# :: Any -> (# Addr#, ByteArray#, Array# b #)
where the returned ByteArray# contains the complete closure, including extra information fields such as the arity of a function. The Array# is again the list of pointers in the closure. At first glance, this is a duplication, as the pointers are of course also contained in the ByteArray#. But as soon as the GHC runtime reigns again, a garbage collector run can happen, the referenced values will move somewhere else, and the words that once were pointers in the ByteArray# become useless. But the corresponding entries in the Array# are updated by the garbage collector, as it knows that these are pointers, and not just words. This way, we get both a faithful copy of the closure on the heap and useful references to the contained data. Here is a demonstration of this effect:
The function rawHeapData is a thin wrapper around slurpClosure# which turns the primitive array in normal lists. Note that the second component of the triple is unchanged, but the third is updated by the garbage collector. Of course this means that the Show instance for the data type that ghc-heap-view uses to reference values is not referential transparent either.
The foreign function import above is of type “prim”, i.e. does not call a C function but rather a Cmm function. Cmm is a reduced C that GHC uses internally to compile the Haskell code to, and most primitive operations are implemented in this language – although I do quickly call regular C from my Cmm code to do the more complicated stuff, mainly figuring out what words of the closure are pointers.
The knowledgeable reader might notice that I am passing a boxed value of type Any to the foreign function. This is currently not possible with foreign prim functions, and to actually use that code, you need the patch in GHC ticket #5931. But you can use ghc-heap-view without that as well (and the Cabal package will by default use that path), using the following hack to obtain the pointer to a Haskell value on the Heap as an unboxed type that can pass to the primitive operation:
foreign import prim "slurpClosurezh" slurpClosure'# :: Word# -> (# Addr#, ByteArray#, Array# b #)
data Ptr' a = Ptr' a
aToWord# :: Any -> Word#
aToWord# a = case Ptr' a of mb@(Ptr' _) -> case unsafeCoerce# mb :: Word of W# addr -> addr
slurpClosure# :: Any -> (# Addr#, ByteArray#, Array# b #)
slurpClosure# a = slurpClosure'# (aToWord# a)
This works because a Word and a Ptr' have the same closure layout, only differing in the fact that one stores an a, and the other stores a Word#.
Once we obtained the raw representation of the closure, we do the parsing in Haskell. Using the info table and the raw closure, we have enough information to tell which words have to be replaced by the appropriate pointer (which might already have been updated by the garbage collector) in the pointers list.