Joachim Breitner

ghc-heap-view: Complete referential opacity

Published 2012-03-13 in sections English, Haskell.

During the last week, I created ghc-heap-view, a library to investigate the actual memory representation of Haskell values. It is inspired by vacuum and the GHCi debugger, but goes beyond them by allowing the user to look inside thunks and functions and see what other values they refer to. Let me demonstrate it by running the included demo:

ghc-heap-view-demo

Here are four different lists, where the first three are already evaluated.

The first one, l, was defined as a top-level constant

> l = [1,2,3]

and is now found at 0x00000000006d1750/2 (where the /2 is the pointer tag information) and fully evaluated:

    ConsClosure {info = StgInfoTable {ptrs = 2, nptrs = 0, tipe = CONSTR_STATIC, srtlen = 1}, 
                 ptrArgs = [0x00000000006d16e0/1,0x00000000006d1730/2],
                 dataArgs = [], descr = "ghc-prim:GHC.Types.:"}

The second one, l2, is locally defined

> let l2 = 4:l

and now found at 0x00007fdce19fe4b0/2. See how the cons-cell references l!

    ConsClosure {info = StgInfoTable {ptrs = 2, nptrs = 0, tipe = CONSTR_2_0, srtlen = 1},
                 ptrArgs = [0x00000000006dca50/1,0x00000000006d1750/2],
                 dataArgs = [],
                 descr = "ghc-prim:GHC.Types.:"}

And the binding

> args <- map length `fmap` getArgs

evaluates to the one and only global empty list at 0x00000000006db640/1:

    ConsClosure {info = StgInfoTable {ptrs = 0, nptrs = 0, tipe = CONSTR_NOCAF_STATIC, srtlen = 0},
                 ptrArgs = [],
                 dataArgs = [],
                 descr = "ghc-prim:GHC.Types.[]"}
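The /1 and /2 suffixes seen so far can be reproduced with a little bit arithmetic. The following is a minimal sketch, assuming a 64-bit platform where the low three bits of a closure pointer are free for tagging; the names tagMask, splitTag and constructorTag are made up for this illustration and are not part of ghc-heap-view.

```haskell
import Data.Bits (complement, (.&.))
import Data.Word (Word64)

-- Assumption: 64-bit platform, closures are 8-byte aligned, so the low
-- three bits of a pointer are available for the tag.
tagMask :: Word64
tagMask = 7

-- | Split a raw, tagged pointer into (untagged address, tag).
splitTag :: Word64 -> (Word64, Word64)
splitTag p = (p .&. complement tagMask, p .&. tagMask)

-- | Tag found on a pointer to an evaluated constructor, given the
-- 0-based position of that constructor in its data declaration.
-- This only holds for data types with few constructors.
constructorTag :: Int -> Word64
constructorTag conIndex = fromIntegral conIndex + 1
```

With these definitions, splitTag 0x6d1752 yields the address 0x6d1750 and the tag 2, which is how l is displayed above. For lists, [] is the first constructor and (:) the second, so constructorTag 0 is 1 and constructorTag 1 is 2, matching the /1 on the empty list and the /2 on the cons cells.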

And now we have, at 0x00007fdce19fe4c8, the concatenation of them, but unevaluated:

> let x = l ++ l2 ++ args

The thunk keeps references to l2 and args, but not to l, which lives at a static address (unless you are running this in GHCi):

    ThunkClosure {info = StgInfoTable {ptrs = 2, nptrs = 0, tipe = THUNK_2_0, srtlen = 1},
                  ptrArgs = [0x00007fdce19fe4b0/2,0x00000000006db640/1],
                  dataArgs = []}

Now to some more closure types. m and m' are locally bound values of the unboxed type Int#, with values 42 and 23 respectively.

> let f = \x n -> take (I# m + I# x) n ++ args
      t = f m' l2

So here is f (0x00007fdce1937d50/2), referencing its free variables args and 42:

    FunClosure {info = StgInfoTable {ptrs = 1, nptrs = 1, tipe = FUN_1_1, srtlen = 65553},
                ptrArgs = [0x00000000006db640/1],
                dataArgs = [42]}

And t is a thunk that applies f (also referenced here) to an unboxed value (23) and l2:

    ThunkClosure {info = StgInfoTable {ptrs = 2, nptrs = 1, tipe = THUNK, srtlen = 0},
                  ptrArgs = [0x00007fdce19fe4b0/2,0x00007fdce1937d50/2],
                  dataArgs = [23]}

Lastly, here is the standard example for self reference:

> let x = id (:) () x

This is what x (0x00007fdce1947940) looks like, at least without -O:

    ThunkClosure {info = StgInfoTable {ptrs = 0, nptrs = 0, tipe = THUNK, srtlen = 1},
                  ptrArgs = [],
                  dataArgs = []}

So it is unevaluated. Let us evaluate it using seq. Now we have, still at 0x00007fdce1947940:

    IndClosure {info = StgInfoTable {ptrs = 1, nptrs = 0, tipe = BLACKHOLE, srtlen = 0},
                indirectee = 0x00007fdce194cc98/2}

The thunk was replaced by an indirection. If we look at the target, 0x00007fdce194cc98/2, we see that it is a newly created cons-cell referencing the original location of x:

    ConsClosure {info = StgInfoTable {ptrs = 2, nptrs = 0, tipe = CONSTR_2_0, srtlen = 1},
                 ptrArgs = [0x00000000006db620/1,0x00007fdce1947940],
                 dataArgs = [],
                 descr = "ghc-prim:GHC.Types.:"}

After running the garbage collector (performGC), we find that the address of x is now 0x00007fdce19f30d0/2 and that the self-reference is without indirections:

    ConsClosure {info = StgInfoTable {ptrs = 2, nptrs = 0, tipe = CONSTR_2_0, srtlen = 1},
                 ptrArgs = [0x00000000006db620/1,0x00007fdce19f30d0/2],
                 dataArgs = [],
                 descr = "ghc-prim:GHC.Types.:"}

Future plans

The output of ghc-heap-view is not really pretty yet; even the indentation in this blog post was added manually by me, so this really needs a pretty printer providing a nicer, possibly more compact representation, including something like what vacuum provides. Maybe vacuum can be ported to use this library, and also include the references of thunks and functions in the output. Maybe the GHCi debugger can also be extended to show more information about unevaluated expressions using this. Internally, the library is not very polished yet either. It only handles those closure types that I have seen so far, and is likely to break horribly if run in a threaded or debugging-enabled runtime.

How it works

Obviously, this is not standard Haskell 98 code, but rather deep trickery involving the GHC API and some C code. Initially I tried to use the API that vacuum and the GHCi debugger rely on, which is an operation

unpackClosure# :: a -> (# Addr#, Array# b, ByteArray# #)

which takes any Haskell value and returns the address of its info table, the pointers, and the non-pointer data in the closure. Unfortunately, it is incomplete: it was meant only for data closures, and for other closure types, e.g. thunks, it returns no data and no pointers (as can be seen in the code). So I implemented my own version of this operation:

foreign import prim "slurpClosurezh" slurpClosure# :: Any -> (# Addr#, ByteArray#, Array# b #)

where the returned ByteArray# contains the complete closure, including extra information fields such as the arity of a function. The Array# is again the list of pointers in the closure. At first glance, this is a duplication, as the pointers are of course also contained in the ByteArray#. But as soon as the GHC runtime takes over again, a garbage collection can happen, the referenced values can move somewhere else, and the words that once were pointers in the ByteArray# become stale. The corresponding entries in the Array#, however, are updated by the garbage collector, as it knows that these are pointers and not just words. This way, we get both a faithful copy of the closure on the heap and useful references to the contained data. Here is a demonstration of this effect:

$ ghci -XMagicHash -package ghc-heap-view
GHCi, version 7.4.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
[..]
Loading package ghc-7.4.1 ... linking ... done.
Loading package ghc-heap-view-0.1 ... linking ... done.
Prelude> let {a = [1,2,3,4]; b = 5:a}
Prelude> :m + GHC.HeapView 
Prelude GHC.HeapView> rawHeapData <- getClosureRaw b
Prelude GHC.HeapView> rawHeapData 
(0x000000004080d658,[1082185320,140040739366568,140040739365928],[0x00007f5dc68626a8,0x00007f5dc6862428])
Prelude GHC.HeapView> System.Mem.performGC
Prelude GHC.HeapView> rawHeapData 
(0x000000004080d658,[1082185320,140040739366568,140040739365928],[0x00007f5dc41b3ad8,0x00007f5dc41b3b28]) 

The function getClosureRaw is a thin wrapper around slurpClosure# which turns the primitive arrays into normal lists. Note that the second component of the triple is unchanged, but the third is updated by the garbage collector. Of course this means that the Show instance for the data type that ghc-heap-view uses to reference values is not referentially transparent either.
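To make the difference between the two arrays concrete, here is a toy model of the GHCi session above in plain Haskell. It is only a sketch: RawClosure and moveObjects are hypothetical names, addresses are modelled as plain Word64 values, and the "garbage collector" is just a lookup in a forwarding list. The numbers are the ones printed in the session.

```haskell
import Data.Maybe (fromMaybe)
import Data.Word (Word64)

-- Hypothetical model of the triple shown in the GHCi session.
data RawClosure = RawClosure
  { infoPtr   :: Word64    -- address of the info table
  , rawWords  :: [Word64]  -- plain copy of the closure (the ByteArray#)
  , ptrFields :: [Word64]  -- pointers known to the collector (the Array#)
  } deriving (Eq, Show)

-- | Toy moving collector: only fields known to be pointers are forwarded;
-- the raw words are treated as opaque data and left alone.
moveObjects :: [(Word64, Word64)] -> RawClosure -> RawClosure
moveObjects forwarding c = c { ptrFields = map forward (ptrFields c) }
  where
    forward a = fromMaybe a (lookup a forwarding)

-- The state before performGC, as printed above.
before :: RawClosure
before = RawClosure 0x4080d658
  [1082185320, 140040739366568, 140040739365928]
  [0x00007f5dc68626a8, 0x00007f5dc6862428]

-- The state after performGC, with the forwarding read off the output.
after :: RawClosure
after = moveObjects
  [ (0x00007f5dc68626a8, 0x00007f5dc41b3ad8)
  , (0x00007f5dc6862428, 0x00007f5dc41b3b28)
  ] before
```

In this model, the last two entries of rawWords before are exactly the two pre-GC addresses written in decimal, and they stay that way in after, while ptrFields after holds the new locations.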

The foreign function import above is of type “prim”, i.e. it does not call a C function but rather a Cmm function. Cmm is a reduced, C-like language that GHC uses internally as a compilation target for Haskell code, and most primitive operations are implemented in this language. That said, I do quickly call regular C from my Cmm code to do the more complicated stuff, mainly figuring out which words of the closure are pointers.

The knowledgeable reader might notice that I am passing a boxed value of type Any to the foreign function. This is currently not possible with foreign prim functions, and to actually use that code, you need the patch in GHC ticket #5931. But you can use ghc-heap-view without that as well (and the Cabal package will by default use that path), using the following hack to obtain the pointer to a Haskell value on the heap as an unboxed type that can be passed to the primitive operation:

foreign import prim "slurpClosurezh" slurpClosure'# :: Word#  -> (# Addr#, ByteArray#, Array# b #)
data Ptr' a = Ptr' a
aToWord# :: Any -> Word#
aToWord# a = case Ptr' a of mb@(Ptr' _) -> case unsafeCoerce# mb :: Word of W# addr -> addr
slurpClosure# :: Any -> (# Addr#, ByteArray#, Array# b #)
slurpClosure# a = slurpClosure'# (aToWord# a)

This works because a Word and a Ptr' have the same closure layout, only differing in the fact that one stores an a, and the other stores a Word#.

Once we have obtained the raw representation of the closure, we do the parsing in Haskell. Using the info table and the raw closure, we have enough information to tell which words have to be replaced by the appropriate pointer (which might already have been updated by the garbage collector) in the pointers list.
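The parsing step can be sketched roughly as follows. This is not the library's actual code: InfoTable, Parsed and parseClosure are simplified, hypothetical stand-ins, and the sketch assumes the standard layout in which a header is followed first by all pointer words and then by all non-pointer words. The header size is a parameter because it differs between closure kinds.

```haskell
import Data.Word (Word64)

-- Hypothetical, simplified stand-in for the relevant info table fields.
data InfoTable = InfoTable { ptrs :: Int, nptrs :: Int } deriving Show

-- Hypothetical result type: pointer fields and non-pointer fields.
data Parsed ref = Parsed
  { ptrArgs  :: [ref]     -- taken from the GC-tracked pointer list
  , dataArgs :: [Word64]  -- copied from the raw closure
  } deriving (Eq, Show)

-- | Split a raw closure into pointer and non-pointer payload.
-- Assumes 'hdr' header words, then 'ptrs' pointer words, then 'nptrs'
-- data words. The stale pointer words in the raw copy are skipped and
-- the live references from 'refs' are used in their place.
parseClosure :: Int -> InfoTable -> [Word64] -> [ref] -> Maybe (Parsed ref)
parseClosure hdr itbl raw refs
  | length payload < ptrs itbl + nptrs itbl = Nothing
  | length refs < ptrs itbl                 = Nothing
  | otherwise = Just Parsed
      { ptrArgs  = take (ptrs itbl) refs
      , dataArgs = take (nptrs itbl) (drop (ptrs itbl) payload)
      }
  where
    payload = drop hdr raw
```

For a closure shaped like f from the demo (ptrs = 1, nptrs = 1) with one header word, parseClosure 1 (InfoTable 1 1) [0, 0, 42] ["args"] gives Just (Parsed ["args"] [42]); the zeros are placeholder words, not real heap contents.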

This work was supported by a scholarship from the Deutsche Telekom Stiftung.
