## Left recursive parser combinators via sharing

Published 2023-09-10 in sections English, Haskell.

At this year’s ICFP in Seattle I gave a talk about my rec-def Haskell library, which I have blogged about before here. While my functional pearl paper focuses on a concrete use-case and the tricks of the implementation, in my talk I put the emphasis on the high-level idea: it beholds of a declarative lazy functional like Haskell that recursive equations just work whenever they describe a (unique) solution. Like in the paper, I used equations between sets as the running example, and only conjectured that it should also work for other domains, in particular parser combinators.

Naturally, someone called my bluff and asked if I actually tried it. I had not, but I should have, because it works nicely and is actually more straight-forward than with sets. I wrote up a prototype and showed it off a few days later as a lightning talk at Haskell Symposium; here is the write up that goes along with that.

### Parser combinators

Parser combinators are libraries that provide little functions (combinators) that you compose to define your parser directly in your programming language, as opposed to using external tools that read some grammar description and generate parser code, and are quite popular in Haskell (e.g. parsec, attoparsec, megaparsec).

Let us define a little parser that recognizes sequences of `a`s:

``````ghci> let aaa = tok 'a' *> aaa <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing``````

### Left-recursion

This works nicely, but just because we were lucky: We wrote the parser to recurse on the right (of the `*>`), and this happens to work. If we put the recursive call first, it doesn’t anymore:

``````ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
^CInterrupted.``````

This is a well-known problem (see for example Nicolas Wu’s overview paper), all the common parser combinator libraries cannot handle it and the usual advise is to refactor your grammar to avoid left recursion.

But there are some libraries that can handle left recursion, at least with a little help from the programmer. I found two variations:

I took the module from the Agda source and simplified a bit for the purposes of this demonstration (`Parser.hs`). Indeed, I can make the left-recursive grammar work:

``````ghci> let aaa = memoise ":-)" \$ aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing``````

It does not matter what I pass to `memoise`, as long as I do not pass the same key when memoising a different parser.

For reference, an excerpt of the the API of `Parser`:

``````data Parser k tok a -- k is type of keys, tok type of tokens (e.g. Char)
instance Functor (Parser k tok)
instance Applicative (Parser k tok)
instance Alternative (Parser k tok)
parse :: Parser k tok a -> [tok] -> Maybe a
sat :: (tok -> Bool) -> Parser k tok tok
tok :: Eq tok => tok -> Parser k tok tok
memoise :: Ord k => k -> Parser k tok a -> Parser k tok a``````

### Left-recursion through sharing

To follow the agenda set out in my talk, I now want to wrap that parser in a way that relieves me from having to insert the calls to `memoise`. To start, I import that parser qualified, define a newtype around it, and start lifting some of the functions:

``````import qualified Parser as P

newtype Parser tok a = MkP { unP :: P.Parser Unique tok a }

parse :: Parser tok a -> [tok] -> Maybe a
parses (MkP p) = P.parse p

sat :: Typeable tok => (tok -> Bool) -> Parser tok tok
sat p = MkP (P.sat p)

tok :: Eq tok => tok -> Parser tok tok
tok t = MkP (P.tok t)``````

So far, nothing interesting had to happen, because so far I cannot build recursive parsers. The first interesting combinator that allows me to do that is `<*>` from the `Applicative` class, so I should use `memoise` there. The question is: Where does the unique key come from?

### Proprioception

As with the rec-def library, pure code won’t do, and I have to get my hands dirty: I really want a fresh unique label out of thin air. To that end, I define the following combinator, with naming aided by Richard Eisenberg:

``````propriocept :: (Unique -> a) -> a
propriocept f = unsafePerformIO \$ f <\$> newUnique``````

A thunk defined with `propriocept` will know about it’s own identity, and will be able to tell itself apart from other such thunks. This gives us a form of observable sharing, precisely what we need. But before we return to our parser combinators, let us briefly explore this combinator.

Using `propriocept` I can define an operation `cons :: [Int] -> [Int]` that records (the hash of) this `Unique` in the list:

``````ghci> let cons xs = propriocept (\x -> hashUnique x : xs)
ghci> :t cons
cons :: [Int] -> [Int]``````

This lets us see the identity of a list cell, that is, of the concrete object in memory.

Naturally, if we construct a finite list, each list cell is different:

``````ghci> cons (cons (cons []))
[1,2,3]``````

And if we do that again, we see that fresh list cells are allocated:

``````ghci> cons (cons (cons []))
[4,5,6]``````

We can create an infinite list; if we do it without sharing, every cell is separate:

``````ghci> take 20 (acyclic 0)
[7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]``````

but if we tie the knot using sharing, all the cells in the list are actually the same:

``````ghci> let cyclic = cons cyclic
ghci> take 20 cyclic
[27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]``````

We can achieve the same using `fix` from `Data.Function`:

``````ghci> import Data.Function
ghci> take 20 (fix cons)
[28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28]``````

I explore these heap structures more visually in a series of screencasts.

So with `propriocept` we can distinguish different heap objects, and also recognize when we come across the same heap object again.

### Left-recursion through sharing (cont.)

With that we return to our parser. We define a smart constructor for the new `Parser` that passes the unique from `propriocept` to the underlying parser’s `memoise` function:

``````withMemo :: P.Parser Unique tok a -> Parser tok a
withMemo p = propriocept \$ \u -> MkP \$ P.memoise u p``````

If we now use this in the definition of all possibly recursive parsers, then the necessary calls to `memoise` will be in place:

``````instance Functor (Parser tok) where
fmap f p = withMemo (fmap f (unP p))

instance Applicative (Parser tok) where
pure x = MkP (pure x)
p1 <*> p2 = withMemo (unP p1 <*> unP p2)

instance Alternative (Parser tok) where
empty = MkP empty
p1 <|> p2 = withMemo (unP p1 <|> unP p2)

return = pure
p1 >>= f = withMemo \$ unP p1 >>= unP . f``````

And indeed, it works (see `RParser.hs` for the full code):

``````ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing``````

### A larger example

Let us try this on a larger example, and parse (simple) BNF grammars. Here is a data type describing them

``````type Ident = String
type RuleRhs = [Seq]
type Seq = [Atom]
data Atom = Lit String | NonTerm Ident deriving Show
type Rule = (Ident, RuleRhs)
type BNF = [Rule]``````

For the concrete syntax, I’d like to be able to parse something like

``````numExp :: String
numExp = unlines
[ "term   := sum;"
, "pdigit := '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
, "digit  := '0' | pdigit;"
, "pnum   := pdigit | pnum digit;"
, "num    := '0' | pnum;"
, "prod   := atom | atom '*' prod;"
, "sum    := prod | prod '+' sum;"
, "atom   := num | '(' term ')';"
]``````

so here is a possible parser; mostly straight-forward use of parser combinator:

``````type P = Parser Char

snoc :: [a] -> a -> [a]
snoc xs x = xs ++ [x]

l :: P a -> P a
l p = p <|> l p <* sat isSpace
quote :: P Char
quote = tok '\''
quoted :: P a -> P a
quoted p = quote *> p <* quote
str :: P String
str = some (sat (not . (== '\'')))
ident :: P Ident
ident = some (sat (\c -> isAlphaNum c && isAscii c))
atom :: P Atom
atom = Lit     <\$> l (quoted str)
<|> NonTerm <\$> l ident
eps :: P ()
eps = void \$ l (tok 'ε')
sep :: P ()
sep = void \$ some (sat isSpace)
sq :: P Seq
sq = []   <\$ eps
<|> snoc <\$> sq <* sep <*> atom
<|> pure <\$> atom
ruleRhs :: P RuleRhs
ruleRhs = pure <\$> sq
<|> snoc <\$> ruleRhs <* l (tok '|') <*> sq
rule :: P Rule
rule = (,) <\$> l ident <* l (tok ':' *> tok '=') <*> ruleRhs <* l (tok ';')
bnf :: P BNF
bnf = pure <\$> rule
<|> snoc <\$> bnf <*> rule``````

I somewhat sillily used `snoc` rather than `(:)` to build my lists, just so that I can show off all the left-recursion in this grammar.

### Sharing is tricky

Let’s try it:

``````ghci> parse bnf numExp
^CInterrupted.``````

What a pity, it does not work! What went wrong?

The underlying library can handle left-recursion if it can recognize it by seeing a `memoise` label passed again. This works fine in all the places where we re-use a parser definition (e.g. in `bnf`), but it really requires that values are shared!

If we look carefully at our definition of `l` (which parses a lexeme, i.e. something possibly followed by whitespace), then it recurses via a fresh function call, and the program will keep expanding the definition – just like the `acyclic` above:

``````l :: P a -> P a
l p = p <|> l p <* sat isSpace``````

The fix (sic!) is to make sure that the recursive call is using the parser we are currently defining, which we can easily do with a local definition:

``````l :: P a -> P a
l p = p'
where p' = p <|> p' <* sat isSpace``````

With this little fix, the parser can parse the example grammar:

``````ghci> parse bnf numExp
Just [("term",[[NonTerm "sum"]]),("pdigit",[[Lit "1"],…``````

### Going meta

The main demonstration is over, but since we now have already have a parser for grammar descriptions at hand, let’s go a bit further and dynamically construct a parser from such a description. The parser should only accept strings according to that grammar, and return a parse tree annotated with the non-terminals used:

``````interp :: BNF -> P String
interp bnf = parsers M.! start
where
start :: Ident

parsers :: M.Map Ident (P String)
parsers = M.fromList [ (i, parseRule i rhs) | (i, rhs) <- bnf ]

parseRule :: Ident -> RuleRhs -> P String
parseRule ident rhs = trace <\$> asum (map parseSeq rhs)
where trace s = ident ++ "(" ++ s ++ ")"

parseSeq :: Seq -> P String
parseSeq = fmap concat . traverse parseAtom

parseAtom :: Atom -> P String
parseAtom (Lit s) = traverse tok s
parseAtom (NonTerm i) = parsers M.! i``````

Let’s see it in action (full code in `BNFEx.hs`):

``````ghci> Just bnfp = parse bnf numExp
ghci> :t bnfp
bnfp :: BNF
ghci> parse (inter
interact  interp
ghci> parse (interp bnfp) "12+3*4"
Just "term(sum(prod(atom(num(pnum(pnum(pdigit(1))digit(pdigit(2))))))+sum(prod(atom(num(pnum(pdigit(3))))*prod(atom(num(pnum(pdigit(4)))))))))"
ghci> parse (interp bnfp) "12+3*4+"
Nothing``````

It is worth noting that the `numExp` grammar is also left-recursive, so implementing `interp` with a conventional parser combinator library would not have worked. But thanks to our `propriocept` tick, it does! Again, the sharing is important; in the code above it is the map `parsers` that is defined in terms of itself, and will ensure that the left-recursive productions will work.

### Closing thoughts

I am using `unsafePerformIO`, so I need to justify its use. Clearly, `propriocept` is not a pure function, and it’s type is a lie. In general, using it will break the nice equational properties of Haskell, as we have seen in our experiments with `cons`.

In the case of our parser library, however, we use it in specific ways, namely to feed a fresh name to `memoise`. Assuming the underlying parser library’s behavior does not observably depend on where and with which key `memoise` is used, this results in a properly pure interface, and all is well again. (NB: I did not investigate if this assumption actually holds for the parser library used here, or if it may for example affect the order of parse trees returned.)

I also expect that this implementation, which will memoise every parser involved, will be rather slow. It seems plausible to analyze the graph structure and figure out which `memoise` calls are actually needed to break left-recursion (at least if we drop the `Monad` instance or always memoise `>>=`).

If you liked this post, you might enjoy reading the paper about rec-def, watch one of my talks about it (MuniHac, BOBKonf, ICFP23; the presentation evolved over time), or if you just want to see more about how things are laid out on Haskell’s heap, go to my screen casts exploring the Haskell heap.