Wednesday, July 25, 2018

A Typed, Algebraic Approach to Parsing

(Note: If you're on the POPL PC, please don't read this post until after your reviewing is done.)

It's been over 20 years since Graham Hutton and Erik Meijer wrote their wonderful JFP paper on monadic combinator parsing. Do we need another paper about it?

Jeremy Yallop and I think the answer is yes! We have a new draft paper out, A Typed, Algebraic Approach to Parsing.

The fundamental issue is that parser combinators are very pretty, but suffer from two serious problems.

  1. First, parser combinator librariess tend to be implementation-defined.

    Context-free grammars have the very nice property that it's very easy to exactly define the set of strings that a parser for that grammar should match. This, in turn, makes it easy to prove that there are many grammar transformations which do not change that set. For example, if the BNF production A recognizes a language, and B recognizes a different language, then the production A | B will the union of the two languages recognized by A and B. As a result, swapping the order of an alternative from A | B to B | A will not change the language recognized.

    However, most implementations of parser combinators do not satisfy these kinds of properties! Swapping the order of the alternatives usually can change the parse tree returned from a parser built froom parser combinators. This mismatch between the high-level explanation and what the code actually does makes debugging and refactoring harder, and also makes it hard to understand what a combinator parser does without knowing the actual implementation of the library.

  2. Next, combinator parsing tends to be inefficient.

    The standard parser combinator algorithm is usually some variant of backtracking recursive descent, and that means that parse times can become exponential in the length of the input. Basically, in the worst case you have to save a backtrack point after seeing each character, which gives you exponentially many branches you may have to explore.

    There are two general classes of reaction to this problem. First, the combinator library can explicitly expose backtracking primitives to the user of the library, so that programmers can control when backtracking can occur. This does resolve the problem, but a drastic price: there is basically no specification of the accepted language beyond the actual code of the parser implementation.

    The second (and somhewat better) reaction is to switch to another parsing technology. The most notable such alternative is PEG (aka packrat) parsing. Now, the good thing about packrat parsing is that actually does have predictable linear-time parsing, but that comes at a pretty high cost: the memory usage of PEG parsers is linear in the input, and choice is not commutative.

Now, this all sounds like a pretty strong argument for sticking with traditional parser generators, whether LR (eg, yacc) or LL (eg, ANTLR). A BNF spec has a clear declarative reading, and the generated parsers are efficient, so what's not to like? Two things:

  1. Honestly, the BNF notation for context-free grammars has its own weaknesses. For example, BNF has no notion of binding or variable, and this makes it quite hard to build grammars in a modular fashion. For example, it's a good idea for all the sequence constructions in a language to uniformly treat delimiters as separators or terminators, but with BNF you have to manually define each sequence type, making it easy to screw this up. If you had the ability to build abstractions, you define a single sequence construction primitive which would let you make the choice once and for all. This is easy to do with parser combinators, but missing from most parser generators (Menhir being a notable exception).

  2. Furthermore, a problem common to all table-generating parser generators is that their error reporting is horrible -- when there is a bug in the grammar taking you out of the handled language class, the parser generator vomits its internal data structures over the programmer, who has to then pick through the chunks to figure out what the problems were. It would be much better if errors could be reported in terms of the actual grammar the programmer wrote!

So Jeremy and I basically set out to fix all these problems. Our paper proceeds via the following steps:

  1. First, we found a better notation for context-free languages than BNF. Basically, if you take regular expressions and add a least fixed operator to them, then you can recognize exactly the context free languages. This is not a new observation; it's been known since at least the early 1970s. But the payoff of replacing nonterminals with fixed point operators is that there is now a very clear story on variable scope and binding.

  2. As most of you know, we don't actually want to allow arbitrary context-free languages, since some of them are inherently ambiguous. So the next thing we do is we define a type system for our context-free expressions, which can statically identify unambiguous grammars which can be parsed efficiently (ie, by non-backtracking recursive descent with a single token of lookahead).

    The benefit of this is that type checking is local and syntactic, and so all our errors can be reported in terms of the grammar that the programmer wrote.

  3. Next, we define a family of parser combinators which operate on typed grammars. Our parser combinators have a very simple implementation story -- there's no backtracking and no fancy lookahead, so the type of parsers is as simple as can be. Moreoever, we can exploit the type information when parsing alternatives, by using a bit of lookahead to decide which branch to take.

  4. Our parser combinators have very predictable performance, but are still fairly slow, due to all the overhead of indirecting through function pointers (due to all the higher-order functions involved). So we use staged programming to eliminate this overhead. Basically, staging lets us eliminate all the overhead, leading to generated code which looks more or less exactly like the kinds of hand-coded recursive descent parsers that experts write by hand.

The resulting parsers have excellent performance -- in our benchmarks, we outperform ocamlyacc-generated code.

So we get the abstraction benefits of parser combinators, good static error reporting about ill-formed grammars, and excellent performance to boot!


  1. This is very interesting, when I first encountered parser combinators I was sort of hoping they do some optimizations internally, was both pleasantly surprised at their simplicity and disappointed that one had to give up the efficiency guarantees of traditional lex/yacc. I think this paper addresses those concerns.
    I'm not qualified to comment on the theoretical aspects of the paper, got just 2 remarks:

    * I found it somehow easier to read g* ≜ μx. ε ∨ g · x by substituting in my mind μx. with `let rec x = `

    * have you considered generating code for a different target language from context-free expressions? e.g. webassembly, XDP eBPF, and P4 language

  2. This comment has been removed by the author.

  3. Some minor comments:

    Page 1:37, line 30: "We can a term g /\ g'" ~> "We can add a term g /\ g'"
    Page 7:1, line 4: "they must meet the precondition conditions" ~> "they must meet the preconditions"

    Page 7:1, line 6: "which holds when \tau.Follow and \tau'.First are disjoint, and \tau.Null is false", however, Lemma 2.2 (Unique decomposition) states in (2):

    "Suppose First(L)/\FollowLast(M)=0 and ~Null(L)" as precondition

    So I take \tau to correspond to L, and \tau' to correspond to M, but this does not make sense to me. ~> "which holds when \tau.First and \tau'.Follow are disjoint, and \tau.Null is false"

    Kind regards and thanks for this article.