Friday, July 21, 2023

Linear-time parser combinators

My birthday just passed, and to relax I wrote a parser combinator library.

Over the last few years, I have worked quite a bit with Ningning Xie and Jeremy Yallop on parser combinators, which has led to a family of parser combinators which have optimal linear-time performance in theory, and which are many times faster than lex+yacc in practice.

But these use advanced multistage programming techniques, and I wondered how well we can do without staging. So in this blog post I will describe the implementation of a non-staged, but still efficient, parser combinator library in plain OCaml. This will be a lot slower than flap, or even menhir/ocamlyacc, but will still have pretty good linear-time performance.

The basic trick we developed is to check that the grammar that a user defines with our parser combinator library falls in (morally) the class of LL(1) grammars. The LL(1) grammars have the property that they can be parsed with a non-backtracking recursive descent algorithm, in linear time with only a single token of lookahead.

So how do we check this property? Well, first, let's think about how we can form grammars from smaller ones. In a BNF grammar, we define a set

E ::= num 
   | '(' E ')' 
   | E '+' E 
   | E '*' E 
   | E '-' E 

Here. we can see that there are two main forms of building grammatical expressions.

  1. We can build grammatical expressions by concatenation -- the clause E '+' E means that an expression can be a subexpression E, followed by the plus sign +, finally followed by another subexpression E.

  2. We can build grammatical expression by alternation -- the clauses separated by a vertical bar | are alternatives, and the whole string matches if it matches any of the alternative expressions. So an expression can be either a number num, a parenthesized expression ( E ), or a binary arithmetic operator.

Both of these forms of composition can introduce non-determinism. For example, if you see the expression 3 + 4, a recursive descent parser cannot tell that it should take the addition branch (E '+' E) rather than the multiplication branch (E '*' E) without doing more than one token of lookahead (the first token is a 3, which could be followed by anything).

Sequencing can also introduce nondeterminism. For example, the expression 3 + 4 + 5 can be parsed as either Add(3, Add(4, 5)) or as Add(Add(3, 4), 5) -- this grammar does not tell us whether we should stop parsing the first subexpression at 3, or continue to 3 + 4.

So this grammar is not suitable for being parsed by recursive descent. To identify the grammars that can be so parsed, we need to do some static analysis of the grammar -- in fact, a type system. The components of the static analysis we use are intended to handle the two kinds of nondeterminism above.

To handle the nondeterminism introduced by alternation, we will compute first sets for our grammars. That is, we will work out which characters can occur as the first character of a string recognised by the grammar. So if our grammar recognises the set of strings {foo, bar, barton, foobar, quux}, then its first set will be {f, b, q} -- these are the first characters of the strings in the language.

We can use the first sets to rule out alternation nondeterminism, by requiring that each alternative in a grammar rule have a disjoint first set.

Handling sequential nondeterminism is a bit trickier. To do this, we have find pairs of words in the language, in which the first is a prefix of the second (for example, foo and foobar). Then, we return the first character of each suffix (for foo and foobar, we return b). This set of characters is called the follow set of the language -- it is the set of characters which can follow a word in the language, which might be characters of a suffix of that word. So in the language above, the follow set is {b, t}, due to foo/foobar and bar/barton. Because foo can be followed by the string bar to produce the word foobar, b is one of the characters in the follow set, since b is the first character of bar. Similarly, because bar can followed by ton to produce the word barton, t (as the first character of ton) is one of the characters in the follow set.

Now, when we concatenate two grammatical expressions A B, we check to see if the follow set of A and the first set of B are disjoint. This way, after parsing an A-word, it is unambiguous whether to keep parsing an A or switch to parsing a B because the first character will tell us to either continue parsing A (if it is in the follow set of A) or switch to B (if it is in the first set of B).

(ASIDE: Technically, this is not the FOLLOW set as defined in (say) the Dragon book -- that is a property of grammars, not languages. The definition I use here is a superior one (sometimes called the follow-last set), which was invented by Derick Wood and Anne Brüggemann-Klein in their 1992 paper Deterministic Regular Languages. What makes it better is that it is a property of languages, not grammars, and is hence much better-behaved mathematically.)

Finally, we also track whether the language recognises the empty string or not. If both A and B can match the empty string, then A | B is ambiguous – an empty string can match either one.

These three facts – the first set, the follow set, and nullability – are the static facts we will build for every parser combinator, and we will use it to (a) optimise the implementation of conforming combinators, and (b) reject combinators which don't satisfy the restrictions. We could also statically track whether the grammar is left-recursive or not – see my 2019 paper with Jeremy Yallop for how – but for this blog post I'll skip that.

Now, let's talk about the implementation. Since we want to calculate a first set and a follow set for each parser, we need a datatype to represent sets of characters. The Set module in the OCaml standard library is a perfectly serviceable balanced binary tree, but because we want to work with sets of characters, there is a much more compact and faster representation: a bit set.

Since an 8-bit character only has 256 possible states, we can represent a set of characters with 256 bits (32 bytes), with each bit set to 0 if the character is absent from the set, and set to 1 if it is present. This does presume we look at the input at a time -- if we wanted to handle Unicode, likely the best approach would be to work with a canonicalised UTF-8 representation. I'll skip over that for now, since we can deal solely with ASCII for the moment.

This turns out to be pretty much the only optimisation needed to get decent performance -- the library below runs between one-half and two-thirds the speed of lex+menhir on a simple s-expression parsing benchmark.

Unlike many other parser combinator libraries, it does no backtracking (ie, runs in time linear in the input size). Furthermore, it also has a clean declarative reading of the language recognized -- ie, A | B and B | A recognise the same language, unlike in (eg) PEGs, or parser combinators with committed-choice operators. (If you are a parsing nerd, basically we are restricting to LL(1) grammars, modulo the difference between FOLLOW and FOLLOWLAST.)

Here is a link to a gist with the implementation. It's really short – only 300 lines of code for a full combinator library, with about another 150 lines to demonstrate and benchmark it.

Implementation

Character Sets

Below, I give the signature and implementation of an OCaml bitset library.

Character Set Interface

module C : sig
  type t 
  val make : (char -> bool) -> t 
  val empty : t 
  val one : char -> t 
  val union : t -> t -> t 
  val inter : t -> t -> t 
  val top : t 
  val negate : t -> t 

  val mem : char -> t -> bool 
  val equal : t -> t -> bool
  val is_empty : t -> bool 
  val disjoint : t -> t -> bool 
  val fold : (char -> 'a -> 'a) -> 'a -> t -> 'a 
  val of_string : string -> t 
end = 

This interface has an abstract type t of bitsets, and a function make which converts a predicate on characters into a bitset, one which constructs a singleton set, along with the rest of the standard operations of boolean algebra (such as intersection, union, and complement). We also have a membership test mem, as well as operations to test whether two sets are equal, or whether they are empty (is_empty) or nonoverlapping (disjoint).

Character Set Implementation

The implementation of this is quite fun -- I don't have many chances to do bit-fiddling, so I cherish the ones that arise. I represent sets using OCaml's bytes type, which are used, reasonably enough, to represent consecutive sequences of bytes.

struct
  type t = bytes 

  let make f = 
    let open Int in
    let open Char in 
    Bytes.init 32 (fun i -> 
        let b0 = shift_left (Bool.to_int (f (chr (i * 8 + 0)))) 0 in 
        let b1 = shift_left (Bool.to_int (f (chr (i * 8 + 1)))) 1 in 
        let b2 = shift_left (Bool.to_int (f (chr (i * 8 + 2)))) 2 in 
        let b3 = shift_left (Bool.to_int (f (chr (i * 8 + 3)))) 3 in 
        let b4 = shift_left (Bool.to_int (f (chr (i * 8 + 4)))) 4 in 
        let b5 = shift_left (Bool.to_int (f (chr (i * 8 + 5)))) 5 in 
        let b6 = shift_left (Bool.to_int (f (chr (i * 8 + 6)))) 6 in 
        let b7 = shift_left (Bool.to_int (f (chr (i * 8 + 7)))) 7 in 
        let (||) = logor in
        Char.chr (b7 || b6 || b5 || b4 || b3 || b2 || b1 || b0))

The constructor make takes a predicate f : char -> bool, and then builds a bitset which contains exactly those elements which make f true. To do this, we build a 32-byte array (since 8 * 32 = 256). The Bytes.init function initialises a byte at a time, so the initaliser has to make 8 calls to f (at indices i*8 + 0 to i*8 + 7) to figure out what each bit of the byte is. Then, we convert each boolean to an integer, and do a left shift to create an integer where the correct position is set to 1, and all the other bits are 0. Finally we take all these integers and OR them together, to get the byte that we want.

  let mem c s = 
    let b = (Char.code c) / 8 in
    let i = (Char.code c) mod 8 in
    let w = Char.code (Bytes.get s b) in 
    Int.logand (Int.shift_left 1 i) w > 0 

The mem function also does a bit of bit-fiddling. We look up the correct byte and bit-offset corresponding to the input character, and check to see if that bit is set in the corresponding byte of the bitset.

The rest of the operations can be defined in terms of make and mem -- they are the only ones which need any bit-twiddling. ~~ {.ocaml} let empty = make (fun c -> false) let top = make (fun c -> true) let one c = make (fun c' -> c = c') let union s1 s2 = make (fun c -> mem c s1 || mem c s2) let inter s1 s2 = make (fun c -> mem c s1 && mem c s2) let negate s = make (fun c -> not (mem c s))

let equal s1 s2 = let rec loop i acc = if i = 32 then acc else loop (i+1) (acc && (Bytes.get s1 i = Bytes.get s2 i)) in loop 0 true

let is_empty s = equal s empty

let disjoint s1 s2 = is_empty (inter s1 s2) ~~

We do some trivial bit-twiddling anyway with the equal function: rather than comparing character by character, we just check to see if the two bitsets are equal with a byte-by-byte comparison.

Emptiness checking is_empty and disjointness checking disjoint can now be defined in terms of the Boolean operations and equality. We can also define a fold function for looping over a bitset, and we also have a of_string function to build sets from strings.

  let fold f init s = 
    let rec loop i acc = 
      if i > 255 then 
        acc
      else 
        let c = Char.chr i in 
        if mem c s then 
          loop (i+1) (f c acc)
        else
          loop (i+1) acc 
    in
    loop 0 init         

  let of_string str = 
    let p c = String.contains str c in 
    make p 
end

Grammar Types

Grammar Type Interface

This is enough to let us implement the Tp module for parser types.

module Tp :   sig
    type t = { null : bool; first : C.t; follow : C.t; }

    exception TypeError of string
    val char : char -> t
    val eps : t
    val seq : t -> t -> t
    val charset : C.t -> t
    val string : string -> t
    val alt : t -> t -> t
    val bot : t
    val equal : t -> t -> bool
    val fix : (t -> t) -> t
    val print : Format.formatter -> t -> unit
  end
end = 

A type is a triple of a nullability predicate, a first set, and a follow set. This API has a bunch of operators for building semantic types:

  • char c yields the semantic type for the singleton language containing the character c,
  • eps is the type of the language containing only the empty string,
  • seq t1 t2 takes two types t1 and t2, and returns the type corresponding to the sequencing of two languages with those two types
  • bot is the type of the empty language,
  • alt t1 t2 takes two types t1 and t2, and returns the type corresponding to the union of the languages typed by t1 and t2
  • fix f takes a function f : t -> t, and returns the smallest type t such that f(t) = t.
  • charset cs returns the type of the language recognising exactly the single-character strings in cs.

Some of these combinations are impossible due to our ambiguity rules. For example, we don't want to admit the union of two languages both containing the empty string, because then it would be ambiguous whether to take the left or right branch.

So we also have a TypeError exception, which we use to signal when we are combining two incompatible types.

Grammar Type Implementation

struct
  type t = { 
      null : bool; 
      first  : C.t; 
      follow : C.t; 
    }

  exception TypeError of string 

  let char c = { 
      null = false;
      first = C.one c;
      follow = C.empty;
    }

char c returns a record, with false nullability (since the language {"c"} has no empty strings), a first set of {c} (since the first character of the only string in the language is c), and an empty follow set (since there are no nonempty strings you can append to "c" to get a string in {"c"}).

  let eps = { 
      null = true;
      first = C.empty;
      follow = C.empty;
    }

eps returns the type of the language containing only the empty string {""}. It is nullable, since it obviously contains the empty string, but the first and follow sets are empty since it has no strings with any characters in them.

  let seq t1 t2 = 
    let separate t1 t2 = 
      not t1.null
      && 
      C.disjoint t1.follow t2.first 
    in
    if separate t1 t2 then 
      { null = false;
        first = t1.first;
        follow = C.union t2.follow (if t2.null 
                                    then (C.union t2.first t1.follow) 
                                    else C.empty);
      }
    else 
      raise (TypeError "ambiguous sequencing")

Sequencing seq t1 t2 is interesting, because it is the first operator we have seen which can fail. We want to allow parsing the sequencing two languages L and M (with types t1 and t2) using a greedy strategy -- we want to match as much of L as we can, and then switch to looking for M strings. A way to test if this is ok is to check if (a) L does not contain the empty string, and (b) L's follow set is disjoint from M's first set. Because L has no empty strings, we can immediately start parsing for L, and because the first and follow sets are disjoint, we can always choose to keep parsing L as long as there are valid continuation characters for it.

This is what the separate function tests for.

The resulting composite type has false nullability, because every string in the concatenated language has one L-part and and one M-part, and the L-part is always nonempty. The first set is the same as t1's first set, since every string in the concatenated language starts with an L-string.

The follow set is a little tricky. A word w from the concatenation of L and M always consists of an L-part l, and an M-part m, so that w = append(l, m). Obviously any extension of m to a longer m' is still a word of L ; M, so the combined follow set of has to at least contain the follow set of M -- i.e., t2.follow.

However, if M contains the empty string, then L is a subset of L;M, and so all the follow characters of L -- t1.follow -- also have to be included, as well as the first characters of M.

  let alt t1 t2 = 
    let nonoverlapping t1 t2 = 
      not (t1.null && t2.null)
      &&
      C.disjoint t1.first t2.first
    in 
    if nonoverlapping t1 t2 then 
      {
        null = t1.null || t2.null;
        first = C.union t1.first t2.first;
        follow = C.union t1.follow t2.follow;
      }
    else
      raise (TypeError "ambiguous alternation")

  let bot = {
      null = false;
      first = C.empty;
      follow = C.empty;
    }

If t1 and t2 are types for languages L and M, then alt t1 t2 is the type of the union language L ∪ M. Again, typing can fail if the types might correspond to overlapping languages. We rule this out with the nonoverlapping predicate – it checks that at most one language can contain the empty string (i.e., at most one of t1.null and t2.null hold), and that t1.first and t2.first are disjoint – if the first sets are disjoint, then a word in L must start with different character than any character in M. If this condition is satisfied, then the resulting type just takes the union of the first and follow sets, and the OR of the two nullability predicates.

We also define the bot type, for the empty language {}. Since it has no strings in it, the bot.null field is false, and the first and follow sets are empty.

  let charset cs = 
    if C.is_empty cs then 
      bot
    else 
      { null = false;
        first = cs;
        follow = C.empty;
      } 

  let string s = 
    if String.length s = 0 then 
      eps
    else 
      char s.[0]

We can use this to define some convenience functions. charset cs returns the type of the languages containing one-character strings for each character in cs. If the set is empty, we return bot, and otherwise we return a type where the first set is cs, the follow set is empty, and the nullability predicate is false.

string s returns a type for the language {s} containing the single string s. If s is the empty string, we return eps, and otherwise we can return char s.[0], since the first set is the first character of s, the language is not nullable, and the follow set is empty since there are no possible continuations to s.

  let equal t1 t2 = 
    t1.null = t2.null
    && C.equal t1.first t2.first 
    && C.equal t1.follow t2.follow 

  let fix f = 
    let rec loop t = 
      let t' = f t in 
      if equal t t' then 
        t'
      else  
        loop t' 
    in 
    loop bot

Now, we can define the type equality function equal, and use it to define the fixed point function fix f, which returns the least fixed point of the function f. It works in the usual way – we start with the smallest type bot, and then repeatedly apply f to it until it stops changing. If f is monotone, then this loop has to terminate, because on each iteration, we must have added at least one character to follow or first, or changed null from false to true. As a result, there can be at most 511 iterations (= 255 + 255 + 1) before the loop cannot increase any more.


  let print out t = 
    let p fmt = Format.fprintf out fmt in 
    let print_set cs = 
      C.fold (fun c () -> p "%c" c) () cs
    in 
    let print_bool = function
      | true -> p "true"
      | false -> p "false"
    in
    p "{\n";
    p "   null = "; print_bool t.null; p ";\n";
    p "   first = C.of_string \""; print_set t.first; p "\";\n";
    p "   follow = C.of_string \""; print_set t.follow; p "\";\n";
    p "}\n"
end

We can also define a print function for debugging purposes.

Parsers

Parser Interface

module Parser: sig
  type 'a t 
  exception ParseFailure of int

  val (let+) : 'a t -> ('a -> 'b) -> 'b t 
  val map : ('a -> 'b) -> 'a t -> 'b t 

  val unit : unit t 
  val return : 'a -> 'a t 

  val seq : 'a t -> 'b t -> ('a * 'b) t 
  val (and+) : 'a t -> 'b t -> ('a * 'b) t 

  val char : char -> unit t 
  val charset : C.t -> char t 
  val string : string -> unit t 

  val fail : 'a t 
  val (||) : 'a t -> 'a t -> 'a t 

  val any : 'a t list -> 'a t 
  val fix : ('a t -> 'a t) -> 'a t 

  val parse : 'a t -> string -> int -> int * 'a 
end = 

Above, we give the signature to the Parser module, which actually constructs the parsers. There is an abstract type 'a t, which is the type of parsers which return a value of type 'a, and an exception ParseFailure, which returns the location at which parsing failed.

The parser type 'a t is a functor, which is why it supports a map operation.

The applicative (or Cartesian monoidal, if you are a semanticist) structure is exhibited by the unit and seq operations. unit is the parser that recognizes the empty string and immediately returns unit, and seq p1 p2 will try to run p1, then run p2, and return their paired return values.

map has the synonym let+, and seq has a synonym (and+), which together lets us use Ocaml's equivalent of Haskell do-notation for applicatives. So we can write a parser like:

   let+ x = p1 
   and+ y = p2 
   and+ z = p3 
   in 
   x + y + z 

This will run p1, then p2, then p3, and finally return the sum of the three parser values.

The char c, charset cs, and string s parsers build parsers for recognizing individual characters, sets of characters, or single strings. charset and string can be defined from char and the other combinators, but it is convenient to have them as part of the API.

fail is the parser that always fails, never recognising anything, and (||) is the binary alternation operator – p1 || p2 recognises the union of the languages p1 and p2 do. any ps is the n-ary version – it recognises any string that any of the parsers in the list ps do.

Finally, fix f is used to define grammars recursively. If f : 'a t -> 'a t, then fix f recognises the least fixed point of f.

All of these operations are allowed to signal a TypeError if a client tries to combine them in a way which would lead to ambiguity.

Parser Implementation

Below, I give the implementation of the Parser module.

struct
  type 'a t = { tp : Tp.t; parse : string -> int -> int * 'a }
  exception ParseFailure of int

The implementation type of a parser is actually a pair, consisting of a parser type (as defined in the Tp module), and a parsing function of type string -> int -> (int * 'a). So p.parse s i will take a string s, and begin parsing at i. If it succeeds, it will return (j,v), where j is where it finished parsing, and v is the return value of the parser. If the run of the parser failed, then it will raise the ParseFailure exception.

  let char c = 
    let p s i = 
      if i < String.length s && s.[i] = c then 
        (i+1, ())
      else
        raise (ParseFailure i)
    in 
    { tp = Tp.char c; parse = p }

The char c parser is one of the simplest. It constructs its type with a call to the Tp.char function, and constructs a parse function which checks if the next character is c, and returns unit on success and signals failure if there is no next character or it isn't c.

  let (let+) p f = 
    let p' s i = 
      let (i, v) = p.parse s i in 
      (i, f v)
    in 
    {tp = p.tp; parse = p'}

  let map f p = 
    let+ x = p in f x 

As we said before, the parser type 'a t is functorial -- it supports a map operation. All map f p has to do is to write a new parse function which runs p.parse, and then on success wraps the return value with a call to f.

We also define a synonym (let+) to enable the applicative syntax.

  let unit = { tp = Tp.eps; parse = fun s i -> (i, ()) }

  let return x = 
    let+ () = unit in x 


  let (and+) p1 p2 = 
    let p' s i = 
      let (i, a) = p1.parse s i in 
      let (i, b) = p2.parse s i in 
      (i, (a,b))
    in
    { tp = Tp.seq p1.tp p2.tp; parse = p' } 

  let seq = (and+)

unit is the simplest parser – it always succeeds without moving the position of the index into the string (ie, it recognises the empty string) and returns unit. return x is almost as simple: it also immediately succeeds, but returns the value x.

The seq (aka and+) operator sequences two parsers p1 and p2. Its parse function runs p1.parse, yielding an updated index and a value a, and then p1.parse is run starting from the new index, yielding the value b. Then, the final index is returned, alongside the pair (a,b).

Note that this function never backtracks: by construction, all of our parsers will turn out to be greedy, but this is ok since the follow set of p1 and the first set of p2 are known to be disjoint – there is never any risk that p1 will consume characters that are actually part of the start of p2. So we do not have to (as PCRE-style regexps must) do any backtracking.

The returned type simply uses Tp.seq on the two argument parsers' types.

  let charset cs = 
    let p s i = 
      if i < String.length s && C.mem s.[i] cs then 
        (i+1, s.[i])
      else
        raise (ParseFailure i)
    in 
    {tp = Tp.charset cs; parse = p }

  let string str = 
    let p s i = 
      if i + String.length str < String.length s then 
        let rec loop j = 
          if j < String.length str then 
            if str.[j] = s.[i + j] then 
              loop (j+1)
            else
              raise (ParseFailure (i+j))
          else
            (i+j, ())
        in
        loop 0 
      else
        raise (ParseFailure i)
    in
    {tp = Tp.string str; parse = p}        

charset cs and string s are just convenience parsers, which just check if the first character of the input is in the set cs, or whether the prefix of the output is s, respectively.

  let fail = 
    { tp = Tp.bot; 
      parse = fun s i -> raise (ParseFailure i) }

The fail parser always fails, no matter what the input is. This corresponds to recognising the empty language.

The alternation operator p1 || p2 below is the most complex parser definition.

  let (||) p1 p2 = 
    let p' s i = 
      if i < String.length s then 
        if C.mem s.[i] p1.tp.Tp.first then 
          p1.parse s i 
        else if C.mem s.[i] p2.tp.Tp.first then 
          p2.parse s i 
        else if p1.tp.Tp.null then
          p1.parse s i 
        else if p2.tp.Tp.null then 
          p2.parse s i 
        else
          raise (ParseFailure i)
      else if p1.tp.Tp.null then 
        p1.parse s i 
      else if p2.tp.Tp.null then 
        p2.parse s i 
      else 
        raise (ParseFailure i)
    in 
    {tp = Tp.alt p1.tp p2.tp; parse = p' }

  let any ps = List.fold_left (||) fail ps 

Alternation is the only parser combinator that uses the information from the types to decide how to proceed. The reason is that we do not want our parsers to backtrack – if they did, performance would switch from worst-case linear to worst-case exponential. (This is the reason that classic parser combinators have such dreadful performance.)

However, this means that when we see an alternation, we have to decide whether to parse with p1 or p2. In traditional LL(k) style, We do this by looking at the first character of the input. If it is in p1's first set, we parse with p1, and if it is in p2's first set, we parse with p2.

This is easy – except for the corner cases, of which there are two main ones. The first corner case is what we do when the input string is exhausted (ie, if the position we are to start looking at is at the end of the string). In this case, the only string that could match is the empty string, so we check to see whether either p1 or p2is nullable, and then we invoke the nullable parser (if any).

The other corner case is when there is still input left, but its first character doesn't match the first set of either parser. In this case, again the only possible matching string is the empty string, and so we check nullability again.

If none of these conditions match, then the string can't be parsed and we signal an error.

Once we have alternation, we can use it and fail to define the any combinator by folding over the list of parsers with ||.

  let fix f = 
    let g t = (f {fail with tp = t}).tp in
    let r = ref (fail.parse) in 
    let p = f {tp = (Tp.fix g); parse = fun s i -> !r s i} in 
    r := p.parse;
    p 

  let parse p s i = p.parse s i
end

Finally, the fixed point fix f is a little annoying to define, because f : 'a t -> 'a t, and this means that it takes a pair of a type and a parse function, and returns a pair of a type and a parse function. The type-level fixed point Tp.fix wants an argument of type Tp.t -> Tp.t, and the g function in the body turns f into this kind of function using a dummy parser argument to f.

To actually implement the parsing function, we turn to a trick called Landin's knot – we manually implement recursion by backpatching. First, we define r, a pointer to a failing parse function. Then, we invoke f with a parser which takes the type of the fixed point (Tp.fix g), and a parse function which just invokes the function stored in r. This yields a new parser, which we assign to r to tie the knot on the recursion.

The reason we do this, instead of defining the parse function q as

    let rec q s i =  (f {tp = Tp.fix g; parse = q}).parse s i 

is that this definition re-invokes f on each recursive step, whereas the imperative definition only invokes f once.

That's the whole implementation of the parser library! It's just over 300 lines of code.

A Sample Parser

Below, I give the implementation of a simple s-expression parser, which recognises parenthesized sequences of words, like (foo bar (baz (quux) ())).

module Test = struct
  open Parser

  let letter = charset (C.of_string "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

  let digit = charset (C.of_string "0123456789")
  let whitespace = charset (C.of_string " \t\n")

We can use our parsers to define basic parsers that recognise single letters, digits, or whitespace.

  (* val (==>) : 'a t -> ('a -> 'b) -> 'b t 
  let (==>) p f = let+ x = p in f x 

  (* val (>>) : 'a t -> 'b t -> 'a t *)
  let (>>) p1 p2 = 
    let+ x = p1 
    and+ _ = p2 in
    x

Next, I define a couple of utility combinators. ==> is an infix spelling of map, whose utility will become clear shortly.

p1 >> p2 is like seq, parsing with p1 and then p2, except that in this case the value returned by p2 is discarded. This is useful for doing things like discarding trailing whitespace.

  (* val star : 'a t -> 'a list t *)
  let star p = 
    fix (fun r -> 
           any [ 
               unit      ==> (fun () -> []); 
               (seq p r) ==> (fun (x, xs) -> x :: xs)
               ])

  (* val starskip : 'a t -> unit t *)
  let starskip p = 
    fix (fun r -> 
          any [ unit      ==> (fun () -> ());
                (seq p r) ==> (fun _ -> ()) ])

The star p combinator parses with p as many times as it can, and then returns a list of all the p-values that it was able to parse. starskip p is similar, except that it ignores its return value, always returning unit ().

  (* val symbol : string t *)
  let symbol = 
    let+ c = letter
    and+ cs = star letter 
    and+ _ = starskip whitespace 
    in
    let b = Buffer.create 0 in 
    List.iter (Buffer.add_char b) (c :: cs);
    Buffer.contents b

  (* val paren : 'a t -> 'a t *)
  let paren p = 
    let+ () = char '(' >> starskip whitespace 
    and+ x = p 
    and+ () = char ')' >> starskip whitespace 
    in
    x

The symbol parser parses a nonzero list of letters, followed by any amount of whitespace, and converts the list of characters into a string. This does allocate a lot more than it should, with one allocation per character – but for a blog post it's fine.

The paren p combinator call wraps the language recognised by p in parentheses. We also follow the calls to char '(' and char ')' with a starskip whitespace. This is one of the annoying things about writing scannerless parsers – you have to handle whitespace manually.

The discipline I use is to follow each logical token with trailing whitespace, but to never precede logical tokens with whitespace. This ensures that whitespace is never in the first set of a grammatical production, which makes sequencing work out reasonably well. It's still a PITA relative to explicitly tokenized lexer/parser pairs, though.

Finally, we can define the sexp type and parser.

  type sexp = 
    | Sym of string
    | Seq of sexp list 

  (* val sexp : sexp t *)
  let sexp = 
    fix (fun r -> 
          any [ 
              symbol         ==> (fun s -> Sym s);
              paren (star r) ==> (fun xs -> Seq xs)
            ])
end

An s-expression is either a symbol Sym s, or a list of sexpressions Seq ss. The parser for this is literally this definition – it is either symbol (which returns Sym s), or it is a parenthesized list of s-expressions (derived using star on the r recursive combinator.

This also demonstrates why I like the infix map ==> – it looks quite a bit like clausal definitions in pattern matching, or the parser actions in Menhir or Yacc.

Performance

So how fast does it run? Well, to test this out I generated some large s-expressions and parsed them. This was all benchmarked on an Intel i7-6700 running at 3.4 GHz, running Ubuntu 22.04, with the OCaml code all compiled to native code with ocamlopt.

When I run the parser combinator library on a 4.5 megabyte string, I get the following performance numbers.

  • String length: 4498771 bytes
  • Parser elapsed time: 0.736 sec
  • Parsing rate: 6.12e+06 bytes/sec

On a string 10x bigger, the performance of the combinators is slightly faster than a linear speedup would predict.

  • String length: 45007848 bytes
  • Parser elapsed time: 5.571 sec
  • Parsing rate: 8.08e+06 bytes/sec

I don't know why the 45 megabyte string parsers faster than the 4.5 megabyte one. There shouldn't be any fixed latencies of any significance, so I can only chalk it up to garbage collection weirdness.

I also wrote a small ocamllex/Menhir parser, and benchmarked it. The Menhir parser showed similar tendencies, though its absolute speed is higher. For the 4.5 megabyte s-expression, we get:

  • String length: 4498771 bytes
  • Parser elapsed time: 0.492 sec
  • Parsing rate: 9.14e+06 bytes/sec

For the 45 megabyte string, we get:

  • String length: 45007848 bytes
  • Parser elapsed time: 2.989 sec
  • Parsing rate: 1.51e+07 bytes/sec

Again, parsing a longer string shows a more-than-linear speedup. Since LR(1) parsing, like LL(1), is a linear time algorithm, this indicates some stuff going on in the runtime.

Comparing like-for-like, this little parser combinator library is 2/3 the speed of Menhir for the 4.5 megabyte file, and slightly more than 1/2 the speed on the 45 megabyte file.

This is much better performance than I expected! Because parser combinators are essentially big nested closures, I expected there to be a heavy performance penalty because we basically have to do an indirect jump for every character in the input.

This should be a recipe for Python-esque slowness, but I guess the processor architects at Intel really know how to design indirect branch predictors! I checked Agner Fog's optimisation guide, but apparently the design of the branch predictor in Skylake is undocumented. All Fog can say is that "Indirect jumps and indirect calls are predicted well."

The net result of this is that I have to update my beliefs: before my birthday, I thought that parser combinators were basically a bad idea. But now I think they can be a very reasonable parsing technique, even without doing fun stuff like using staging to achieve parser/lexer fusion!

9 comments:

  1. Given languages A = {a, ab} and B = {bc, cd}, the follow set computation described in your article is AB.follow = B.follow = {} (since B is not nullable).

    However the real follow set of AB is {d} since AB contains abc and abcd.

    ReplyDelete
  2. Hi, thanks for commenting!

    Yes, the follow set algorithm does not work in this case. That’s because it will only gives you the right answer when A is non empty and the follow set of A is disjoint from the first set of B. (This is Lemma 3.5.4 in our 2019 paper.)

    In your example A has a follow set of {b}, since ‘a’ is in the language, and can be extended to ‘ab’, and the first set of B is {b,c}. So they both contain ‘b’, and the algorithm for computing the concatenation of folllow sets is not expected to work.

    In the code above, the Tp.seq function checks this condition with a call to the separate function — if it fails, then that parser is rejected. (This typicallly ends up meaning you have to left-factor your grammars for our system to accept it.)

    ReplyDelete
  3. Hi Neel,

    thank you for the lovely article!

    I was wondering, why does the definition of the Follow[Last] set for sequencing differ from your 2019 paper in which it also used the First set of the right operand, i.e.

    ```
    follow = C.union t2.follow (if t2.null then (C.union t2.first t1.follow) else C.empty);
    ```

    I believe that it could constitute an observable difference: consider a grammar given by ‘a’ · (‘b’ | ε) · ‘b’ · ‘c’. When you try to parse the string “abc”, the parser throws a parse failure.
    However, when you use the modified definition of the Follow[Last] set from your 2019 paper, the grammar gets -- correctly, I believe -- rejected for ambiguous sequencing, so the parse failure never occurs in the first place.

    I’m attaching the relevant modification of your parser here: https://pastebin.com/vXCmN2fp

    Is it therefore a good idea to use the definition from the 2019 paper or does that cause other problems elsewhere?

    ReplyDelete
    Replies
    1. So updating the Gist (which is out of date) with the updated code in the blog post, I get the following exception when running the tests:

      ```
      Fatal error: exception Dune__exe__Gist.Tp.TypeError("ambiguous sequencing")
      ```

      If I update the place where the exception is thrown to print out the types being compared, I get:

      ```
      t1 = {
      null = false;
      first = C.of_string "(ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
      follow = C.of_string "
      ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
      }
      t2 = {
      null = true;
      first = C.of_string "(ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
      follow = C.of_string "";
      }
      ```

      Any idea what might be going wrong?

      Delete
    2. Could the issue be to do with the handling of whitespace between symbols? At the moment it seems that two symbols can be separated with a nullable space (based on the use of `starskip whitespace`).

      Delete
    3. Seems like the `null` and `first` sets are also different from the paper? I think they should be:

      null = t1.null && t2.null;
      first = C.union t1.first (if t1.null then t2.first else C.empty);

      Not sure if they will ever fire given the definition of `separate` though.

      I’ve been wondering if you could somehow expand the expressiveness of sequencing. Playing around with the s-expression grammar it seems like you often run into issues with the first parser being nullable, which can be a little frustrating. Surely that should be fine if the second parser is not nullable?

      Delete
  4. Hi, the 2019 paper is right, and there's a bug in this post. Thanks for catching it -- I've updated the post.

    ReplyDelete
  5. Respect and that i have a super supply: How To Become A House Renovation victorian house renovation

    ReplyDelete