My birthday just passed, and to relax I wrote a parser combinator library.
Over the last few years, I have worked quite a bit with Ningning Xie and Jeremy Yallop on parser combinators, which has led to a family of parser combinators which have optimal linear-time performance in theory, and which are many times faster than lex+yacc in practice.
But these use advanced multistage programming techniques, and I wondered how well we can do without staging. So in this blog post I will describe the implementation of a non-staged, but still efficient, parser combinator library in plain OCaml. This will be a lot slower than flap
, or even menhir/ocamlyacc
, but will still have pretty good linear-time performance.
The basic trick we developed is to check that the grammar that a user defines with our parser combinator library falls in (morally) the class of LL(1) grammars. The LL(1) grammars have the property that they can be parsed with a non-backtracking recursive descent algorithm, in linear time with only a single token of lookahead.
So how do we check this property? Well, first, let's think about how we can form grammars from smaller ones. In a BNF grammar, we define a set
E ::= num
| '(' E ')'
| E '+' E
| E '*' E
| E '-' E
Here. we can see that there are two main forms of building grammatical expressions.
We can build grammatical expressions by concatenation -- the clause
E '+' E
means that an expression can be a subexpressionE
, followed by the plus sign+
, finally followed by another subexpressionE
.We can build grammatical expression by alternation -- the clauses separated by a vertical bar
|
are alternatives, and the whole string matches if it matches any of the alternative expressions. So an expression can be either a numbernum
, a parenthesized expression( E )
, or a binary arithmetic operator.
Both of these forms of composition can introduce non-determinism. For example, if you see the expression 3 + 4
, a recursive descent parser cannot tell that it should take the addition branch (E '+' E
) rather than the multiplication branch (E '*' E
) without doing more than one token of lookahead (the first token is a 3
, which could be followed by anything).
Sequencing can also introduce nondeterminism. For example, the expression 3 + 4 + 5
can be parsed as either Add(3, Add(4, 5))
or as Add(Add(3, 4), 5)
-- this grammar does not tell us whether we should stop parsing the first subexpression at 3
, or continue to 3 + 4
.
So this grammar is not suitable for being parsed by recursive descent. To identify the grammars that can be so parsed, we need to do some static analysis of the grammar -- in fact, a type system. The components of the static analysis we use are intended to handle the two kinds of nondeterminism above.
To handle the nondeterminism introduced by alternation, we will compute first sets for our grammars. That is, we will work out which characters can occur as the first character of a string recognised by the grammar. So if our grammar recognises the set of strings {foo, bar, barton, foobar, quux}
, then its first set will be {f, b, q}
-- these are the first characters of the strings in the language.
We can use the first sets to rule out alternation nondeterminism, by requiring that each alternative in a grammar rule have a disjoint first set.
Handling sequential nondeterminism is a bit trickier. To do this, we have find pairs of words in the language, in which the first is a prefix of the second (for example, foo
and foobar
). Then, we return the first character of each suffix (for foo
and foobar
, we return b
). This set of characters is called the follow set of the language -- it is the set of characters which can follow a word in the language, which might be characters of a suffix of that word. So in the language above, the follow set is {b, t}
, due to foo/foobar
and bar/barton
. Because foo
can be followed by the string bar
to produce the word foobar
, b
is one of the characters in the follow set, since b
is the first character of bar
. Similarly, because bar
can followed by ton
to produce the word barton
, t
(as the first character of ton
) is one of the characters in the follow set.
Now, when we concatenate two grammatical expressions A B
, we check to see if the follow set of A
and the first set of B
are disjoint. This way, after parsing an A
-word, it is unambiguous whether to keep parsing an A
or switch to parsing a B
because the first character will tell us to either continue parsing A
(if it is in the follow set of A
) or switch to B
(if it is in the first set of B
).
(ASIDE: Technically, this is not the FOLLOW set as defined in (say) the Dragon book -- that is a property of grammars, not languages. The definition I use here is a superior one (sometimes called the follow-last set), which was invented by Derick Wood and Anne Brüggemann-Klein in their 1992 paper Deterministic Regular Languages. What makes it better is that it is a property of languages, not grammars, and is hence much better-behaved mathematically.)
Finally, we also track whether the language recognises the empty string or not. If both A
and B
can match the empty string, then A | B
is ambiguous – an empty string can match either one.
These three facts – the first set, the follow set, and nullability – are the static facts we will build for every parser combinator, and we will use it to (a) optimise the implementation of conforming combinators, and (b) reject combinators which don't satisfy the restrictions. We could also statically track whether the grammar is left-recursive or not – see my 2019 paper with Jeremy Yallop for how – but for this blog post I'll skip that.
Now, let's talk about the implementation. Since we want to calculate a first set and a follow set for each parser, we need a datatype to represent sets of characters. The Set
module in the OCaml standard library is a perfectly serviceable balanced binary tree, but because we want to work with sets of characters, there is a much more compact and faster representation: a bit set.
Since an 8-bit character only has 256 possible states, we can represent a set of characters with 256 bits (32 bytes), with each bit set to 0 if the character is absent from the set, and set to 1 if it is present. This does presume we look at the input at a time -- if we wanted to handle Unicode, likely the best approach would be to work with a canonicalised UTF-8 representation. I'll skip over that for now, since we can deal solely with ASCII for the moment.
This turns out to be pretty much the only optimisation needed to get decent performance -- the library below runs between one-half and two-thirds the speed of lex+menhir on a simple s-expression parsing benchmark.
Unlike many other parser combinator libraries, it does no backtracking (ie, runs in time linear in the input size). Furthermore, it also has a clean declarative reading of the language recognized -- ie, A | B and B | A recognise the same language, unlike in (eg) PEGs, or parser combinators with committed-choice operators. (If you are a parsing nerd, basically we are restricting to LL(1) grammars, modulo the difference between FOLLOW and FOLLOWLAST.)
Here is a link to a gist with the implementation. It's really short – only 300 lines of code for a full combinator library, with about another 150 lines to demonstrate and benchmark it.
Implementation
Character Sets
Below, I give the signature and implementation of an OCaml bitset library.
Character Set Interface
module C : sig
type t
val make : (char -> bool) -> t
val empty : t
val one : char -> t
val union : t -> t -> t
val inter : t -> t -> t
val top : t
val negate : t -> t
val mem : char -> t -> bool
val equal : t -> t -> bool
val is_empty : t -> bool
val disjoint : t -> t -> bool
val fold : (char -> 'a -> 'a) -> 'a -> t -> 'a
val of_string : string -> t
end =
This interface has an abstract type t
of bitsets, and a function make
which converts a predicate on characters into a bitset, one
which constructs a singleton set, along with the rest of the standard operations of boolean algebra (such as intersection, union, and complement). We also have a membership test mem
, as well as operations to test whether two sets are equal
, or whether they are empty (is_empty
) or nonoverlapping (disjoint
).
Character Set Implementation
The implementation of this is quite fun -- I don't have many chances to do bit-fiddling, so I cherish the ones that arise. I represent sets using OCaml's bytes
type, which are used, reasonably enough, to represent consecutive sequences of bytes.
struct
type t = bytes
let make f =
let open Int in
let open Char in
Bytes.init 32 (fun i ->
let b0 = shift_left (Bool.to_int (f (chr (i * 8 + 0)))) 0 in
let b1 = shift_left (Bool.to_int (f (chr (i * 8 + 1)))) 1 in
let b2 = shift_left (Bool.to_int (f (chr (i * 8 + 2)))) 2 in
let b3 = shift_left (Bool.to_int (f (chr (i * 8 + 3)))) 3 in
let b4 = shift_left (Bool.to_int (f (chr (i * 8 + 4)))) 4 in
let b5 = shift_left (Bool.to_int (f (chr (i * 8 + 5)))) 5 in
let b6 = shift_left (Bool.to_int (f (chr (i * 8 + 6)))) 6 in
let b7 = shift_left (Bool.to_int (f (chr (i * 8 + 7)))) 7 in
let (||) = logor in
Char.chr (b7 || b6 || b5 || b4 || b3 || b2 || b1 || b0))
The constructor make
takes a predicate f : char -> bool
, and then builds a bitset which contains exactly those elements which make f
true. To do this, we build a 32-byte array (since 8 * 32 = 256
). The Bytes.init
function initialises a byte at a time, so the initaliser has to make 8 calls to f
(at indices i*8 + 0
to i*8 + 7
) to figure out what each bit of the byte is. Then, we convert each boolean to an integer, and do a left shift to create an integer where the correct position is set to 1, and all the other bits are 0. Finally we take all these integers and OR them together, to get the byte that we want.
let mem c s =
let b = (Char.code c) / 8 in
let i = (Char.code c) mod 8 in
let w = Char.code (Bytes.get s b) in
Int.logand (Int.shift_left 1 i) w > 0
The mem
function also does a bit of bit-fiddling. We look up the correct byte and bit-offset corresponding to the input character, and check to see if that bit is set in the corresponding byte of the bitset.
The rest of the operations can be defined in terms of make
and mem
-- they are the only ones which need any bit-twiddling. ~~ {.ocaml} let empty = make (fun c -> false) let top = make (fun c -> true) let one c = make (fun c' -> c = c') let union s1 s2 = make (fun c -> mem c s1 || mem c s2) let inter s1 s2 = make (fun c -> mem c s1 && mem c s2) let negate s = make (fun c -> not (mem c s))
let equal s1 s2 = let rec loop i acc = if i = 32 then acc else loop (i+1) (acc && (Bytes.get s1 i = Bytes.get s2 i)) in loop 0 true
let is_empty s = equal s empty
let disjoint s1 s2 = is_empty (inter s1 s2) ~~
We do some trivial bit-twiddling anyway with the equal
function: rather than comparing character by character, we just check to see if the two bitsets are equal with a byte-by-byte comparison.
Emptiness checking is_empty
and disjointness checking disjoint
can now be defined in terms of the Boolean operations and equality. We can also define a fold
function for looping over a bitset, and we also have a of_string
function to build sets from strings.
let fold f init s =
let rec loop i acc =
if i > 255 then
acc
else
let c = Char.chr i in
if mem c s then
loop (i+1) (f c acc)
else
loop (i+1) acc
in
loop 0 init
let of_string str =
let p c = String.contains str c in
make p
end
Grammar Types
Grammar Type Interface
This is enough to let us implement the Tp
module for parser types.
module Tp : sig
type t = { null : bool; first : C.t; follow : C.t; }
exception TypeError of string
val char : char -> t
val eps : t
val seq : t -> t -> t
val charset : C.t -> t
val string : string -> t
val alt : t -> t -> t
val bot : t
val equal : t -> t -> bool
val fix : (t -> t) -> t
val print : Format.formatter -> t -> unit
end
end =
A type is a triple of a nullability predicate, a first set, and a follow set. This API has a bunch of operators for building semantic types:
char c
yields the semantic type for the singleton language containing the characterc
,eps
is the type of the language containing only the empty string,seq t1 t2
takes two typest1
andt2
, and returns the type corresponding to the sequencing of two languages with those two typesbot
is the type of the empty language,alt t1 t2
takes two typest1
andt2
, and returns the type corresponding to the union of the languages typed byt1
andt2
fix f
takes a functionf : t -> t
, and returns the smallest typet
such thatf(t) = t
.charset cs
returns the type of the language recognising exactly the single-character strings incs
.
Some of these combinations are impossible due to our ambiguity rules. For example, we don't want to admit the union of two languages both containing the empty string, because then it would be ambiguous whether to take the left or right branch.
So we also have a TypeError
exception, which we use to signal when we are combining two incompatible types.
Grammar Type Implementation
struct
type t = {
null : bool;
first : C.t;
follow : C.t;
}
exception TypeError of string
let char c = {
null = false;
first = C.one c;
follow = C.empty;
}
char c
returns a record, with false nullability (since the language {"c"}
has no empty strings), a first set of {c}
(since the first character of the only string in the language is c
), and an empty follow set (since there are no nonempty strings you can append to "c"
to get a string in {"c"}
).
let eps = {
null = true;
first = C.empty;
follow = C.empty;
}
eps
returns the type of the language containing only the empty string {""}
. It is nullable, since it obviously contains the empty string, but the first and follow sets are empty since it has no strings with any characters in them.
let seq t1 t2 =
let separate t1 t2 =
not t1.null
&&
C.disjoint t1.follow t2.first
in
if separate t1 t2 then
{ null = false;
first = t1.first;
follow = C.union t2.follow (if t2.null
then (C.union t2.first t1.follow)
else C.empty);
}
else
raise (TypeError "ambiguous sequencing")
Sequencing seq t1 t2
is interesting, because it is the first operator we have seen which can fail. We want to allow parsing the sequencing two languages L
and M
(with types t1
and t2
) using a greedy strategy -- we want to match as much of L
as we can, and then switch to looking for M
strings. A way to test if this is ok is to check if (a) L
does not contain the empty string, and (b) L
's follow set is disjoint from M
's first set. Because L
has no empty strings, we can immediately start parsing for L
, and because the first and follow sets are disjoint, we can always choose to keep parsing L
as long as there are valid continuation characters for it.
This is what the separate
function tests for.
The resulting composite type has false
nullability, because every string in the concatenated language has one L
-part and and one M
-part, and the L
-part is always nonempty. The first set is the same as t1
's first set, since every string in the concatenated language starts with an L
-string.
The follow set is a little tricky. A word w
from the concatenation of L
and M
always consists of an L
-part l
, and an M
-part m
, so that w = append(l, m)
. Obviously any extension of m
to a longer m'
is still a word of L ; M
, so the combined follow set of has to at least contain the follow set of M
-- i.e., t2.follow
.
However, if M
contains the empty string, then L
is a subset of L;M
, and so all the follow characters of L
-- t1.follow
-- also have to be included, as well as the first characters of M
.
let alt t1 t2 =
let nonoverlapping t1 t2 =
not (t1.null && t2.null)
&&
C.disjoint t1.first t2.first
in
if nonoverlapping t1 t2 then
{
null = t1.null || t2.null;
first = C.union t1.first t2.first;
follow = C.union t1.follow t2.follow;
}
else
raise (TypeError "ambiguous alternation")
let bot = {
null = false;
first = C.empty;
follow = C.empty;
}
If t1
and t2
are types for languages L
and M
, then alt t1 t2
is the type of the union language L ∪ M
. Again, typing can fail if the types might correspond to overlapping languages. We rule this out with the nonoverlapping
predicate – it checks that at most one language can contain the empty string (i.e., at most one of t1.null
and t2.null
hold), and that t1.first
and t2.first
are disjoint – if the first sets are disjoint, then a word in L
must start with different character than any character in M
. If this condition is satisfied, then the resulting type just takes the union of the first and follow sets, and the OR of the two nullability predicates.
We also define the bot
type, for the empty language {}
. Since it has no strings in it, the bot.null
field is false, and the first and follow sets are empty.
let charset cs =
if C.is_empty cs then
bot
else
{ null = false;
first = cs;
follow = C.empty;
}
let string s =
if String.length s = 0 then
eps
else
char s.[0]
We can use this to define some convenience functions. charset cs
returns the type of the languages containing one-character strings for each character in cs
. If the set is empty, we return bot
, and otherwise we return a type where the first set is cs
, the follow set is empty, and the nullability predicate is false.
string s
returns a type for the language {s}
containing the single string s
. If s
is the empty string, we return eps
, and otherwise we can return char s.[0]
, since the first set is the first character of s
, the language is not nullable, and the follow set is empty since there are no possible continuations to s
.
let equal t1 t2 =
t1.null = t2.null
&& C.equal t1.first t2.first
&& C.equal t1.follow t2.follow
let fix f =
let rec loop t =
let t' = f t in
if equal t t' then
t'
else
loop t'
in
loop bot
Now, we can define the type equality function equal
, and use it to define the fixed point function fix f
, which returns the least fixed point of the function f
. It works in the usual way – we start with the smallest type bot
, and then repeatedly apply f
to it until it stops changing. If f
is monotone, then this loop has to terminate, because on each iteration, we must have added at least one character to follow
or first
, or changed null
from false to true. As a result, there can be at most 511 iterations (= 255 + 255 + 1) before the loop cannot increase any more.
let print out t =
let p fmt = Format.fprintf out fmt in
let print_set cs =
C.fold (fun c () -> p "%c" c) () cs
in
let print_bool = function
| true -> p "true"
| false -> p "false"
in
p "{\n";
p " null = "; print_bool t.null; p ";\n";
p " first = C.of_string \""; print_set t.first; p "\";\n";
p " follow = C.of_string \""; print_set t.follow; p "\";\n";
p "}\n"
end
We can also define a print
function for debugging purposes.
Parsers
Parser Interface
module Parser: sig
type 'a t
exception ParseFailure of int
val (let+) : 'a t -> ('a -> 'b) -> 'b t
val map : ('a -> 'b) -> 'a t -> 'b t
val unit : unit t
val return : 'a -> 'a t
val seq : 'a t -> 'b t -> ('a * 'b) t
val (and+) : 'a t -> 'b t -> ('a * 'b) t
val char : char -> unit t
val charset : C.t -> char t
val string : string -> unit t
val fail : 'a t
val (||) : 'a t -> 'a t -> 'a t
val any : 'a t list -> 'a t
val fix : ('a t -> 'a t) -> 'a t
val parse : 'a t -> string -> int -> int * 'a
end =
Above, we give the signature to the Parser
module, which actually constructs the parsers. There is an abstract type 'a t
, which is the type of parsers which return a value of type 'a
, and an exception ParseFailure
, which returns the location at which parsing failed.
The parser type 'a t
is a functor, which is why it supports a map
operation.
The applicative (or Cartesian monoidal, if you are a semanticist) structure is exhibited by the unit
and seq
operations. unit
is the parser that recognizes the empty string and immediately returns unit, and seq p1 p2
will try to run p1
, then run p2
, and return their paired return values.
map
has the synonym let+
, and seq
has a synonym (and+)
, which together lets us use Ocaml's equivalent of Haskell do-notation for applicatives. So we can write a parser like:
let+ x = p1
and+ y = p2
and+ z = p3
in
x + y + z
This will run p1
, then p2
, then p3
, and finally return the sum of the three parser values.
The char c
, charset cs
, and string s
parsers build parsers for recognizing individual characters, sets of characters, or single strings. charset
and string
can be defined from char
and the other combinators, but it is convenient to have them as part of the API.
fail
is the parser that always fails, never recognising anything, and (||)
is the binary alternation operator – p1 || p2
recognises the union of the languages p1
and p2
do. any ps
is the n-ary version – it recognises any string that any of the parsers in the list ps
do.
Finally, fix f
is used to define grammars recursively. If f : 'a t -> 'a t
, then fix f
recognises the least fixed point of f
.
All of these operations are allowed to signal a TypeError
if a client tries to combine them in a way which would lead to ambiguity.
Parser Implementation
Below, I give the implementation of the Parser
module.
struct
type 'a t = { tp : Tp.t; parse : string -> int -> int * 'a }
exception ParseFailure of int
The implementation type of a parser is actually a pair, consisting of a parser type (as defined in the Tp
module), and a parsing function of type string -> int -> (int * 'a)
. So p.parse s i
will take a string s
, and begin parsing at i
. If it succeeds, it will return (j,v)
, where j
is where it finished parsing, and v
is the return value of the parser. If the run of the parser failed, then it will raise the ParseFailure
exception.
let char c =
let p s i =
if i < String.length s && s.[i] = c then
(i+1, ())
else
raise (ParseFailure i)
in
{ tp = Tp.char c; parse = p }
The char c
parser is one of the simplest. It constructs its type with a call to the Tp.char
function, and constructs a parse function which checks if the next character is c
, and returns unit on success and signals failure if there is no next character or it isn't c
.
let (let+) p f =
let p' s i =
let (i, v) = p.parse s i in
(i, f v)
in
{tp = p.tp; parse = p'}
let map f p =
let+ x = p in f x
As we said before, the parser type 'a t
is functorial -- it supports a map
operation. All map f p
has to do is to write a new parse function which runs p.parse
, and then on success wraps the return value with a call to f
.
We also define a synonym (let+)
to enable the applicative syntax.
let unit = { tp = Tp.eps; parse = fun s i -> (i, ()) }
let return x =
let+ () = unit in x
let (and+) p1 p2 =
let p' s i =
let (i, a) = p1.parse s i in
let (i, b) = p2.parse s i in
(i, (a,b))
in
{ tp = Tp.seq p1.tp p2.tp; parse = p' }
let seq = (and+)
unit
is the simplest parser – it always succeeds without moving the position of the index into the string (ie, it recognises the empty string) and returns unit. return x
is almost as simple: it also immediately succeeds, but returns the value x
.
The seq
(aka and+
) operator sequences two parsers p1
and p2
. Its parse function runs p1.parse
, yielding an updated index and a value a
, and then p1.parse
is run starting from the new index, yielding the value b
. Then, the final index is returned, alongside the pair (a,b)
.
Note that this function never backtracks: by construction, all of our parsers will turn out to be greedy, but this is ok since the follow set of p1
and the first set of p2
are known to be disjoint – there is never any risk that p1
will consume characters that are actually part of the start of p2
. So we do not have to (as PCRE-style regexps must) do any backtracking.
The returned type simply uses Tp.seq
on the two argument parsers' types.
let charset cs =
let p s i =
if i < String.length s && C.mem s.[i] cs then
(i+1, s.[i])
else
raise (ParseFailure i)
in
{tp = Tp.charset cs; parse = p }
let string str =
let p s i =
if i + String.length str < String.length s then
let rec loop j =
if j < String.length str then
if str.[j] = s.[i + j] then
loop (j+1)
else
raise (ParseFailure (i+j))
else
(i+j, ())
in
loop 0
else
raise (ParseFailure i)
in
{tp = Tp.string str; parse = p}
charset cs
and string s
are just convenience parsers, which just check if the first character of the input is in the set cs
, or whether the prefix of the output is s
, respectively.
let fail =
{ tp = Tp.bot;
parse = fun s i -> raise (ParseFailure i) }
The fail
parser always fails, no matter what the input is. This corresponds to recognising the empty language.
The alternation operator p1 || p2
below is the most complex parser definition.
let (||) p1 p2 =
let p' s i =
if i < String.length s then
if C.mem s.[i] p1.tp.Tp.first then
p1.parse s i
else if C.mem s.[i] p2.tp.Tp.first then
p2.parse s i
else if p1.tp.Tp.null then
p1.parse s i
else if p2.tp.Tp.null then
p2.parse s i
else
raise (ParseFailure i)
else if p1.tp.Tp.null then
p1.parse s i
else if p2.tp.Tp.null then
p2.parse s i
else
raise (ParseFailure i)
in
{tp = Tp.alt p1.tp p2.tp; parse = p' }
let any ps = List.fold_left (||) fail ps
Alternation is the only parser combinator that uses the information from the types to decide how to proceed. The reason is that we do not want our parsers to backtrack – if they did, performance would switch from worst-case linear to worst-case exponential. (This is the reason that classic parser combinators have such dreadful performance.)
However, this means that when we see an alternation, we have to decide whether to parse with p1
or p2
. In traditional LL(k) style, We do this by looking at the first character of the input. If it is in p1
's first set, we parse with p1
, and if it is in p2
's first set, we parse with p2
.
This is easy – except for the corner cases, of which there are two main ones. The first corner case is what we do when the input string is exhausted (ie, if the position we are to start looking at is at the end of the string). In this case, the only string that could match is the empty string, so we check to see whether either p1
or p2
is nullable, and then we invoke the nullable parser (if any).
The other corner case is when there is still input left, but its first character doesn't match the first set of either parser. In this case, again the only possible matching string is the empty string, and so we check nullability again.
If none of these conditions match, then the string can't be parsed and we signal an error.
Once we have alternation, we can use it and fail
to define the any
combinator by folding over the list of parsers with ||
.
let fix f =
let g t = (f {fail with tp = t}).tp in
let r = ref (fail.parse) in
let p = f {tp = (Tp.fix g); parse = fun s i -> !r s i} in
r := p.parse;
p
let parse p s i = p.parse s i
end
Finally, the fixed point fix f
is a little annoying to define, because f : 'a t -> 'a t
, and this means that it takes a pair of a type and a parse function, and returns a pair of a type and a parse function. The type-level fixed point Tp.fix
wants an argument of type Tp.t -> Tp.t
, and the g
function in the body turns f
into this kind of function using a dummy parser argument to f
.
To actually implement the parsing function, we turn to a trick called Landin's knot – we manually implement recursion by backpatching. First, we define r
, a pointer to a failing parse function. Then, we invoke f
with a parser which takes the type of the fixed point (Tp.fix g
), and a parse function which just invokes the function stored in r
. This yields a new parser, which we assign to r
to tie the knot on the recursion.
The reason we do this, instead of defining the parse function q
as
let rec q s i = (f {tp = Tp.fix g; parse = q}).parse s i
is that this definition re-invokes f
on each recursive step, whereas the imperative definition only invokes f
once.
That's the whole implementation of the parser library! It's just over 300 lines of code.
A Sample Parser
Below, I give the implementation of a simple s-expression parser, which recognises parenthesized sequences of words, like (foo bar (baz (quux) ()))
.
module Test = struct
open Parser
let letter = charset (C.of_string "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
let digit = charset (C.of_string "0123456789")
let whitespace = charset (C.of_string " \t\n")
We can use our parsers to define basic parsers that recognise single letters, digits, or whitespace.
(* val (==>) : 'a t -> ('a -> 'b) -> 'b t
let (==>) p f = let+ x = p in f x
(* val (>>) : 'a t -> 'b t -> 'a t *)
let (>>) p1 p2 =
let+ x = p1
and+ _ = p2 in
x
Next, I define a couple of utility combinators. ==>
is an infix spelling of map
, whose utility will become clear shortly.
p1 >> p2
is like seq
, parsing with p1
and then p2
, except that in this case the value returned by p2
is discarded. This is useful for doing things like discarding trailing whitespace.
(* val star : 'a t -> 'a list t *)
let star p =
fix (fun r ->
any [
unit ==> (fun () -> []);
(seq p r) ==> (fun (x, xs) -> x :: xs)
])
(* val starskip : 'a t -> unit t *)
let starskip p =
fix (fun r ->
any [ unit ==> (fun () -> ());
(seq p r) ==> (fun _ -> ()) ])
The star p
combinator parses with p
as many times as it can, and then returns a list of all the p
-values that it was able to parse. starskip p
is similar, except that it ignores its return value, always returning unit ()
.
(* val symbol : string t *)
let symbol =
let+ c = letter
and+ cs = star letter
and+ _ = starskip whitespace
in
let b = Buffer.create 0 in
List.iter (Buffer.add_char b) (c :: cs);
Buffer.contents b
(* val paren : 'a t -> 'a t *)
let paren p =
let+ () = char '(' >> starskip whitespace
and+ x = p
and+ () = char ')' >> starskip whitespace
in
x
The symbol
parser parses a nonzero list of letters, followed by any amount of whitespace, and converts the list of characters into a string. This does allocate a lot more than it should, with one allocation per character – but for a blog post it's fine.
The paren p
combinator call wraps the language recognised by p
in parentheses. We also follow the calls to char '('
and char ')'
with a starskip whitespace
. This is one of the annoying things about writing scannerless parsers – you have to handle whitespace manually.
The discipline I use is to follow each logical token with trailing whitespace, but to never precede logical tokens with whitespace. This ensures that whitespace is never in the first set of a grammatical production, which makes sequencing work out reasonably well. It's still a PITA relative to explicitly tokenized lexer/parser pairs, though.
Finally, we can define the sexp
type and parser.
type sexp =
| Sym of string
| Seq of sexp list
(* val sexp : sexp t *)
let sexp =
fix (fun r ->
any [
symbol ==> (fun s -> Sym s);
paren (star r) ==> (fun xs -> Seq xs)
])
end
An s-expression is either a symbol Sym s
, or a list of sexpressions Seq ss
. The parser for this is literally this definition – it is either symbol (which returns Sym s
), or it is a parenthesized list of s-expressions (derived using star
on the r
recursive combinator.
This also demonstrates why I like the infix map ==>
– it looks quite a bit like clausal definitions in pattern matching, or the parser actions in Menhir or Yacc.
Performance
So how fast does it run? Well, to test this out I generated some large s-expressions and parsed them. This was all benchmarked on an Intel i7-6700 running at 3.4 GHz, running Ubuntu 22.04, with the OCaml code all compiled to native code with ocamlopt
.
When I run the parser combinator library on a 4.5 megabyte string, I get the following performance numbers.
- String length: 4498771 bytes
- Parser elapsed time: 0.736 sec
- Parsing rate: 6.12e+06 bytes/sec
On a string 10x bigger, the performance of the combinators is slightly faster than a linear speedup would predict.
- String length: 45007848 bytes
- Parser elapsed time: 5.571 sec
- Parsing rate: 8.08e+06 bytes/sec
I don't know why the 45 megabyte string parsers faster than the 4.5 megabyte one. There shouldn't be any fixed latencies of any significance, so I can only chalk it up to garbage collection weirdness.
I also wrote a small ocamllex/Menhir parser, and benchmarked it. The Menhir parser showed similar tendencies, though its absolute speed is higher. For the 4.5 megabyte s-expression, we get:
- String length: 4498771 bytes
- Parser elapsed time: 0.492 sec
- Parsing rate: 9.14e+06 bytes/sec
For the 45 megabyte string, we get:
- String length: 45007848 bytes
- Parser elapsed time: 2.989 sec
- Parsing rate: 1.51e+07 bytes/sec
Again, parsing a longer string shows a more-than-linear speedup. Since LR(1) parsing, like LL(1), is a linear time algorithm, this indicates some stuff going on in the runtime.
Comparing like-for-like, this little parser combinator library is 2/3 the speed of Menhir for the 4.5 megabyte file, and slightly more than 1/2 the speed on the 45 megabyte file.
This is much better performance than I expected! Because parser combinators are essentially big nested closures, I expected there to be a heavy performance penalty because we basically have to do an indirect jump for every character in the input.
This should be a recipe for Python-esque slowness, but I guess the processor architects at Intel really know how to design indirect branch predictors! I checked Agner Fog's optimisation guide, but apparently the design of the branch predictor in Skylake is undocumented. All Fog can say is that "Indirect jumps and indirect calls are predicted well."
The net result of this is that I have to update my beliefs: before my birthday, I thought that parser combinators were basically a bad idea. But now I think they can be a very reasonable parsing technique, even without doing fun stuff like using staging to achieve parser/lexer fusion!