If you want to understand bottom-up parsing, one of the best places to start is Djikstra's shunting yard algorithm for parsing arithmetic expressions. Unlike general LR parsing, it maintains separate stacks for the intermediate parse trees and the parsing problems to be done. It's simple enough, in fact, that for simple problems my go-to parsing method is to use a recursive descent which makes subroutine calls to Djikstra's algorithm. (For more complicated problems, I use yacc and profanity.)
Recently, I read Danvy and Millikin's Refunctionalization at Work, which included as an example of refunctionalization Djikstra's shunting yard algorithm. What's really elegant about their presentation of this algorithm is that they show how to encode the control structure into the call stack (the way functions are supposed to work).
type token = LIT of int | PLUS | TIMES | LPAREN | RPAREN type exp = Int of int | Add of exp * exp | Mul of exp * exp let rec exp = function | ([], [e]) -> e | (LIT n :: ts, es) -> exp (ts, Int n :: es) | (PLUS :: ts, es) -> exp (add (ts, es)) | (TIMES :: ts, es) -> exp (mul (ts, es)) | (LPAREN :: ts, es) -> exp (par (ts, es)) | (ts, es) -> failwith "nil: no parse" and add = function | ([], e2 :: e1 :: es) -> ([], Add(e1, e2) :: es) | (LIT n :: ts, es) -> add (ts, Int n :: es) | ((PLUS :: _) as ts, e2 :: e1 :: es) -> (ts, Add(e1, e2) :: es) | (TIMES :: ts, es) -> add (mul (ts, es)) | (LPAREN :: ts, es) -> add (par (ts, es)) | ((RPAREN :: _) as ts, e2 :: e1 :: es) -> (ts, Add(e1, e2) :: es) | (ts, es) -> failwith "add: no parse" and mul = function | ([], e2 :: e1 :: es) -> ([], Mul(e1, e2) :: es) | (LIT n :: ts, es) -> mul (ts, Int n :: es) | ((PLUS :: _) as ts, e2 :: e1 :: es) -> (ts, Mul(e1, e2) :: es) | ((TIMES :: _) as ts, e2 :: e1 :: es) -> (ts, Mul(e1, e2) :: es) | (LPAREN :: ts, es) -> add (par (ts, es)) | ((RPAREN :: _) as ts, e2 :: e1 :: es) -> (ts, Mul(e1, e2) :: es) | (ts, es) -> failwith "mul: no parse" and par = function | (LIT n :: ts, es) -> par (ts, Int n :: es) | (PLUS :: ts, es) -> par (add (ts, es)) | (TIMES :: ts, es) -> par (mul (ts, es)) | (LPAREN :: ts, es) -> par (par (ts, es)) | (RPAREN :: ts, es) -> (ts, es) | (ts, es) -> failwith "par: no parse" let parse ts = exp (ts, [])
This is really very nice, and make some of the more opaque features of bottom-up parsing much more comprehensible. In particular, shift actions correspond to tail calls in the parsing procedures, and uses of the goto table correspond precisely to non-tail calls. For example, in the fourth line of add
, we recursively call the mul
procedure, and then resume parsing in the add
mode once mul
returns.
One natural question is whether we could profit by making the partial parse stack es
implicit (eg, by switching to monadic style). (Eg, are there any equational derivations that this would simplify?) Another is whether this style works for general LR(k) parsing. If so, it could yield a significantly better exposition of LR parsing than the standard automata-theoretic accounts.