Friday, October 13, 2023

The Ackermann-Péter Function in Gödel's T

The Ackermann-Péter function is defined as:

A : ℕ × ℕ → ℕ
A(0, n)     = n+1
A(m+1, 0)   = A(m, 1)
A(m+1, n+1) = A(m, A(m+1, n))

Famously, it is not expressible as a primitive recursive function. However, in lecture I showed my students that it is possible to define this function in Gödel's T. That is to say, the Ackermann-Péter function is nevertheless "higher-order primitive recursive". However, I didn't have time to really explain where the definition I supplied them with came from, an issue I wanted to rectify with this blog post.

The best way to explain it is to show how the definition in Gödel's T can be derived from the original definition. To do so, let's start by writing A in a curried style:

A : ℕ → ℕ → ℕ
A 0     n     = n+1
A (m+1) 0     = A m 1
A (m+1) (n+1) = A m (A (m+1) n)

Note that the pair of clauses for A (m+1) look like a primitive recursive definition over the second argument. Following this idea, let's abuse notation and define a function B which is used instead of the direct call to A (m+1):

B : ℕ → ℕ → ℕ
B m 0     = A m 1
B m (n+1) = A m (B m n)

A : ℕ → ℕ → ℕ
A 0     n     = n+1
A (m+1) n     = B m n

This looks a bit like we are going backwards: to define B, we needed to use mutual recursion. Unfortunately, this is not a feature of Gödel's T. However, note that our definition of B only uses its m argument to pass it to A. As a result, if we pass B the function A m as an argument, we can both remove the m argument and the mutual recursion with A:

B : (ℕ → ℕ) → ℕ → ℕ
B f 0     = f 1
B f (n+1) = f (B f n)

A : ℕ → ℕ → ℕ
A 0     n     = n+1
A (m+1) n     = B (A m) n

These are two independent, iterative functions now! We could stop right here, and turn them from functional equations into terms in Gödel's T, except that I find the first clause of B, that B f 0 = f 1, a bit inelegant. Let's define a new function, repeat:

repeat : (ℕ → ℕ) → ℕ → (ℕ → ℕ)
repeat f 0     = id
repeat f (n+1) = f ∘ (repeat f n)

The call repeat f n just defines the n-fold self-composition of the function f. I like this more than the raw definition of B, since it is a fairly mathematically natural notion. However, we can still use repeat to define B:

B : (ℕ → ℕ) → ℕ → ℕ
B f n = repeat f n (f 1)
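As a sanity check, repeat and B can be transcribed directly into OCaml (an illustrative transcription using general recursion, so it is a sketch rather than an actual term of T):

```ocaml
(* repeat f n is the n-fold self-composition of f; written with
   general recursion here, purely for illustration. *)
let rec repeat f n = if n = 0 then (fun x -> x) else (fun x -> f (repeat f (n - 1) x))

(* B f n = repeat f n (f 1), as in the equations above *)
let b f n = repeat f n (f 1)
```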

Putting this all together, we can define the whole Ackermann-Péter function as:

repeat : (ℕ → ℕ) → ℕ → (ℕ → ℕ)
repeat f 0 = id
repeat f (n+1) = f ∘ (repeat f n)

B : (ℕ → ℕ) → ℕ → ℕ
B f n = repeat f n (f 1)

A : ℕ → ℕ → ℕ
A 0     n  = n+1
A (m+1) n  = B (A m) n

Since B is neither recursive, nor used more than once, it hardly makes sense as a separate definition. So we can inline it and get:

repeat : (ℕ → ℕ) → ℕ → (ℕ → ℕ)
repeat f 0     = id
repeat f (n+1) = f ∘ (repeat f n)

A : ℕ → ℕ → ℕ
A 0     n = n+1
A (m+1) n = repeat (A m) n ((A m) 1)

Note that the right-hand-side of the A (m+1) n clause only contains occurrences of A m. Hence, we can write this whole thing as a primitive iteration.

repeat : (ℕ → ℕ) → ℕ → (ℕ → ℕ)
repeat = λf:ℕ→ℕ. λn:ℕ. iter(n, 0 ↦ id,
                               S(r) ↦ f ∘ r)

A : ℕ → ℕ → ℕ
A = λm:ℕ. iter(m, 0 ↦ (λn:ℕ. S(n)),
                  S(r) ↦ (λn:ℕ. repeat r n (r 1)))

And this is just the definition I gave in lecture!
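For readers who want to run the final definition, here is a transcription into OCaml, where an iter helper stands in for T's iterator (an assumption of this sketch: it is implemented with general recursion, so this is again a sketch rather than a T term):

```ocaml
(* iter n z s computes s (s (... (s z))), with n applications of s;
   it plays the role of Gödel's T's iterator in this sketch. *)
let rec iter n z s = if n = 0 then z else s (iter (n - 1) z s)

(* repeat f n is the n-fold self-composition of f *)
let repeat f n = iter n (fun x -> x) (fun r x -> f (r x))

(* the Ackermann-Péter function, by primitive iteration on the
   first argument, exactly as in the derivation above *)
let ackermann m = iter m (fun n -> n + 1) (fun r n -> repeat r n (r 1))

(* ackermann 2 3 = 9, and ackermann 3 3 = 61 *)
```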

Thursday, September 21, 2023

Actually, We Will Read Your Thesis

One of the weirder (and wronger) bits of "folk wisdom" surrounding the PhD is that no one reads PhD dissertations. If you look online, you'll find many posts like this one, which ask why anyone writes PhD theses when no one reads them.

This would be a good question, if it were based on a true premise. In fact, if you write even a moderately decent thesis, it will probably be read by many researchers in your field. This is for a simple reason: a thesis has a much higher page limit than any of the papers upon which it is based. As a result, there's more room for the author to explain their results. Furthermore, because it is the author's second attempt at explaining the material, they will also have a better idea of how to explain their results.

When a PhD student makes a discovery or an invention, they and their collaborators will write a paper about it. Papers are targeted at the leading experts in the field, and since the leading experts (pretty much in virtue of being experts) already understand all the basics and know where the state-of-the-art is, research papers are usually diffs. We explain only the novelties of the work, leaving the shared basics of understanding unexplained and tacit.

While this may be fine for experts, it makes accessing the literature difficult for novices (or even experts in related areas). It's like joining a long-running conversation in the middle: there will be a lot of shared context that the new reader lacks.

This is what makes PhD theses so valuable. Because there is much more space, the author of a PhD thesis has the room to explain what they did, from whatever point they think makes sense, rather than solely in terms of what the delta is. So whenever I find a paper I don't understand, I start looking for the PhD thesis based on it. Nine times out of ten, the thesis is vastly more understandable: "obvious" lemmas will have explicit proofs, algorithms will have detailed pseudocode, and the right intuitions and perspectives to take about the topic will be spelled out.

So why is there so much negative folklore around it? My hunch is that it is a way for students to psyche themselves into actually submitting their dissertation. Many PhD students are perfectionists (since they are generally bright people with high standards), and they find the thought of their dissertations becoming part of the scientific record very stressful. By telling themselves no one will read it, they can overcome their perfectionism and actually submit the document once it is done.

Writing a thesis is hard work, so I don't begrudge anyone the psychological tricks they might need to actually finish. But I think anyone who has finished writing a dissertation should feel proud of their work, and accept that many people in their field will, in fact, read it. (And if you are a researcher not yet in the habit of reading theses, let me suggest that you start -- it is a superpower for understanding things!)

Friday, July 21, 2023

Linear-time parser combinators

My birthday just passed, and to relax I wrote a parser combinator library.

Over the last few years, I have worked quite a bit with Ningning Xie and Jeremy Yallop on parser combinators, which has led to a family of parser combinators which have optimal linear-time performance in theory, and which are many times faster than lex+yacc in practice.

But these use advanced multistage programming techniques, and I wondered how well we can do without staging. So in this blog post I will describe the implementation of a non-staged, but still efficient, parser combinator library in plain OCaml. This will be a lot slower than flap, or even menhir/ocamlyacc, but will still have pretty good linear-time performance.

The basic trick we developed is to check that the grammar that a user defines with our parser combinator library falls in (morally) the class of LL(1) grammars. The LL(1) grammars have the property that they can be parsed with a non-backtracking recursive descent algorithm, in linear time with only a single token of lookahead.

So how do we check this property? Well, first, let's think about how we can form grammars from smaller ones. In a BNF grammar, we define a set of production rules. For example, here is a grammar of arithmetic expressions:

E ::= num 
   | '(' E ')' 
   | E '+' E 
   | E '*' E 
   | E '-' E 

Here, we can see that there are two main forms of building grammatical expressions.

  1. We can build grammatical expressions by concatenation -- the clause E '+' E means that an expression can be a subexpression E, followed by the plus sign +, finally followed by another subexpression E.

  2. We can build grammatical expressions by alternation -- the clauses separated by a vertical bar | are alternatives, and the whole string matches if it matches any of the alternative expressions. So an expression can be either a number num, a parenthesized expression ( E ), or a binary arithmetic operator.

Both of these forms of composition can introduce non-determinism. For example, if you see the expression 3 + 4, a recursive descent parser cannot tell that it should take the addition branch (E '+' E) rather than the multiplication branch (E '*' E) without doing more than one token of lookahead (the first token is a 3, which could be followed by anything).

Sequencing can also introduce nondeterminism. For example, the expression 3 + 4 + 5 can be parsed as either Add(3, Add(4, 5)) or as Add(Add(3, 4), 5) -- this grammar does not tell us whether we should stop parsing the first subexpression at 3, or continue to 3 + 4.

So this grammar is not suitable for being parsed by recursive descent. To identify the grammars that can be so parsed, we need to do some static analysis of the grammar -- in fact, a type system. The components of the static analysis we use are intended to handle the two kinds of nondeterminism above.

To handle the nondeterminism introduced by alternation, we will compute first sets for our grammars. That is, we will work out which characters can occur as the first character of a string recognised by the grammar. So if our grammar recognises the set of strings {foo, bar, barton, foobar, quux}, then its first set will be {f, b, q} -- these are the first characters of the strings in the language.
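For a finite language given as a list of words, the first set can be computed directly from the definition. This brute-force helper is purely illustrative and not part of the library, which computes first sets compositionally:

```ocaml
(* The first set of a finite language: the first character of every
   nonempty word, deduplicated. *)
let first_set words =
  List.filter_map (fun w -> if w = "" then None else Some w.[0]) words
  |> List.sort_uniq compare

(* first_set ["foo"; "bar"; "barton"; "foobar"; "quux"] = ['b'; 'f'; 'q'] *)
```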

We can use the first sets to rule out alternation nondeterminism, by requiring that each alternative in a grammar rule have a disjoint first set.

Handling sequential nondeterminism is a bit trickier. To do this, we have to find pairs of words in the language in which the first is a prefix of the second (for example, foo and foobar). Then, we return the first character of each suffix (for foo and foobar, we return b). This set of characters is called the follow set of the language -- it is the set of characters which can follow a word in the language as part of a longer word. So in the language above, the follow set is {b, t}, due to foo/foobar and bar/barton. Because foo can be followed by the string bar to produce the word foobar, b (the first character of bar) is one of the characters in the follow set. Similarly, because bar can be followed by ton to produce the word barton, t (the first character of ton) is one of the characters in the follow set.
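This definition can also be made concrete with a naive enumeration over a finite language. Again, this helper is only for illustration; the library computes these sets compositionally rather than by enumerating words:

```ocaml
(* The follow(-last) set of a finite language, by enumerating all
   prefix pairs among its words. *)
let follow_last words =
  List.concat_map
    (fun w ->
       List.filter_map
         (fun w' ->
            let lw = String.length w and lw' = String.length w' in
            (* if w is a proper prefix of w', the first character of
               the leftover suffix can follow a word of the language *)
            if lw < lw' && String.sub w' 0 lw = w then Some w'.[lw] else None)
         words)
    words
  |> List.sort_uniq compare

(* follow_last ["foo"; "bar"; "barton"; "foobar"; "quux"] = ['b'; 't'] *)
```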

Now, when we concatenate two grammatical expressions A B, we check to see if the follow set of A and the first set of B are disjoint. This way, after parsing an A-word, it is unambiguous whether to keep parsing an A or switch to parsing a B because the first character will tell us to either continue parsing A (if it is in the follow set of A) or switch to B (if it is in the first set of B).

(ASIDE: Technically, this is not the FOLLOW set as defined in (say) the Dragon book -- that is a property of grammars, not languages. The definition I use here is a superior one (sometimes called the follow-last set), which was invented by Derick Wood and Anne Brüggemann-Klein in their 1992 paper Deterministic Regular Languages. What makes it better is that it is a property of languages, not grammars, and is hence much better-behaved mathematically.)

Finally, we also track whether the language recognises the empty string or not. If both A and B can match the empty string, then A | B is ambiguous – an empty string can match either one.

These three facts – the first set, the follow set, and nullability – are the static facts we will build for every parser combinator, and we will use them to (a) optimise the implementation of conforming combinators, and (b) reject combinators which don't satisfy the restrictions. We could also statically track whether the grammar is left-recursive or not – see my 2019 paper with Jeremy Yallop for how – but for this blog post I'll skip that.

Now, let's talk about the implementation. Since we want to calculate a first set and a follow set for each parser, we need a datatype to represent sets of characters. The Set module in the OCaml standard library is a perfectly serviceable balanced binary tree, but because we want to work with sets of characters, there is a much more compact and faster representation: a bit set.

Since an 8-bit character only has 256 possible states, we can represent a set of characters with 256 bits (32 bytes), with each bit set to 0 if the character is absent from the set, and set to 1 if it is present. This does presume we look at the input a byte at a time -- if we wanted to handle Unicode, likely the best approach would be to work with a canonicalised UTF-8 representation. I'll skip over that for now, since we can deal solely with ASCII for the moment.

This turns out to be pretty much the only optimisation needed to get decent performance -- the library below runs between one-half and two-thirds the speed of lex+menhir on a simple s-expression parsing benchmark.

Unlike many other parser combinator libraries, it does no backtracking (ie, runs in time linear in the input size). Furthermore, it also has a clean declarative reading of the language recognized -- ie, A | B and B | A recognise the same language, unlike in (eg) PEGs, or parser combinators with committed-choice operators. (If you are a parsing nerd, basically we are restricting to LL(1) grammars, modulo the difference between FOLLOW and FOLLOWLAST.)

Here is a link to a gist with the implementation. It's really short – only 300 lines of code for a full combinator library, with about another 150 lines to demonstrate and benchmark it.


Character Sets

Below, I give the signature and implementation of an OCaml bitset library.

Character Set Interface

module C : sig
  type t 
  val make : (char -> bool) -> t 
  val empty : t 
  val one : char -> t 
  val union : t -> t -> t 
  val inter : t -> t -> t 
  val top : t 
  val negate : t -> t 

  val mem : char -> t -> bool 
  val equal : t -> t -> bool
  val is_empty : t -> bool 
  val disjoint : t -> t -> bool 
  val fold : (char -> 'a -> 'a) -> 'a -> t -> 'a 
  val of_string : string -> t 
end = 

This interface has an abstract type t of bitsets; a function make, which converts a predicate on characters into a bitset; and a function one, which constructs a singleton set; along with the rest of the standard operations of boolean algebra (such as intersection, union, and complement). We also have a membership test mem, as well as operations to test whether two sets are equal (equal), empty (is_empty), or nonoverlapping (disjoint).

Character Set Implementation

The implementation of this is quite fun -- I don't have many chances to do bit-fiddling, so I cherish the ones that arise. I represent sets using OCaml's bytes type, which are used, reasonably enough, to represent consecutive sequences of bytes.

  type t = bytes 

  let make f = 
    let open Int in
    let open Char in 
    Bytes.init 32 (fun i -> 
        let b0 = shift_left (Bool.to_int (f (chr (i * 8 + 0)))) 0 in 
        let b1 = shift_left (Bool.to_int (f (chr (i * 8 + 1)))) 1 in 
        let b2 = shift_left (Bool.to_int (f (chr (i * 8 + 2)))) 2 in 
        let b3 = shift_left (Bool.to_int (f (chr (i * 8 + 3)))) 3 in 
        let b4 = shift_left (Bool.to_int (f (chr (i * 8 + 4)))) 4 in 
        let b5 = shift_left (Bool.to_int (f (chr (i * 8 + 5)))) 5 in 
        let b6 = shift_left (Bool.to_int (f (chr (i * 8 + 6)))) 6 in 
        let b7 = shift_left (Bool.to_int (f (chr (i * 8 + 7)))) 7 in 
        let (||) = logor in
        Char.chr (b7 || b6 || b5 || b4 || b3 || b2 || b1 || b0))

The constructor make takes a predicate f : char -> bool, and then builds a bitset which contains exactly those elements which make f true. To do this, we build a 32-byte array (since 8 * 32 = 256). The Bytes.init function initialises a byte at a time, so the initialiser has to make 8 calls to f (at indices i*8 + 0 to i*8 + 7) to figure out what each bit of the byte is. Then, we convert each boolean to an integer, and do a left shift to create an integer where the correct position is set to 1, and all the other bits are 0. Finally we take all these integers and OR them together, to get the byte that we want.

  let mem c s = 
    let b = (Char.code c) / 8 in
    let i = (Char.code c) mod 8 in
    let w = Char.code (Bytes.get s b) in 
    Int.logand (Int.shift_left 1 i) w > 0 

The mem function also does a bit of bit-fiddling. We look up the correct byte and bit-offset corresponding to the input character, and check to see if that bit is set in the corresponding byte of the bitset.

The rest of the operations can be defined in terms of make and mem -- they are the only ones which need any bit-twiddling.

  let empty = make (fun c -> false)
  let top = make (fun c -> true)
  let one c = make (fun c' -> c = c')
  let union s1 s2 = make (fun c -> mem c s1 || mem c s2)
  let inter s1 s2 = make (fun c -> mem c s1 && mem c s2)
  let negate s = make (fun c -> not (mem c s))

  let equal s1 s2 =
    let rec loop i acc =
      if i = 32 then acc
      else loop (i+1) (acc && (Bytes.get s1 i = Bytes.get s2 i))
    in
    loop 0 true

  let is_empty s = equal s empty

  let disjoint s1 s2 = is_empty (inter s1 s2)

We do some trivial bit-twiddling anyway with the equal function: rather than comparing character by character, we just check to see if the two bitsets are equal with a byte-by-byte comparison.

Emptiness checking is_empty and disjointness checking disjoint can now be defined in terms of the Boolean operations and equality. We can also define a fold function for looping over a bitset, and we also have a of_string function to build sets from strings.

  let fold f init s =
    let rec loop i acc =
      if i > 255 then
        acc
      else
        let c = Char.chr i in
        if mem c s then
          loop (i+1) (f c acc)
        else
          loop (i+1) acc
    in
    loop 0 init

  let of_string str = 
    let p c = String.contains str c in 
    make p 

Grammar Types

Grammar Type Interface

This is enough to let us implement the Tp module for parser types.

module Tp : sig
    type t = { null : bool; first : C.t; follow : C.t; }

    exception TypeError of string
    val char : char -> t
    val eps : t
    val seq : t -> t -> t
    val charset : C.t -> t
    val string : string -> t
    val alt : t -> t -> t
    val bot : t
    val equal : t -> t -> bool
    val fix : (t -> t) -> t
    val print : Format.formatter -> t -> unit
end = 

A type is a triple of a nullability predicate, a first set, and a follow set. This API has a bunch of operators for building semantic types:

  • char c yields the semantic type for the singleton language containing the character c,
  • eps is the type of the language containing only the empty string,
  • seq t1 t2 takes two types t1 and t2, and returns the type corresponding to the sequencing of two languages with those two types
  • bot is the type of the empty language,
  • alt t1 t2 takes two types t1 and t2, and returns the type corresponding to the union of the languages typed by t1 and t2
  • fix f takes a function f : t -> t, and returns the smallest type t such that f(t) = t.
  • charset cs returns the type of the language recognising exactly the single-character strings in cs.

Some of these combinations are impossible due to our ambiguity rules. For example, we don't want to admit the union of two languages both containing the empty string, because then it would be ambiguous whether to take the left or right branch.

So we also have a TypeError exception, which we use to signal when we are combining two incompatible types.

Grammar Type Implementation

  type t = {
      null : bool;
      first  : C.t;
      follow : C.t;
    }

  exception TypeError of string 

  let char c = {
      null = false;
      first = C.one c;
      follow = C.empty;
    }

char c returns a record, with false nullability (since the language {"c"} has no empty strings), a first set of {c} (since the first character of the only string in the language is c), and an empty follow set (since there are no nonempty strings you can append to "c" to get a string in {"c"}).

  let eps = {
      null = true;
      first = C.empty;
      follow = C.empty;
    }

eps returns the type of the language containing only the empty string {""}. It is nullable, since it obviously contains the empty string, but the first and follow sets are empty since it has no strings with any characters in them.

  let seq t1 t2 =
    let separate t1 t2 =
      not t1.null
      && C.disjoint t1.follow t2.first
    in
    if separate t1 t2 then
      { null = false;
        first = t1.first;
        follow = C.union t2.follow (if t2.null
                                    then (C.union t2.first t1.follow)
                                    else C.empty);
      }
    else
      raise (TypeError "ambiguous sequencing")

Sequencing seq t1 t2 is interesting, because it is the first operator we have seen which can fail. We want to allow parsing the sequencing of two languages L and M (with types t1 and t2) using a greedy strategy -- we want to match as much of L as we can, and then switch to looking for M strings. A way to test if this is ok is to check if (a) L does not contain the empty string, and (b) L's follow set is disjoint from M's first set. Because L has no empty strings, we can immediately start parsing for L, and because the first and follow sets are disjoint, we can always choose to keep parsing L as long as there are valid continuation characters for it.

This is what the separate function tests for.

The resulting composite type has false nullability, because every string in the concatenated language has one L-part and one M-part, and the L-part is always nonempty. The first set is the same as t1's first set, since every string in the concatenated language starts with an L-string.

The follow set is a little tricky. A word w from the concatenation of L and M always consists of an L-part l, and an M-part m, so that w = append(l, m). Obviously any extension of m to a longer m' is still a word of L ; M, so the combined follow set has to at least contain the follow set of M -- i.e., t2.follow.

However, if M contains the empty string, then L is a subset of L;M, and so all the follow characters of L -- t1.follow -- also have to be included, as well as the first characters of M.

  let alt t1 t2 =
    let nonoverlapping t1 t2 =
      not (t1.null && t2.null)
      && C.disjoint t1.first t2.first
    in
    if nonoverlapping t1 t2 then
      { null = t1.null || t2.null;
        first = C.union t1.first t2.first;
        follow = C.union t1.follow t2.follow;
      }
    else
      raise (TypeError "ambiguous alternation")

  let bot = {
      null = false;
      first = C.empty;
      follow = C.empty;
    }

If t1 and t2 are types for languages L and M, then alt t1 t2 is the type of the union language L ∪ M. Again, typing can fail if the types might correspond to overlapping languages. We rule this out with the nonoverlapping predicate – it checks that at most one language can contain the empty string (i.e., at most one of t1.null and t2.null hold), and that t1.first and t2.first are disjoint – if the first sets are disjoint, then a word in L must start with different character than any character in M. If this condition is satisfied, then the resulting type just takes the union of the first and follow sets, and the OR of the two nullability predicates.

We also define the bot type, for the empty language {}. Since it has no strings in it, the bot.null field is false, and the first and follow sets are empty.

  let charset cs =
    if C.is_empty cs then
      bot
    else
      { null = false;
        first = cs;
        follow = C.empty;
      }

  let string s =
    if String.length s = 0 then
      eps
    else
      char s.[0]

We can use this to define some convenience functions. charset cs returns the type of the languages containing one-character strings for each character in cs. If the set is empty, we return bot, and otherwise we return a type where the first set is cs, the follow set is empty, and the nullability predicate is false.

string s returns a type for the language {s} containing the single string s. If s is the empty string, we return eps, and otherwise we can return char s.[0], since the first set is the first character of s, the language is not nullable, and the follow set is empty since there are no possible continuations to s.

  let equal t1 t2 = 
    t1.null = t2.null
    && C.equal t1.first t2.first 
    && C.equal t1.follow t2.follow 

  let fix f =
    let rec loop t =
      let t' = f t in
      if equal t t' then
        t
      else
        loop t'
    in
    loop bot

Now, we can define the type equality function equal, and use it to define the fixed point function fix f, which returns the least fixed point of the function f. It works in the usual way – we start with the smallest type bot, and then repeatedly apply f to it until it stops changing. If f is monotone, then this loop has to terminate, because on each iteration, we must have added at least one character to follow or first, or changed null from false to true. As a result, there can be at most 513 growing iterations (= 256 + 256 + 1) before the loop cannot increase any more.
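The iteration strategy described here is ordinary Kleene iteration, which can be sketched generically. The name kleene_fix and the generic signature are my own; the library's fix is specialised to the Tp.t type:

```ocaml
(* Kleene iteration: apply f repeatedly starting from a bottom
   element, until the result stops changing. Terminates whenever f is
   monotone over a domain with no infinite ascending chains. *)
let rec kleene_fix equal f t =
  let t' = f t in
  if equal t t' then t else kleene_fix equal f t'

(* e.g. kleene_fix (=) (fun x -> min 10 (x + 1)) 0 = 10 *)
```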

  let print out t =
    let p fmt = Format.fprintf out fmt in
    let print_set cs =
      C.fold (fun c () -> p "%c" c) () cs
    in
    let print_bool = function
      | true -> p "true"
      | false -> p "false"
    in
    p "{\n";
    p "   null = "; print_bool t.null; p ";\n";
    p "   first = C.of_string \""; print_set t.first; p "\";\n";
    p "   follow = C.of_string \""; print_set t.follow; p "\";\n";
    p "}\n"

We can also define a print function for debugging purposes.


Parser Interface

module Parser: sig
  type 'a t 
  exception ParseFailure of int

  val (let+) : 'a t -> ('a -> 'b) -> 'b t 
  val map : ('a -> 'b) -> 'a t -> 'b t 

  val unit : unit t 
  val return : 'a -> 'a t 

  val seq : 'a t -> 'b t -> ('a * 'b) t 
  val (and+) : 'a t -> 'b t -> ('a * 'b) t 

  val char : char -> unit t 
  val charset : C.t -> char t 
  val string : string -> unit t 

  val fail : 'a t 
  val (||) : 'a t -> 'a t -> 'a t 

  val any : 'a t list -> 'a t 
  val fix : ('a t -> 'a t) -> 'a t 

  val parse : 'a t -> string -> int -> int * 'a 
end = 

Above, we give the signature to the Parser module, which actually constructs the parsers. There is an abstract type 'a t, which is the type of parsers which return a value of type 'a, and an exception ParseFailure, which returns the location at which parsing failed.

The parser type 'a t is a functor, which is why it supports a map operation.

The applicative (or Cartesian monoidal, if you are a semanticist) structure is exhibited by the unit and seq operations. unit is the parser that recognizes the empty string and immediately returns unit, and seq p1 p2 will try to run p1, then run p2, and return their paired return values.

map has the synonym let+, and seq has a synonym (and+), which together let us use OCaml's equivalent of Haskell do-notation for applicatives. So we can write a parser like:

   let+ x = p1
   and+ y = p2
   and+ z = p3
   in x + y + z

This will run p1, then p2, then p3, and finally return the sum of the three parser values.
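To see how the let+/and+ desugaring works, here is the same pattern instantiated for the option applicative rather than the parser type, so the example is self-contained (the option instance is just for illustration):

```ocaml
(* Binding operators for the option applicative: let+ is map, and
   and+ pairs two optional values. *)
let ( let+ ) o f = o f
let ( and+ ) o1 o2 =
  match o1, o2 with
  | Some a, Some b -> Some (a, b)
  | _ -> None

(* The body runs only if all three arguments are Some. *)
let sum3 p1 p2 p3 =
  let+ x = p1
  and+ y = p2
  and+ z = p3 in
  x + y + z

(* sum3 (Some 1) (Some 2) (Some 3) = Some 6 *)
```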

The char c, charset cs, and string s parsers build parsers for recognizing individual characters, sets of characters, or single strings. charset and string can be defined from char and the other combinators, but it is convenient to have them as part of the API.

fail is the parser that always fails, never recognising anything, and (||) is the binary alternation operator – p1 || p2 recognises the union of the languages p1 and p2 do. any ps is the n-ary version – it recognises any string that any of the parsers in the list ps do.

Finally, fix f is used to define grammars recursively. If f : 'a t -> 'a t, then fix f recognises the least fixed point of f.

All of these operations are allowed to signal a TypeError if a client tries to combine them in a way which would lead to ambiguity.

Parser Implementation

Below, I give the implementation of the Parser module.

  type 'a t = { tp : Tp.t; parse : string -> int -> int * 'a }
  exception ParseFailure of int

The implementation type of a parser is actually a pair, consisting of a parser type (as defined in the Tp module), and a parsing function of type string -> int -> (int * 'a). So p.parse s i will take a string s, and begin parsing at i. If it succeeds, it will return (j,v), where j is where it finished parsing, and v is the return value of the parser. If the run of the parser failed, then it will raise the ParseFailure exception.

  let char c =
    let p s i =
      if i < String.length s && s.[i] = c then
        (i+1, ())
      else
        raise (ParseFailure i)
    in
    { tp = Tp.char c; parse = p }

The char c parser is one of the simplest. It constructs its type with a call to the Tp.char function, and constructs a parse function which checks if the next character is c, and returns unit on success and signals failure if there is no next character or it isn't c.

  let (let+) p f =
    let p' s i =
      let (i, v) = p.parse s i in
      (i, f v)
    in
    { tp =; parse = p' }

  let map f p = 
    let+ x = p in f x 

As we said before, the parser type 'a t is functorial -- it supports a map operation. All map f p has to do is to write a new parse function which runs p.parse, and then on success wraps the return value with a call to f.

We also define a synonym (let+) to enable the applicative syntax.

  let unit = { tp = Tp.eps; parse = fun s i -> (i, ()) }

  let return x = 
    let+ () = unit in x 

  let (and+) p1 p2 =
    let p' s i =
      let (i, a) = p1.parse s i in
      let (i, b) = p2.parse s i in
      (i, (a,b))
    in
    { tp = Tp.seq; parse = p' }

  let seq = (and+)

unit is the simplest parser – it always succeeds without moving the position of the index into the string (ie, it recognises the empty string) and returns unit. return x is almost as simple: it also immediately succeeds, but returns the value x.

The seq (aka and+) operator sequences two parsers p1 and p2. Its parse function runs p1.parse, yielding an updated index and a value a, and then p2.parse is run starting from the new index, yielding the value b. Then, the final index is returned, alongside the pair (a,b).

Note that this function never backtracks: by construction, all of our parsers will turn out to be greedy, but this is ok since the follow set of p1 and the first set of p2 are known to be disjoint – there is never any risk that p1 will consume characters that are actually part of the start of p2. So we do not have to (as PCRE-style regexps must) do any backtracking.

The returned type simply uses Tp.seq on the two argument parsers' types.

  let charset cs =
    let p s i =
      if i < String.length s && C.mem s.[i] cs then
        (i+1, s.[i])
      else
        raise (ParseFailure i)
    in
    { tp = Tp.charset cs; parse = p }

  let string str =
    let p s i =
      if i + String.length str <= String.length s then
        let rec loop j =
          if j < String.length str then
            if str.[j] = s.[i + j] then
              loop (j+1)
            else
              raise (ParseFailure (i+j))
          else
            (i+j, ())
        in
        loop 0
      else
        raise (ParseFailure i)
    in
    { tp = Tp.string str; parse = p }

charset cs and string s are just convenience parsers, which check whether the first character of the input is in the set cs, or whether the input at the current position begins with the string s, respectively.

  let fail = 
    { tp =; 
      parse = fun s i -> raise (ParseFailure i) }

The fail parser always fails, no matter what the input is. This corresponds to recognising the empty language.

The alternation operator p1 || p2 below is the most complex parser definition.

  let (||) p1 p2 =
    let p' s i =
      if i < String.length s then
        if C.mem s.[i] then
          p1.parse s i
        else if C.mem s.[i] then
          p2.parse s i
        else if then
          p1.parse s i
        else if then
          p2.parse s i
        else
          raise (ParseFailure i)
      else if then
        p1.parse s i
      else if then
        p2.parse s i
      else
        raise (ParseFailure i)
    in
    { tp = Tp.alt; parse = p' }

  let any ps = List.fold_left (||) fail ps 

Alternation is the only parser combinator that uses the information from the types to decide how to proceed. The reason is that we do not want our parsers to backtrack – if they did, performance would switch from worst-case linear to worst-case exponential. (This is the reason that classic parser combinators have such dreadful performance.)

However, this means that when we see an alternation, we have to decide whether to parse with p1 or p2. In traditional LL(k) style, we do this by looking at the first character of the input. If it is in p1's first set, we parse with p1, and if it is in p2's first set, we parse with p2.

This is easy – except for the corner cases, of which there are two main ones. The first corner case is what we do when the input string is exhausted (i.e., if the position we are to start looking at is at the end of the string). In this case, the only string that could match is the empty string, so we check to see whether either p1 or p2 is nullable, and then we invoke the nullable parser (if any).

The other corner case is when there is still input left, but its first character doesn't match the first set of either parser. In this case, again the only possible matching string is the empty string, and so we check nullability again.

If none of these conditions match, then the string can't be parsed and we signal an error.

Once we have alternation, we can use it and fail to define the any combinator by folding over the list of parsers with ||.

  let fix f = 
    let g t = (f {fail with tp = t}).tp in
    let r = ref (fail.parse) in 
    let p = f {tp = (Tp.fix g); parse = fun s i -> !r s i} in 
    r := p.parse;
    p

  let parse p s i = p.parse s i

Finally, the fixed point fix f is a little annoying to define, because f : 'a t -> 'a t, and this means that it takes a pair of a type and a parse function, and returns a pair of a type and a parse function. The type-level fixed point Tp.fix wants an argument of type Tp.t -> Tp.t, and the g function in the body turns f into this kind of function using a dummy parser argument to f.

To actually implement the parsing function, we turn to a trick called Landin's knot – we manually implement recursion by backpatching. First, we define r, a pointer to a failing parse function. Then, we invoke f with a parser which takes the type of the fixed point (Tp.fix g), and a parse function which just invokes the function stored in r. This yields a new parser, which we assign to r to tie the knot on the recursion.

The reason we do this, instead of defining the parse function q as

    let rec q s i =  (f {tp = Tp.fix g; parse = q}).parse s i 

is that this definition re-invokes f on each recursive step, whereas the imperative definition only invokes f once.
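Landin's knot is easy to demonstrate in isolation, outside the parser library. Here is a minimal, self-contained sketch (the names fix, self, and fact are illustrative, not part of the library):

```ocaml
(* Recursion by backpatching: no [let rec] on the resulting function. *)
let fix f =
  (* A reference cell holding a placeholder that fails if called too early. *)
  let r = ref (fun _ -> failwith "knot not yet tied") in
  (* Recursive calls are indirected through the reference... *)
  let self n = !r n in
  let g = f self in
  (* ...so assigning [g] to [r] ties the knot. Note [f] runs exactly once. *)
  r := g;
  g

(* Factorial, written without [let rec]: *)
let fact = fix (fun self n -> if n = 0 then 1 else n * self (n - 1))
```

Calling fact 5 returns 120; every recursive call goes through the reference cell, just as the parser's recursive calls go through r.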

That's the whole implementation of the parser library! It's just over 300 lines of code.

A Sample Parser

Below, I give the implementation of a simple s-expression parser, which recognises parenthesized sequences of words, like (foo bar (baz (quux) ())).

module Test = struct
  open Parser

  let letter = charset (C.of_string "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

  let digit = charset (C.of_string "0123456789")
  let whitespace = charset (C.of_string " \t\n")

We can use our parsers to define basic parsers that recognise single letters, digits, or whitespace.

  (* val (==>) : 'a t -> ('a -> 'b) -> 'b t *)
  let (==>) p f = let+ x = p in f x 

  (* val (>>) : 'a t -> 'b t -> 'a t *)
  let (>>) p1 p2 = 
    let+ x = p1 
    and+ _ = p2 in
    x

Next, I define a couple of utility combinators. ==> is an infix spelling of map, whose utility will become clear shortly.

p1 >> p2 is like seq, parsing with p1 and then p2, except that in this case the value returned by p2 is discarded. This is useful for doing things like discarding trailing whitespace.

  (* val star : 'a t -> 'a list t *)
  let star p = 
    fix (fun r -> 
           any [ 
               unit      ==> (fun () -> []); 
               (seq p r) ==> (fun (x, xs) -> x :: xs)
             ])

  (* val starskip : 'a t -> unit t *)
  let starskip p = 
    fix (fun r -> 
          any [ unit      ==> (fun () -> ());
                (seq p r) ==> (fun _ -> ()) ])

The star p combinator parses with p as many times as it can, and then returns a list of all the p-values that it was able to parse. starskip p is similar, except that it ignores its return value, always returning unit ().

  (* val symbol : string t *)
  let symbol = 
    let+ c = letter
    and+ cs = star letter 
    and+ _ = starskip whitespace in
    let b = Buffer.create 0 in 
    List.iter (Buffer.add_char b) (c :: cs);
    Buffer.contents b

  (* val paren : 'a t -> 'a t *)
  let paren p = 
    let+ () = char '(' >> starskip whitespace 
    and+ x = p 
    and+ () = char ')' >> starskip whitespace in
    x

The symbol parser parses a nonzero list of letters, followed by any amount of whitespace, and converts the list of characters into a string. This does allocate a lot more than it should, with one allocation per character – but for a blog post it's fine.

The paren p combinator call wraps the language recognised by p in parentheses. We also follow the calls to char '(' and char ')' with a starskip whitespace. This is one of the annoying things about writing scannerless parsers – you have to handle whitespace manually.

The discipline I use is to follow each logical token with trailing whitespace, but to never precede logical tokens with whitespace. This ensures that whitespace is never in the first set of a grammatical production, which makes sequencing work out reasonably well. It's still a PITA relative to explicitly tokenized lexer/parser pairs, though.

Finally, we can define the sexp type and parser.

  type sexp = 
    | Sym of string
    | Seq of sexp list 

  (* val sexp : sexp t *)
  let sexp = 
    fix (fun r -> 
          any [ 
              symbol         ==> (fun s -> Sym s);
              paren (star r) ==> (fun xs -> Seq xs)
            ])

An s-expression is either a symbol Sym s, or a list of s-expressions Seq ss. The parser for this is literally this definition – it is either symbol (which returns Sym s), or it is a parenthesized list of s-expressions (derived using star on the recursive combinator r).

This also demonstrates why I like the infix map ==> – it looks quite a bit like clausal definitions in pattern matching, or the parser actions in Menhir or Yacc.


So how fast does it run? Well, to test this out I generated some large s-expressions and parsed them. This was all benchmarked on an Intel i7-6700 running at 3.4 GHz, running Ubuntu 22.04, with the OCaml code all compiled to native code with ocamlopt.

When I run the parser combinator library on a 4.5 megabyte string, I get the following performance numbers.

  • String length: 4498771 bytes
  • Parser elapsed time: 0.736 sec
  • Parsing rate: 6.12e+06 bytes/sec

On a string 10x bigger, the performance of the combinators is slightly faster than a linear speedup would predict.

  • String length: 45007848 bytes
  • Parser elapsed time: 5.571 sec
  • Parsing rate: 8.08e+06 bytes/sec

I don't know why the 45 megabyte string parses faster than the 4.5 megabyte one. There shouldn't be any fixed latencies of any significance, so I can only chalk it up to garbage collection weirdness.

I also wrote a small ocamllex/Menhir parser, and benchmarked it. The Menhir parser showed similar tendencies, though its absolute speed is higher. For the 4.5 megabyte s-expression, we get:

  • String length: 4498771 bytes
  • Parser elapsed time: 0.492 sec
  • Parsing rate: 9.14e+06 bytes/sec

For the 45 megabyte string, we get:

  • String length: 45007848 bytes
  • Parser elapsed time: 2.989 sec
  • Parsing rate: 1.51e+07 bytes/sec

Again, parsing a longer string shows a more-than-linear speedup. Since LR(1) parsing, like LL(1), is a linear time algorithm, this indicates some stuff going on in the runtime.

Comparing like-for-like, this little parser combinator library is 2/3 the speed of Menhir for the 4.5 megabyte file, and slightly more than 1/2 the speed on the 45 megabyte file.

This is much better performance than I expected! Because parser combinators are essentially big nested closures, I expected there to be a heavy performance penalty because we basically have to do an indirect jump for every character in the input.

This should be a recipe for Python-esque slowness, but I guess the processor architects at Intel really know how to design indirect branch predictors! I checked Agner Fog's optimisation guide, but apparently the design of the branch predictor in Skylake is undocumented. All Fog can say is that "Indirect jumps and indirect calls are predicted well."

The net result of this is that I have to update my beliefs: before my birthday, I thought that parser combinators were basically a bad idea. But now I think they can be a very reasonable parsing technique, even without doing fun stuff like using staging to achieve parser/lexer fusion!

Tuesday, November 15, 2022

CN: Verifying Systems C Code with Separation-Logic Refinement Types

We have a new paper on combining separation logic and refinement types to verify C code, appearing at POPL 2023 in a couple of months. It's called CN: Verifying Systems C Code with Separation-Logic Refinement Types, and it's by Christopher Pulte, Dhruv C. Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and me.

Despite significant progress in the verification of hypervisors, operating systems, and compilers, and in verification tooling, there exists a wide gap between the approaches used in verification projects and conventional development of systems software. We see two main challenges in bringing these closer together: verification handling the complexity of code and semantics of conventional systems software, and verification usability.

We describe an experiment in verification tool design aimed at addressing some aspects of both: we design and implement CN, a separation-logic refinement type system for C systems software, aimed at predictable proof automation, based on a realistic semantics of ISO C. CN reduces refinement typing to decidable propositional logic reasoning, uses first-class resources to support pointer aliasing and pointer arithmetic, features resource inference for iterated separating conjunction, and uses a novel syntactic restriction of ghost variables in specifications to guarantee their successful inference. We implement CN and formalise key aspects of the type system, including a soundness proof of type checking. To demonstrate the usability of CN we use it to verify a substantial component of Google’s pKVM hypervisor for Android.

There's some cool separation logic trickery that might become a blogpost, and of course the CN tool is open source, and is publically available as a backend of the Cerberus executable C semantics. Naturally I (and most of my coauthors) will be in Boston this January for POPL, and the talk about CN will be public then. If you can't wait, earlier this summer I gave a talk about CN which was recorded and on Youtube.

Friday, November 4, 2022

Two Papers about Refinement Types

The rule of thumb I use is that Noam Zeilberger is generally five to ten years ahead of me in identifying interesting problems. A decade ago he was working intensively on the semantics of refinement types, and lo and behold, in the last couple of years so have I. So I'd like to tell you about two draft papers developing both the theory of refinement types, and how to apply them to verification problems!

  • Focusing on Liquid Refinement Typing, Dimitrios J. Economou, Neel Krishnaswami, Jana Dunfield.

    We present a foundation systematizing, in a way that works for any evaluation order, the variety of mechanisms for SMT constraint generation found in index refinement and liquid type systems. Using call-by-push-value, we design a polarized subtyping relation allowing us to prove that our logically focused typing algorithm is sound, complete, and decidable, even in cases seemingly likely to produce constraints with existential variables. We prove type soundness with respect to an elementary domain-theoretic denotational semantics. Soundness implies, relatively simply, our system’s totality and logical consistency.

    In this paper, my coauthors and I decided to take a look at Liquid Types, which is one of the most successful offshoots of the refinement types family tree. Two things make it particularly powerful.

    First, Liquid Types are engineered from the ground up for decidable typechecking. This ensures you never have to pray that a fragile pile of heuristics will be good enough to typecheck your program. Second, they extend the DML/Stardust style of refinement types with the idea of “measure functions”, which let you compute SMT predicates by recursion on the structure of your program data. This is a really potent extension which brings all kinds of deep functional correctness properties into reach for automated verification.

    So we decided to build a small refinement type system on top of a call-by-push-value calculus that exemplifies these two properties. We then give an algorithmic type system (proved sound and complete with respect to the declarative spec), and then give a denotational semantics to our calculus, which shows that our refinement type system is consistent.

  • Explicit Refinement Types, Jad Ghalayini and Neel Krishnaswami.

    We present 𝜆ert, a type theory supporting refinement types with explicit proofs. Instead of solving refinement constraints with an SMT solver like DML and Liquid Haskell, our system requires and permits programmers to embed proofs of properties within the program text, letting us support a rich logic of properties including quantifiers and induction. We show that the type system is sound by showing that every refined program erases to a simply-typed program, and by means of a denotational semantics, we show that every erased program has all of the properties demanded by its refined type. All of our proofs are formalised in Lean 4.

    In this paper, we take the opposite tack from the previous paper, in which we carefully restricted the system of refinements to ensure that an SMT solver (like Z3) could always discharge the constraints. SMT solvers, while powerful, struggle with logical features (such as quantifiers and induction) that are outside the core decidable fragment. (In fact, they basically have to struggle, because full first-order logic is undecidable!)

    So in this paper, Jad and I designed a refinement type system where there are no solvers at all – all logical obligations are discharged with explicit proofs. As a result, our language of refinements easily extends to quantified formulas (eg, we can express that a function is monotonic very easily).

One interesting thing is that both of these papers use very low-tech denotational semantics. The first paper uses elementary domain theory, and the second is even simpler – it uses plain old sets and functions. This was partly done to maximise the accessibility of our results – it’s sort of folklore that refinement types have to have complicated semantics, and I hoped we could push back against that belief.

However, Paul-André Melliès and Noam Zeilberger have a beautiful paper, Functors are type refinement systems, in which they give a uniform, structured treatment of refinement types in terms of fibrations. Our semantics in the two papers above can really be seen as what you get if you inlined all the abstractions in Noam and Paul-André’s paper. This makes our results elementary, but at the price of generality. Particularly if we want to pursue mechanised proofs further, then it would be a good idea to work explicitly with the fibrational interface, because that could let us change the semantics without having to go back and re-prove all of the properties relating syntax to semantics.

Tuesday, September 13, 2022

The Golden Age of PL Research

I was chatting with a PhD student a while back, who remarked to me that many senior PL researchers seemed stuck in a defensive crouch. I thought that was quite a striking observation, because (a) he was not wrong, but (b) people his age don't have or need that posture, because we are currently in a golden age for PL research.

What I mean by that is that there was a period where the research tradition in programming languages nearly died outright in the US, and a lot of our more senior researchers remember this very keenly. Basically, AI winter and cheap microcomputers killed the Lisp machine, and as collateral damage managed to nearly wipe out PL research in America -- many PL researchers in the US ended up moving overseas or changing their research focus.

However, over the course of about a decade (from the mid 1990s to the mid 2000s), the field made the transition from being on life support to being very vigorous. I started grad school towards the end of this transition period, and so I ended up with a first-hand view of the change.

Here are some of the factors that triggered the transition, both practical and theoretical. (Note that the thing about a first-hand view is that it is not a global view: I am sure other people will have different opinions about what happened!)

  1. Syntactic Type Soundness Before Wright and Felleisen's 1994 paper A Syntactic Approach to Type Soundness appeared, proving type safety was surprisingly hard.

    E.g., if you look at Milner's A Theory of Type Polymorphism in Programming, you will see that what he does to prove type safety is:

    1. Solve a domain equation for a universal domain containing all the features of mini-ML
    2. Give a denotational interpretation of untyped mini-ML
    3. Define a logical relation as a family of retracts of the universal domain which exclude "wrong"
    4. Prove the fundamental theorem of logical relations.

    This is a lot more work than the Wright-Felleisen approach, in which you just write down the typing rules, the reduction relation, and prove a handful of lemmas about it.

    The technical simplicity of the Wright-Felleisen approach meant that it was very easy to apply PL design techniques to nearly arbitrary domains, and that made it possible to quickly build formal models of industrially interesting languages. Igarashi et al's Featherweight Java is probably one of the earliest examples of this, and more recently WebAssembly has demonstrated that it is possible to build the formal spec in sync with the language design effort.

  2. Java Java was released in 1996, and its immediate success had a critical impact in popularising garbage collection. Before Java, using languages with garbage collection for application programming was an extremely controversial idea among working programmers. Guy Steele (who co-authored the Java spec) remarked about this on the ll1-discuss mailing list:

    We were not out to win over the Lisp programmers; we were after the C++ programmers. We managed to drag a lot of them about halfway to Lisp.

    Java also ended up dragging programmers halfway to ML! In 1998, Gilad Bracha, Martin Odersky, David Stoutamire and Phil Wadler proposed adding parametric polymorphism to Java, and in 2004 after 5 years of arguments it was added in Java 5. This also made a big impact in popularising advanced type systems features.

  3. Dynamic Languages Languages like Perl, Python, Ruby, and Javascript (in roughly that order) began achieving serious popularity in the late 90s/early 2000s, primarily as tools for building websites. Like Java, all of these languages had garbage collection (even if only reference counting), which helped normalise it.

    But in addition to automatic memory management, these languages all supported first-class functions of one variety or another, and so gave literally millions of programmers their first introduction to functional programming.

    Younger programmers will be surprised by just how controversial an addition first-class functions were! Java omitted them initially, despite its designers being expert Lisp and Smalltalk programmers. Even C#, which was released 6 years after Java, did not support them at first. IMO it was only the experience people had with dynamic languages that created the pressure to add them.

    Nowadays, thanks to the influence of dynamic languages on the design zeitgeist, even languages whose design aesthetic is deliberately conservative, such as Go, include first-class functions without a second thought.

  4. SAT and SMT solvers CDCL-based SAT solvers were invented in the mid 1990s, and they revolutionised SAT solving. A few years later, by the early 2000s, the lazy approach to SMT was invented, which made it possible to reliably check whether quantifier-free formulas from all kinds of mixed domains were valid.

    This has proved to be an immense boon for PL research across a wild variety of different research areas, from verification of operating systems code like the Verve project, to the KLEE symbolic execution engine, to refinement type systems like Liquid Haskell.

  5. Separation logic Hoare logic, invented in 1969, was one of the first systematic approaches to proving the correctness of imperative programs. But despite serious efforts to use it in verification, such as the Euclid system, it never quite caught on.

    The major reason for this was that it was very difficult to reason about pointers and aliasing with Hoare logic. Every Hoare logic assertion talks about the whole state, and so you need to write down explicit disjointness assertions to make clear that modifying one pointer will not change the value of a different pointer. So if you have n pointers, then you need O(n^2) disjointness assertions.

    Separation logic, invented by John Reynolds and Peter O'Hearn, solved this problem, by formalising the intuition that an assertion could really only be talking about part of the heap. This made modular reasoning about imperative programs radically easier, and has been used in numerous verification efforts since.

    One minor fact about separation logic: John C. Reynolds invented separation logic when he was 65. At an age when most people start thinking about retirement, he was making yet another giant contribution to the whole field!
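    To make the contrast concrete, here is an illustrative pair of triples for a single heap update (my example, not drawn from the papers), writing x ↦ a for "x points to a". In Hoare logic the disjointness of x and y must be stated explicitly; in separation logic the separating conjunction ∗ builds it in:

```latex
% Hoare logic: aliasing must be excluded explicitly
\{\, x \mapsto a \land y \mapsto b \land x \neq y \,\}\ [x] := c\ \{\, x \mapsto c \land y \mapsto b \,\}

% Separation logic: the * already asserts that x and y are disjoint
\{\, x \mapsto a \ast y \mapsto b \,\}\ [x] := c\ \{\, x \mapsto c \ast y \mapsto b \,\}
```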

  6. Mature Proof Assistants Work on the Coq proof assistant began in 1984, but it went through multiple successive rounds of reimplementation as it influenced and was influenced by competing systems like LEGO, HOL and Isabelle. It probably achieved a recognizably modern form in 2002, when the 7.0 release was made.

    This was a milestone in terms of quality of implementation, since this is when (I think -- I wasn't there!) term normalisation started to happen via a bytecode VM. This made many things (like reflective proof) much faster, which made it feasible to build really large proofs in Coq. Indeed, Georges Gonthier reports that his 2005 mechanised proof of the 4-colour theorem began as an experiment to stress test the new fast VM!

    The result that made CS researchers outside of PL stand up and take notice was not the four color theorem, though. It was Xavier Leroy's correctness proof of a C compiler in the CompCert project, in 2006. People have been talking about program verification since the 1960s, and this was one of the first big programs that was actually verified -- and in a proof assistant, no less.

    Nowadays, you can go to a systems conference like SOSP and find whole sessions about verification, which would have been unthinkable when I was a student.

    Coq is not the only proof assistant, of course, and all its competitor systems (e.g., Isabelle) have seen substantial improvements in their maturity as well.

Some other important, but maybe less vital, factors:

  1. Step-indexing In 2001, McAllester and Appel showed how you could build a logical relations model using "step-indexing" as an alternative to the domain theoretic approach. This is a less general, but more elementary and flexible, approach to modelling the semantics of recursion than techniques like domain theory. The flexibility meant that languages with features like ML-style modules, machine code, and self-modifying code could all be given straightforward models.

    One small thing that makes me smile is that this is now the modern approach to proving type safety for fancy languages -- and that it is very similar to the approach the Wright/Felleisen approach displaced! I feel (i.e., believe but don't have a real argument for) that you need a big enough community or ecosystem to support a technically sophisticated research tradition, and now that PL is a bigger field again we can once more consider using advanced techniques.

  2. Flexible Type Analysis, Crary and Weirich, ICFP 1999 Compared to some of the other papers I've linked to, this paper is mostly forgotten -- according to Google Scholar, it has fewer than 10 citations in the last five years. However, IMO it really is a pivotal paper.

    What makes it so very important is that after writing this paper, Stephanie Weirich began collaborating with Simon Peyton-Jones. Stephanie, Simon, and their students and collaborators, began systematically extending GHC Haskell with the features described in this paper. And once they finished with that they carried on with many more extensions!

    This has been an immensely productive collaboration. It's why Haskell has grown so many "fancy" type system features, and in turn it has made it possible to actually try out many things in the literature in actual programming. The value of this can't be overstated: a huge number of PL researchers got their first introduction to what type theory can do from the advanced extensions to GHC.

  3. Model Checking This goes under "smaller factors" not because it's actually a smaller factor, but because I am very ignorant about the state of the art! It's long been important in hardware, but in software, the biggest success I know of is Microsoft's Static Driver Verifier, which is used to model check drivers before they are certified for Windows. This is obviously a big deal, but I can't really contextualise this because I don't know much about model checking. However, I am under the impression that SAT/SMT-based bounded model checking supplanted the older BDD-based approach, which would mean that model checking also benefited from the SAT/SMT revolution.

The net effect of all this is that a motivated, ambitious student can aim at problems which simultaneously have significant theoretical depth and substantial practical impact -- stuff that would have been field-shaking results 20 years ago are almost routine now.

For example, Maja Trela, one of my recently-graduated Part III (i.e., Master's) students here at Cambridge, just completed her dissertation under the primary supervision of Conrad Watt. In her dissertation, she used a separation logic library for Isabelle to verify the correctness of an efficient WebAssembly interpreter, which is now being used as the reference interpreter by the Bytecode Alliance!

Thursday, March 17, 2022

Fold Considered Annoying

I recently read Shriram Krishnamurthi and Kathi Fisler's ICER 2021 paper, Developing Behavioral Concepts of Higher-Order Functions. In this paper, they study not the theory of higher-order functions, but rather the pedagogy of higher-order functions: how do we teach students how to program with them?

In particular, they studied how students understood different higher-order functions -- e.g., map, filter, fold and co. -- by asking them to perform tasks like classifying which functions were similar, which example input/outputs could conceivably be produced by particular HOFs, and so on.

Since this is a blog, I can say that one thing I really liked about this paper is that it confirmed my prejudices! In particular, they found that fold was much harder for students to understand than any of the other higher-order functions they studied.

This is something I have long believed, both from anecdotal evidence and for mathematical reasons. I'll pass over the anecdotes (since the paper has actual data), and focus instead on what I see as the core mathematical issue: the specification of fold is dramatically more complex than the specification of map or filter.

Both map and filter have extremely simple equational specifications. For example, map is characterised by the following three laws:

map f []         = []
map f (xs ++ ys) = (map f xs) ++ (map f ys)
map f [x]        = [f x]

Basically, this says that map takes nils to nils, distributes over appends, and applies f to elements pointwise. All three of these equations are universally-quantified, unconditional equations, which is about as simple as a specification can get.
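These laws are easy to spot-check with the stock List.map from the OCaml standard library (the function and sample data here are mine, chosen for illustration):

```ocaml
(* Spot-checking the three map laws on concrete data. *)
let f x = x * x
let xs = [1; 2]
let ys = [3; 4; 5]

let () =
  assert (List.map f [] = []);                                     (* nils to nils *)
  assert (List.map f (xs @ ys) = List.map f xs @ List.map f ys);   (* distributes over append *)
  assert (List.map f [3] = [f 3])                                  (* applies f pointwise *)
```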

The specification for filter is only a little more complex:

filter p []         = []
filter p (xs ++ ys) = (filter p xs) ++ (filter p ys)
filter p [x]        = if p(x) then [x] else []

filter does nothing to nils, distributes over appends, and for each element either keeps it or discards it. Again, All three of these equations are universally-quantified, unconditional equations.
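The filter laws can be spot-checked the same way with the standard List.filter (again, the predicate and data are mine):

```ocaml
(* Spot-checking the three filter laws on concrete data. *)
let p x = x mod 2 = 0
let xs = [1; 2]
let ys = [3; 4; 6]

let () =
  assert (List.filter p [] = []);                                          (* nothing to nils *)
  assert (List.filter p (xs @ ys) = List.filter p xs @ List.filter p ys);  (* distributes over append *)
  assert (List.filter p [3] = (if p 3 then [3] else []));                  (* discards a non-match *)
  assert (List.filter p [4] = (if p 4 then [4] else []))                   (* keeps a match *)
```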

On the other hand, what is the specification for fold? That is given by its universal property, which reads:

 [h [] = nil ∧
  (∀x, xs. cons x (h xs) = h (x :: xs))]  // Uh-oh!
   ⟺
 (fold nil cons = h)

This is not an equational specification at all! It is a bi-implication, and while the right-to-left direction is basically the definition of fold, the left-to-right direction contains a formula with a logical quantifier in the premise. We know from nearly a century of experience with high school calculus that quantifier alternations are a giant obstacle to understanding: every year thousands of students find themselves utterly baffled by epsilon-delta arguments, and here we have something harder – we have an implication with a universally quantified premise.
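To see a concrete instance of the universal property, here is a small check using List.fold_right (with its arguments shuffled to match the notation above; the fold and sum definitions are mine):

```ocaml
(* fold nil cons, in the argument order used in the spec above. *)
let fold nil cons xs = List.fold_right cons xs nil

(* sum satisfies the two premises: sum [] = 0, and x + sum xs = sum (x :: xs).
   So the universal property predicts that fold 0 (+) = sum. *)
let rec sum = function [] -> 0 | x :: xs -> x + sum xs

let () =
  assert (fold 0 (+) [] = sum []);
  assert (fold 0 (+) [1; 2; 3; 4] = sum [1; 2; 3; 4])
```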

So I've always been very careful whenever I teach students about the fold function, because I think it has a very challenging spec for students to wrap their heads around.

If you're watching out for quantifiers, that also suggests that take-while will be one of the more challenging functions for students – takewhile p xs returns the longest prefix of xs satisfying the predicate p, and "longest prefix" is going to involve another higher-rank quantified formula.
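One plausible way to write that specification down (my formulation, not the paper's) makes the quantifier structure explicit, writing ⊑ for the prefix ordering:

```latex
\mathsf{takewhile}\ p\ xs = ys
  \iff ys \sqsubseteq xs
  \;\land\; (\forall y \in ys.\ p\ y)
  \;\land\; (\forall zs.\ (zs \sqsubseteq xs \land \forall z \in zs.\ p\ z) \implies |zs| \le |ys|)
```

The last conjunct – "every other prefix whose elements all satisfy p is no longer than ys" – is exactly the universally quantified premise that makes the spec hard.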

The evidence in the paper seems weakly consistent with this idea, in that they report that take-while does seem harder for students, but it does not seem to be as much harder as I would have expected.

Anyway, I found this a really interesting paper!