Thursday, December 30, 2021

What is a programming language?

One of the more surprising facts about the discipline of programming language theory is that it is actually possible to define what programming languages are in a reasonably mathematically satisfying way.

  1. A language is a presentation of a (generalised) algebraic theory. Basically, think of a language as a set of generators and relations in the style of abstract algebra.

    You need to beef up the universal algebra story a bit to handle variables and binding (e.g., see the work of Fiore and Hamana on higher-order algebraic theories), but the core intuition – that a language is a set of generators for terms, plus a set of equations those terms satisfy – is exactly the right one.

    For example:

    1. the simply-typed lambda calculus
    2. regular expressions
    3. relational algebra
  2. A model of a language is literally just any old mathematical structure which supports the generators of the language and respects the equations.

    For example:

    1. we can model the typed lambda calculus using sets for types and mathematical functions for terms
    2. we can model regular expressions as denoting particular languages (i.e., sets of strings)
    3. we can model relational algebra expressions as sets of tuples
  3. A model of computation or machine model is basically a description of an abstract machine that we think can be implemented with physical hardware, at least in principle. So these are things like finite state machines, Turing machines, Petri nets, pushdown automata, register machines, circuits, and so on. Basically, think of models of computation as the things you study in a computability class.

    Nearly all models of computation are transition systems, where you have a set of states, and a transition function which tells you how a computation can change one state into another. There are lots of generalisations of this (relations instead of functions, I/O, system calls, message passing communication, etc.), and the slick way of accounting for this generality is by modelling transition systems as coalgebras (i.e., a set S paired with a coalgebra map step : S → F(S), where F is a functor). (A concrete instance is sketched after this list.)

    In addition to this, every model of computation also comes with a notion of observation – in its most basic form, how do you tell when a program has finished executing? Observations can get quite intricate: to model computational complexity, you have to be able to observe runtime and memory usage, and for concurrent programs you can observe intermediate program states (due to the possibility of concurrent interference). More practically, the compiler correctness theorem for something like CompCert says that the sequence of system calls the compiled binary produces will be the same as the one the C abstract machine will produce.

    I don't know a nice algebraic/categorical characterisation of the notion of observation, though probably Alexandra Silva does.

    The Church-Turing thesis characterises which abstract machines we think it is possible to physically implement.

    Some examples of models of computation are:

    1. Turing machines
    2. finite state machines
    3. RAM machine programs
  4. A language is a programming language when you can give at least one model of the language via a machine model.

    For example:

    1. the types of the lambda calculus can be viewed as partial equivalence relations over Gödel codes for some universal Turing machine, and the terms of a type can be assigned to equivalence classes of the corresponding PER.

    2. Regular expressions can be interpreted into finite state machines quotiented by bisimulation.

    3. A set in relational algebra can be realised as equivalence classes of B-trees, and relational algebra expressions as nested loops walking over these trees.

    Note that in all three cases we have to quotient the states of the machine model by a suitable equivalence relation to preserve the equations of the language's theory.

    This quotient is very important, and is the source of a lot of confusion. It hides the distinctions the language's theory wants to deny, but those are not always distinctions the programmer wants to ignore – e.g., is merge sort equal to bubble sort? As mathematical functions they surely are, but if we consider them as operations running on an actual computer, we will have strong preferences! (Conversely, the ability to equate radically different machine configurations is the reason why optimising compilers and refactoring are things.)

    Moreover, we have to set up this quotient so it agrees with the machine model's notion of observation – for example, we want exactly the strings matched by a regular expression to drive the corresponding finite state machine into an accepting state. This property is called adequacy, and it is what justifies the claim that this is a reasonable implementation of the programming language. (The sketch below works this out for regular expressions.)
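
To make this less abstract, here is a small OCaml sketch of my own (the names `regex`, `denotes`, `deriv`, and `accepts` are all illustrative, not anything standard): the regular-expression language as a set of generators, one mathematical model (languages as string predicates), and one machine model obtained via Brzozowski derivatives, which is exactly a coalgebra with a step function and an acceptance observation. Adequacy is checked here on a few examples rather than proved.

```ocaml
(* Illustrative sketch only: the language, a model, a machine model, and an
   adequacy check, for regular expressions. *)

(* 1. The language: generators of the algebraic theory of regular
      expressions.  The theory's equations (associativity of Seq, idempotence
      of Alt, and so on) live at the level of the theory, not the syntax. *)
type regex =
  | Empty                        (* the empty language *)
  | Eps                          (* the empty string *)
  | Chr of char
  | Seq of regex * regex
  | Alt of regex * regex
  | Star of regex

(* Does some split s = pre @ suf (with pre nonempty when strict) satisfy
   p pre and q suf? *)
let exists_split strict p q s =
  let rec go pre suf =
    ((not strict || pre <> []) && p (List.rev pre) && q suf)
    || (match suf with [] -> false | c :: cs -> go (c :: pre) cs)
  in
  go [] s

(* 2. A model: each expression denotes a set of strings, represented here as
      a membership predicate on character lists. *)
let rec denotes (r : regex) (s : char list) : bool =
  match r with
  | Empty -> false
  | Eps -> s = []
  | Chr c -> s = [c]
  | Seq (r1, r2) -> exists_split false (denotes r1) (denotes r2) s
  | Alt (r1, r2) -> denotes r1 s || denotes r2 s
  | Star r1 ->
    (* the first iteration must consume input, so the recursion terminates *)
    s = [] || exists_split true (denotes r1) (denotes (Star r1)) s

(* 3. A machine model: a transition system whose states are regexes, with
      step given by Brzozowski derivatives and observation given by
      nullability.  This is a coalgebra for F(S) = bool * (char -> S). *)
let rec nullable = function
  | Empty | Chr _ -> false
  | Eps | Star _ -> true
  | Seq (r1, r2) -> nullable r1 && nullable r2
  | Alt (r1, r2) -> nullable r1 || nullable r2

let rec deriv c = function
  | Empty | Eps -> Empty
  | Chr c' -> if c = c' then Eps else Empty
  | Seq (r1, r2) ->
    let d = Seq (deriv c r1, r2) in
    if nullable r1 then Alt (d, deriv c r2) else d
  | Alt (r1, r2) -> Alt (deriv c r1, deriv c r2)
  | Star r1 -> Seq (deriv c r1, Star r1)

(* run the machine: step through the string, then observe the final state *)
let accepts (r : regex) (s : char list) : bool =
  nullable (List.fold_left (fun state c -> deriv c state) r s)

(* 4. Adequacy, checked on examples: the machine accepts exactly the strings
      in the denotation. *)
let () =
  let r = Seq (Chr 'a', Star (Chr 'b')) in    (* ab* *)
  List.iter
    (fun s -> assert (accepts r s = denotes r s))
    [ []; ['a']; ['a'; 'b'; 'b']; ['b'; 'a'] ]
```

Note the quotient at work even in this toy setting: the machine has many syntactically distinct states (e.g. Alt (Empty, Eps) versus Eps) that the theory's equations identify, and adequacy is exactly the agreement between the denotational model and the machine's observation.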

In other words, a programming language is a language which has an adequate implementation in a model of computation. Note that this definition does not mandate the implementation strategy!

A common source of confusion arises from the fact that if you have a nice type-theoretic language (like the STLC), then:

  1. the term model of this theory will be the initial model in the category of models, and
  2. you can turn the terms into a machine model by orienting some of the equations the lambda-theory satisfies and using them as rewrites.

These oriented equations are where operational semantics comes from, at least for purely functional languages. As a result, we often abuse language and say that the simply typed lambda calculus "is" a programming language.
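
To make the orientation concrete, here is a minimal sketch of my own (illustrative names, untyped terms, and naive capture-permitting substitution, just to keep it short): the beta equation (λx. e) e' = e[e'/x], read left-to-right, becomes a small-step rewrite.

```ocaml
(* Illustrative sketch only: orienting the beta equation of the lambda
   calculus turns its equational theory into a rewrite system, i.e. a
   small-step operational semantics. *)
type term =
  | Var of string
  | Lam of string * term
  | App of term * term

(* e[v/x], ignoring variable capture for brevity *)
let rec subst x v = function
  | Var y -> if x = y then v else Var y
  | Lam (y, e) -> if x = y then Lam (y, e) else Lam (y, subst x v e)
  | App (e1, e2) -> App (subst x v e1, subst x v e2)

(* The equation (\x. e) e' = e[e'/x], oriented left-to-right and applied to
   the first redex found, without reducing under lambdas.  None means the
   term is in normal form for this strategy. *)
let rec step = function
  | App (Lam (x, body), arg) -> Some (subst x arg body)
  | App (e1, e2) ->
    (match step e1 with
     | Some e1' -> Some (App (e1', e2))
     | None ->
       (match step e2 with
        | Some e2' -> Some (App (e1, e2'))
        | None -> None))
  | Var _ | Lam _ -> None

let rec normalise e =
  match step e with Some e' -> normalise e' | None -> e

(* (\x. x) y  rewrites to  y *)
let () = assert (normalise (App (Lam ("x", Var "x"), Var "y")) = Var "y")
```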

But in fact, the lambda calculus is a presentation of an algebraic theory, and it is the same programming language regardless of whether it is implemented by term rewriting or by compilation to x86 machine code.

Wednesday, December 1, 2021

Obliteratingly Fast Parser Combinators

Jeremy Yallop, Ningning Xie, and I have a new draft out about combinator parsing, entitled Fusing Lexing and Parsing. The abstract is short but sweet:

Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity, but harmful for performance.

We show how to fuse together separately-defined lexers and parsers, drastically improving performance without compromising modularity. Our staged parser combinator library, flap, provides a standard parser combinator interface, but generates specialized token-free code that runs several times faster than ocamlyacc on a range of benchmarks.
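
If you haven't used parser combinators before, the following is a rough sketch (mine, not flap's actual signature) of what a conventional parser combinator interface looks like in OCaml; flap's contribution is to take grammars written against an interface in this spirit and generate fused, token-free code for them.

```ocaml
(* Illustrative only: a conventional parser combinator signature, not flap's
   actual API. *)
module type PARSER = sig
  type 'a t
  val return : 'a -> 'a t                     (* succeed, consuming nothing *)
  val map : ('a -> 'b) -> 'a t -> 'b t
  val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t  (* sequencing *)
  val ( <|> ) : 'a t -> 'a t -> 'a t          (* alternation *)
  val char : char -> char t                   (* match one character *)
  val many : 'a t -> 'a list t                (* zero or more repetitions *)
  val parse : 'a t -> string -> 'a option
end

(* A grammar is then just an ordinary OCaml value built from the combinators,
   e.g. a parser for the regular language ab* that returns the number of b's: *)
module Example (P : PARSER) = struct
  open P
  let ab_star : int t =
    char 'a' >>= fun _ ->
    map List.length (many (char 'b'))
end
```

The compositional style of this interface is exactly what makes naive implementations slow; the point of the draft is that staging recovers the performance without giving up the interface.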

I'm excited about this for three reasons.

First, even though the performance is way better this time around, going from competitive with lex+yacc to 3-9 times faster, the implementation is actually simpler than in our 2019 paper – we don't need a complex staging pipeline because we introduce an IR which makes all the invariants needed for fast code generation lexically apparent.

Second, really fast parsers are potentially a powerful tool for applications like IDEs. If you are fast enough, you are free from having to mess around with things like incremental parsing, because if you can parse a whole editor buffer in a millisecond it's fine to just reparse everything on every keystroke. Imagine what could happen if we could fuse in typechecking, too! That's currently just wishful thinking (aka "future work"), of course, but since typechecking is mostly a tree traversal, it's not totally hopeless.

Finally, at a meta-level this is a good illustration of the incorrectness of the standard engineering proverb "fast, good, cheap – pick two". This advice presumes that we already lie on the Pareto frontier, so that the best we can do is manage tradeoffs. But that's actually a false assumption – it is possible to do things which are unconditionally better along all three dimensions.

No source code is ready for release yet, but Jeremy is likely to release an OPAM package for it in due course.