Friday, August 10, 2018

Polarity and bidirectional typechecking

This past July, Max New asked me about the relationship between bidirectional typechecking and semantic notions like polarization and call-by-push-value. I told him it was complicated, and that I would try to write something about the relationship. I was reminded of this because, a couple of days ago, Conor McBride wrote a blog post where he laid out how he approaches bidirectional type systems. This is a fantastic post, and you should read it (probably twice or more, actually).

In his post, he remarks:

I like my type systems to be bidirectional. Different people mean different things by the word “bidirectional”, and I’m one of them. Some people relate the notion to “polarity”, and I’m not one of them.

I'm someone who would like to join the club Conor abjures, but I don't know how to!

It sure looks like bidirectional typechecking has a deep relationship to polarization or call-by-push-value, but I don't know how to make it work correctly. So, the goal of this post is to explain why people think it has some deep semantic import, and then to talk about the mismatch that creates a problem I don't know the right way to handle. (Also, I apologize in advance for the lack of citations -- I really want to get this post out the door. If I didn't cite a paper you like (or even wrote), please link to it in the comments.)

The reason that people think that bidirectional typechecking must have a deep semantic meaning arises from how it works out in the simply-typed lambda calculus. Let's try writing some things down and seeing what happens. First, recall that bidirectional typechecking categorizes all terms as either "checking" or "synthesizing" and introduces two mutually recursive typing judgements for them.

\[\newcommand{\bnfalt}{\;\;|\;\;} \newcommand{\To}{\Rightarrow} \newcommand{\From}{\Leftarrow} \newcommand{\rule}[2]{\frac{\displaystyle \begin{matrix} #1 \end{matrix}}{\displaystyle #2}} \newcommand{\check}[3]{{#1} \vdash {#2} \From {#3}} \newcommand{\synth}[3]{{#1} \vdash {#2} \To {#3}} \newcommand{\lam}[1]{\lambda {#1}.\;} \newcommand{\inj}[2]{\mathsf{in}_{#1}({#2})} \newcommand{\case}[5]{\mathsf{case}({#1}, \inj{1}{#2} \to {#3}, \inj{2}{#4} \to {#5})} \newcommand{\match}[2]{\mathsf{match}\;{#1}\;\mathsf{of}\;[ #2 ]} \newcommand{\arm}[2]{{#1} \to {#2}} \newcommand{\downshift}[1]{\downarrow{#1}} \newcommand{\upshift}[1]{\uparrow{#1}} \newcommand{\thunk}[1]{\left\{{#1}\right\}} \newcommand{\return}[1]{\mathsf{return}\;#1} \newcommand{\fun}[1]{\lambda\;#1} \newcommand{\checkn}[3]{{#1} \rhd {#2} \From {#3}} \newcommand{\checkp}[2]{{#1} \leadsto {#2}} \newcommand{\spine}[4]{{#1} \vdash {#2} : {#3} \gg {#4}} \newcommand{\unit}{()} \]

\[ \begin{array}{llcl} \mbox{Types} & A & ::= & b \bnfalt A \to B \\ \mbox{Contexts} & \Gamma & ::= & \cdot \bnfalt \Gamma, x:A \\ \mbox{Checking} & e & ::= & \lam{x}{e} \bnfalt t \\ \mbox{Synthesizing} & t & ::= & x \bnfalt t\;e \bnfalt e : A \\ \end{array} \]

We'll start with a pretty minimal calculus -- we've got a base type \(b\), and functions. Contexts work as usual, giving variables their types, but terms are divided into checking terms and synthesizing terms. Checking terms are either introduction forms (just lambda-abstraction, here) or any synthesizing term (the intuition is that if we can infer a type for a term, we can surely check its type as well). Synthesizing terms are either variables (we can just look up their type in the context), applications, or explicitly annotated terms.

\[\begin{matrix} \rule{ \check{\Gamma, x:A}{e}{B} } { \check{\Gamma}{\lam{x}{e}}{A \to B} } & \rule{ \synth{\Gamma}{t}{A \to B} & \check{\Gamma}{e}{A} } { \synth{\Gamma}{t\;e}{B} } \\[1em] \rule{ \synth{\Gamma}{t}{A} & A = B } { \check{\Gamma}{t}{B} } & \rule{ \check{\Gamma}{e}{A} } { \synth{\Gamma}{e : A}{A} } \\[1em] \rule{ x:A \in \Gamma } { \synth{\Gamma}{x}{A} } & \end{matrix}\]

You can see the intro/elim pattern in the first line -- when we check a lambda-abstraction (an introduction form) against a type \(A \to B\), we put the variable \(x\) in the context at type \(A\), and check the body at type \(B\). When we apply a function, we first infer a type \(A \to B\) for the function expression, which gives us a type \(A\) to check the argument against.
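
To make the two judgements concrete, here is a minimal OCaml sketch of a checker for this system. This is my own illustration rather than code from any particular implementation, and the datatype and function names are invented:

type tp = Base | Arrow of tp * tp

type chk =                        (* checking terms *)
  | Lam of string * chk
  | Syn of syn
and syn =                         (* synthesizing terms *)
  | Var of string
  | App of syn * chk
  | Annot of chk * tp

type ctx = (string * tp) list

let rec check (ctx : ctx) (e : chk) (a : tp) : bool =
  match e, a with
  | Lam (x, body), Arrow (a1, a2) -> check ((x, a1) :: ctx) body a2
  | Lam _, _ -> false
  | Syn t, _ ->
      (* the switch rule: synthesize a type, then compare it to the expected one *)
      (match synth ctx t with Some b -> b = a | None -> false)

and synth (ctx : ctx) (t : syn) : tp option =
  match t with
  | Var x -> List.assoc_opt x ctx
  | Annot (e, a) -> if check ctx e a then Some a else None
  | App (f, e) ->
      (match synth ctx f with
       | Some (Arrow (a, b)) when check ctx e a -> Some b
       | _ -> None)

Note that the only place where two types are compared for equality is the switch case of check.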

So far, this is pretty standard stuff. Now, let's tweak the rules slightly.

\[\begin{matrix} \rule{ \check{\Gamma, x:A}{e}{B} } { \check{\Gamma}{\lam{x}{e}}{A \to B} } & \rule{ \synth{\Gamma}{t}{A \to B} & \check{\Gamma}{e}{A} } { \synth{\Gamma}{t\;e}{B} } \\[1em] \rule{ \color{red}{\synth{\Gamma}{t}{b}} } { \color{red}{\check{\Gamma}{t}{b}} } & \color{red}{\mbox{(No annotation rule)}} \\[1em] \rule{ x:A \in \Gamma } { \synth{\Gamma}{x}{A} } & \end{matrix}\]

Now, we've made two changes. First, we've deleted the annotation rule, and second, we've restricted the checking/synthesis switch to only occur at base types \(b\).

  1. The effect of deleting the annotation rule is that it is no longer possible to write beta-reducible terms. A term in the simply-typed lambda-calculus written \((\lam{x:b}{x})\;y\) which reduces to \(y\) would be written \((\lam{x}{x} : b \to b)\;y\) in a bidirectional system -- but without annotations these terms can no longer be written. (Both changes are reflected in the code sketch after this list.)

  2. The effect of restricting the check/synth switch rule is that it is no longer possible to write eta-expandable terms. If \(f : b \to b \to b\) and \(x : b\), then the term \(f\;x\) would synthesize the type \(b \to b\) in the original system. However, it no longer typechecks in our restricted system. To make it work, we have to eta-expand the term, so that we write \(\lam{y}{f\;x\;y}\) instead. This now checks against \(b \to b\) as we expect.
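
In terms of the OCaml sketch above, the two changes amount to deleting the annotation case and only allowing the switch at the base type. Again, this is just an illustrative sketch:

(* The restricted checker: no annotation rule, and the checking/synthesis
   switch happens only at the base type.  Everything else is unchanged. *)
let rec check ctx e a =
  match e, a with
  | Lam (x, body), Arrow (a1, a2) -> check ((x, a1) :: ctx) body a2
  | Syn t, Base -> synth ctx t = Some Base   (* switch only at base type *)
  | _ -> false

and synth ctx t =
  match t with
  | Var x -> List.assoc_opt x ctx
  | Annot _ -> None                          (* the annotation rule is gone *)
  | App (f, e) ->
      (match synth ctx f with
       | Some (Arrow (a, b)) when check ctx e a -> Some b
       | _ -> None)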

So the joint effect of these two restrictions is that only beta-normal, eta-long terms typecheck. The reason these terms are so important is that any two beta-eta equal terms will have the same normal form. So having a simple, easy characterization of these normal forms is really great! Moreover, this characterization is easy to extend to products:

\[\begin{matrix} \rule{ \check{\Gamma, x:A}{e}{B} } { \check{\Gamma}{\lam{x}{e}}{A \to B} } & \rule{ \synth{\Gamma}{t}{A \to B} & \check{\Gamma}{e}{A} } { \synth{\Gamma}{t\;e}{B} } \\[1em] \rule{ } { \check{\Gamma}{()}{1} } & \mbox{(No unit elimination rule)} \\[1em] \rule{ \check{\Gamma}{e_1}{A_1} & \check{\Gamma}{e_2}{A_2} } { \check{\Gamma}{(e_1, e_2)}{A_1 \times A_2} } & \rule{ \synth{\Gamma}{e}{A_1 \times A_2} & i \in \{1,2\} } { \synth{\Gamma}{\pi_i(e)}{A_i} } \\[1em] \rule{ \synth{\Gamma}{t}{b} } { \check{\Gamma}{t}{b} } & \mbox{(No annotation rule)} \\[1em] \rule{ x:A \in \Gamma } { \synth{\Gamma}{x}{A} } & \end{matrix}\]

This type system still characterizes normal forms in the STLC with units and products. Adding these constructors starts to give us a pattern:

  1. Introduction forms (lambda-abstractions, pairs, units) are checking.
  2. Elimination forms (applications, projections) are synthesizing.

Since units + pairs + functions are syntax for everything in cartesian closed categories, this is actually pretty wonderful. We seem to have a simple rule for characterizing beta-normal, eta-long forms.

But what happens when we try to add sums to the language? Let's try to follow our recipe, and see what happens:

\[\begin{matrix} \rule{ \check{\Gamma}{e}{A_i} & i \in \{1,2\} } { \check{\Gamma}{\inj{i}{e}}{A_1 + A_2} } & \rule{ \synth{\Gamma}{t}{A_1 + A_2} & \check{\Gamma, x_1:A_1}{e_1}{C} & \check{\Gamma, x_2:A_2}{e_2}{C} } { \check{\Gamma}{\case{t}{x_1}{e_1}{x_2}{e_2}}{C} } \end{matrix}\]

The introduction form seems to work. The elimination form is a bit more complicated -- it's the same syntax as always, but the checking/synthesis moding is a bit subtle. The expectation created by units/pairs/functions would be that both the scrutinee and the whole case form should synthesize, but expecting two branches with different contexts (i.e., \(\Gamma, x_1:A_1\) for \(e_1\) and \(\Gamma, x_2:A_2\) for \(e_2\)) to synthesize the same type is a morally dubious expectation (eg, it would not make sense in a dependently-typed setting). So we are led to say that case is checking, but that the scrutinee is synthesizing.
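
In terms of the OCaml sketch, the moding shows up like this: injections and case belong to the checking syntax, but the scrutinee of a case is a synthesizing term. (This repeats the restricted sketch from above so that it stands alone; units and pairs extend it in the same way and are omitted.)

type tp = Base | Arrow of tp * tp | Sum of tp * tp

type chk =
  | Lam of string * chk
  | Inj of int * chk                                 (* in_1(e), in_2(e) *)
  | Case of syn * (string * chk) * (string * chk)    (* case(t, in_1 x1 -> e1, in_2 x2 -> e2) *)
  | Syn of syn
and syn =
  | Var of string
  | App of syn * chk

let rec check ctx e a =
  match e, a with
  | Lam (x, body), Arrow (a1, a2) -> check ((x, a1) :: ctx) body a2
  | Inj (1, e'), Sum (a1, _) -> check ctx e' a1
  | Inj (2, e'), Sum (_, a2) -> check ctx e' a2
  | Case (t, (x1, e1), (x2, e2)), c ->
      (* the whole case checks against c, but its scrutinee must synthesize *)
      (match synth ctx t with
       | Some (Sum (a1, a2)) ->
           check ((x1, a1) :: ctx) e1 c && check ((x2, a2) :: ctx) e2 c
       | _ -> false)
  | Syn t, Base -> synth ctx t = Some Base
  | _ -> false

and synth ctx t =
  match t with
  | Var x -> List.assoc_opt x ctx
  | App (f, e) ->
      (match synth ctx f with
       | Some (Arrow (a, b)) when check ctx e a -> Some b
       | _ -> None)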

This imposes some restrictions on what does and doesn't count as a typeable term. For example, because \(\mathsf{case}\) is checking rather than synthesizing, we can never write an expression like:

\[a:((b \to A) + (b \to A)), x:b \vdash \case{a}{f}{f}{g}{g}\;x \From A\]

Instead of applying an argument to a case expression of function type, we have to push the arguments into the branches:

\[a:((b \to A) + (b \to A)), x:b \vdash \case{a}{f}{f\;x}{g}{g\;x} \From A\]

From the point of view of typing normal forms, this actually doesn't seem too bad, because most people would consider the second term simpler than the first, and so this gives us a "nicer" notion of normal form. However, this doesn't seem like a real explanation, since our rules permit things like the following:

\[\synth{f : b \to b, x:b+b}{f\;\case{x}{y}{y}{z}{z}}{b}\]

To get to a better explanation before the heat death of the universe, I'm going to skip over about 20 years of research, and jump straight to polarized type theory.

\[\begin{matrix} \mbox{Positive Types} & P & ::= & 1 \bnfalt P \times Q \bnfalt P + Q \bnfalt \downshift{N} \\ \mbox{Negative Types} & N & ::= & P \to N \bnfalt \upshift{P} \\ \mbox{Values} & v & ::= & () \bnfalt (v,v) \bnfalt \inj{i}{v} \bnfalt \thunk{t} \\ \mbox{Spines} & s & ::= & \cdot \bnfalt v\;s \\[1em] \mbox{Terms} & t & ::= & \return{v} \bnfalt \fun{\overrightarrow{\arm{p_i}{t_i}}} \bnfalt \match{x \cdot s}{\overrightarrow{\arm{p_i}{t_i}}} \\ \mbox{Patterns} & p & ::= & () \bnfalt (p,p') \bnfalt \inj{i}{p} \bnfalt \thunk{x} \\[1em] \mbox{Contexts} & \Gamma,\Delta & ::= & \cdot \bnfalt \Gamma, x:N \\ \mbox{Typing Values} & & & \check{\Gamma}{v}{P} \\ \mbox{Typing Spines} & & & \spine{\Gamma}{s}{N}{M} \\ \mbox{Typing Terms} & & & \checkn{\Gamma}{t}{N} \\ \mbox{Typing Patterns} & & & \checkp{p:P}{\Delta} \\ \end{matrix}\]

The key idea in polarized type theory is to divide types into two categories, the positive types (sums, strict products, and suspended computations, denoted by \(P\)) and the negative types (basically, functions, denoted by \(N\)). Positive types are basically those that are eliminated with pattern matching, and the negative types are the ones that are eliminated by supplying arguments. Negative types can be embedded into positive types using the "downshift" type \(\downshift{N}\) (representing suspended computations) and positive types can be embedded into the negatives using the "upshift" \(\upshift{P}\) (denoting computations producing \(P\)'s).

The funny thing about this setup is that despite arising from meditation upon invariants of proof theory, we end up with a syntax that is much closer to practical functional programming languages than the pure typed lambda calculus! For example, syntax for polarized calculi tends to have pattern matching. However, one price of this is a proliferation of judgements. We usually end up introducing separate categories of values (for introducing positive types) and spines (argument lists for eliminating negative types), as well as terms (how to put values and spines together in computations, as well as introducing negative types) and patterns (how to eliminate positive types).
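
Before walking through the rules, here is how the syntax might be rendered as OCaml datatypes (again a sketch of my own, with invented constructor names); even at this level, the four syntactic categories are visible:

(* Polarized syntax as mutually recursive datatypes. *)
type ptp =                          (* positive types P *)
  | One
  | Times of ptp * ptp
  | Plus of ptp * ptp
  | Down of ntp                     (* downshifted negative type *)
and ntp =                           (* negative types N *)
  | Arrow of ptp * ntp
  | Up of ptp                       (* upshifted positive type *)

type value =                        (* values introduce positive types *)
  | Unit
  | Pair of value * value
  | Inj of int * value
  | Thunk of term                   (* a suspended computation *)
and spine = value list              (* argument lists eliminating negative types *)
and term =                          (* terms introduce negative types *)
  | Return of value
  | Fun of (pat * term) list        (* pattern-matching lambda *)
  | Match of string * spine * (pat * term) list
and pat =                           (* patterns eliminate positive types *)
  | PUnit
  | PPair of pat * pat
  | PInj of int * pat
  | PThunk of string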

Now, let's talk through the rules. First up is the \(\check{\Gamma}{v}{P}\) judgement for checking the type of positive values.

\[\begin{matrix} \rule{} { \check{\Gamma}{()}{1} } & \rule{ \check{\Gamma}{v}{P} & \check{\Gamma}{v'}{Q} } { \check{\Gamma}{(v,v')}{P \times Q} } \\[1em] \rule{ \check{\Gamma}{v}{P_i} & i \in \{1,2\} } { \check{\Gamma}{\inj{i}{v}}{P_1 + P_2} } & \rule{ \checkn{\Gamma}{t}{N} } { \check{\Gamma}{\thunk{t}}{\downshift{N}} } \end{matrix}\]

The rules for units, pairs and sums are the same as always. The rule for downshift says that if a term \(t\) checks at a negative type \(N\), then the thunked term \(\thunk{t}\) will check against the downshifted type \(\downshift{N}\).

We'll see the rules for terms in a bit, but next will come the rules for spines, in the judgement \(\spine{\Gamma}{s}{N}{M}\). This judgement says that if the spine \(s\) is applied to a head of type \(N\), it will produce a result of type \(M\). In this judgement, the type \(N\) is an algorithmic input, and the type \(M\) is an output.

\[\begin{matrix} \rule{ } { \spine{\Gamma}{\cdot}{N}{N} } \qquad \rule{ \check{\Gamma}{v}{P} & \spine{\Gamma}{s}{N}{M} } { \spine{\Gamma}{v\;s}{P \to N}{M} } % \\[1em] % \rule{ \forall i < n.\; & \checkp{p_i:P}{\Delta_i} & \checkn{\Gamma, \Delta_i}{t_i}{M} } % { \spine{\Gamma}{\match{x \cdot s}{\overrightarrow{\arm{p_i}{t_i}}^{i < n}}}{\upshift{P}}{M} } \end{matrix}\]

The first rule says that if you have an empty argument list then the result is the same as the input, and the second rule says that if \(v\) is a value of type \(P\), and \(s\) is an argument list sending \(N\) to \(M\), then the extended argument list \(v\;s\) sends the function type \(P \to N\) to \(M\).

With values and spines in hand, we can talk about terms, in the term typing judgement \(\checkn{\Gamma}{t}{N}\), which checks that a term \(t\) has the type \(N\).

\[\begin{matrix} \rule{ \check{\Gamma}{v}{P} } { \checkn{\Gamma}{\return{v}}{\upshift{P}} } \qquad \rule{ \forall i < n.\; & \checkp{p_i:P}{\Delta_i} & \checkn{\Gamma, \Delta_i}{t_i}{N} } { \checkn{\Gamma}{\fun{\overrightarrow{\arm{p_i}{t_i}}^{i < n}}}{P \to N} } \\[1em] \rule{ x:M \in \Gamma & \spine{\Gamma}{s}{M}{\upshift{Q}} & \forall i < n.\; \checkp{p_i:Q}{\Delta_i} & \checkn{\Gamma, \Delta_i}{t_i}{\upshift{P}} } { \checkn{\Gamma}{\match{x \cdot s}{\overrightarrow{\arm{p_i}{t_i}}^{i < n}}}{\upshift{P}} } \end{matrix}\]

The rule for \(\return{v}\) says that we can embed a value \(v\) of type \(P\) into the upshift type \(\upshift{P}\) by immediately returning it. Lambda abstractions are pattern-style -- instead of a lambda binder \(\lam{x}{t}\), we give a whole list of patterns and branches \(\fun{\overrightarrow{\arm{p_i}{t_i}}}\) to check at the type \(P \to N\). As a result, we need a judgement \(\checkp{p_i:P}{\Delta_i}\) which gives the types of the bindings \(\Delta_i\) of the pattern \(p_i\), and then we check each \(t_i\) against the result type \(N\).

The match statement \(\match{x\cdot s}{\overrightarrow{\arm{p_i}{t_i}}}\) also has similar issues in its typing rule. First, it finds a variable in the context, applies some arguments to it to find a value result of type \(\upshift{Q}\), and then pattern matches against type \(Q\). So we check that the spine \(s\) sends \(M\) to the type \(\upshift{Q}\), and then check that the patterns \(p_i\) yield variables \(\Delta_i\) at the type \(Q\), and then check the \(t_i\) against the type \(\upshift{P}\).

Restricting the type at which we can match forces us to eta-expand terms of function type. Also, these rules omit a side-condition for pattern coverage. (I have an old blog post about how to do that if you are curious.)

Both lambda-abstraction and application/pattern-matching need the judgement \(\checkp{p:P}{\Delta}\) to find the types of the bindings. The rules for these are straightforward: \[\begin{matrix} \rule{ } { \checkp{\thunk{x} {:} \downshift{\!N}}{x:N} } & \rule{ } { \checkp{\unit : 1}{\cdot} } \\[1em] \rule{ \checkp{p_1 : P_1}{\Delta_1} & \checkp{p_2 : P_2}{\Delta_2} } { \checkp{(p_1,p_2) : P_1 \times P_2}{\Delta_1, \Delta_2} } & \rule{ \checkp{p:P_i}{\Delta} & i \in \{1,2\} } { \checkp{\inj{i}{p} : P_1 + P_2}{\Delta} } \end{matrix}\]

Units yield no variables at type \(1\), pair patterns \((p_1, p_2)\) return the variables of each component, coproduct injections \(\inj{i}{p}\) return the variables of the sub-pattern \(p\), and thunk patterns \(\thunk{x}\) at type \(\downshift{N}\) return that variable \(x\) at type \(N\).
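
Putting the four judgements together, here is a minimal OCaml sketch of the whole checker, reusing the datatypes from the sketch above. As in the rules, it ignores pattern coverage, and the function names are invented:

type ctx = (string * ntp) list      (* contexts hold only negative hypotheses *)

(* Gamma |- v <= P *)
let rec check_value (ctx : ctx) (v : value) (p : ptp) : bool =
  match v, p with
  | Unit, One -> true
  | Pair (v1, v2), Times (p1, p2) -> check_value ctx v1 p1 && check_value ctx v2 p2
  | Inj (1, v'), Plus (p1, _) -> check_value ctx v' p1
  | Inj (2, v'), Plus (_, p2) -> check_value ctx v' p2
  | Thunk t, Down n -> check_term ctx t n
  | _ -> false

(* Gamma |- s : N >> M   (N is an input, the result M is an output) *)
and check_spine (ctx : ctx) (s : spine) (n : ntp) : ntp option =
  match s, n with
  | [], _ -> Some n
  | v :: rest, Arrow (p, n') ->
      if check_value ctx v p then check_spine ctx rest n' else None
  | _ :: _, Up _ -> None

(* Gamma |> t <= N *)
and check_term (ctx : ctx) (t : term) (n : ntp) : bool =
  match t, n with
  | Return v, Up p -> check_value ctx v p
  | Fun arms, Arrow (p, n') ->
      List.for_all
        (fun (pt, body) ->
          match check_pat pt p with
          | Some delta -> check_term (delta @ ctx) body n'
          | None -> false)
        arms
  | Match (x, s, arms), Up _ ->
      (match List.assoc_opt x ctx with
       | None -> false
       | Some m ->
           (match check_spine ctx s m with
            | Some (Up q) ->
                List.for_all
                  (fun (pt, body) ->
                    match check_pat pt q with
                    | Some delta -> check_term (delta @ ctx) body n
                    | None -> false)
                  arms
            | _ -> false))
  | _ -> false

(* p : P ~> Delta *)
and check_pat (p : pat) (tp : ptp) : ctx option =
  match p, tp with
  | PUnit, One -> Some []
  | PPair (p1, p2), Times (tp1, tp2) ->
      (match check_pat p1 tp1, check_pat p2 tp2 with
       | Some d1, Some d2 -> Some (d1 @ d2)
       | _ -> None)
  | PInj (1, p'), Plus (tp1, _) -> check_pat p' tp1
  | PInj (2, p'), Plus (_, tp2) -> check_pat p' tp2
  | PThunk x, Down n -> Some [ (x, n) ]
  | _ -> None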

At this point, it sure looks like we have a perfect bidirectional type system for a polarized calculus. What's the problem? The problem is that I palmed a card! Here's the relevant bit of the grammar I kicked things off with: \[\begin{matrix} \ldots \\ \mbox{Contexts} & \Gamma,\Delta & ::= & \cdot \bnfalt \Gamma, \color{red}{x:N} \\ \ldots \end{matrix}\]

The context \(\Gamma\) has been restricted to only contain variables of negative type. It doesn't allow variables of positive type! And, I don't know how to add it in the "right" way. If we wanted positive variables (in fact, call-by-push-value only has positive variables), we could add them in the following way:

\[\begin{matrix} \mbox{Values} & v & ::= & () \bnfalt (v,v) \bnfalt \inj{i}{v} \bnfalt \thunk{t} \bnfalt \color{red}{u} \\ \mbox{Contexts} & \Gamma,\Delta & ::= & \cdot \bnfalt \Gamma, x:N \bnfalt \Gamma, \color{red}{u:P} \\ \mbox{Patterns} & p & ::= & () \bnfalt (p,p') \bnfalt \inj{i}{p} \bnfalt \thunk{x} \bnfalt \color{red}{u} \\[1em] \end{matrix}\] So we add value variables \(u\) to the syntax of values, and so we have to also add them to contexts, and also extend pattern matching with them to bind values. Then, the rules for these things would look like the following: \[\begin{matrix} \rule{ } {\checkp{u:P}{u:P}} & \rule{ u:Q \in \Gamma & Q = P } { \check{\Gamma}{u}{P} } \end{matrix}\]

So a variable pattern at value type simply binds the variable at that type, and when we use a value variable we have to check that the type in the context matches the type that we're checking the term at.

And that's the wrong thing to do! The bidirectional recipe says that we should check equality of types only when we switch between checking and synthesis, and so while this rule might or might not work, it clearly doesn't arrange the information flow properly, since we have a random-looking subsumption test in the value variable rule.
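
Concretely, in terms of the checker sketch above, the offending case would look something like this (a hypothetical extension, with invented names): a separate context of positive hypotheses, and a value-variable case whose entire content is an equality test.

(* Hypothetical: checking a value variable against a positive type reduces
   to a type-equality test against the context -- an equality test buried
   inside a checking rule, rather than at a checking/synthesis switch. *)
let check_value_var (pos_ctx : (string * ptp) list) (u : string) (p : ptp) : bool =
  match List.assoc_opt u pos_ctx with
  | Some q -> q = p
  | None -> false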

Some additional idea is needed, and I'm not sure what, yet.

Monday, August 6, 2018

Category Theory in PL research

Over on Twitter, Max New wrote:

Maybe the reason category theory seems useless for PL to many people is that it only gets really useful when you start doing "sophisticated" stuff like internal, enriched and fibered categories and the intros for computer scientists don't get to those things.

Max is a very sharp person, but I do not agree with this!

The most useful applications of category theory in my own research have all been very, very elementary, to the point that people who (unlike me) are genuinely good at category theory are always a little surprised that I can get away with such primitive machinery.

Basically, the key idea is to formulate models of languages/libraries in terms of categories of structured sets with structure-preserving maps (aka homomorphisms). And, well, that's it!

I don't have a really crisp explanation of the process of figuring out how to go from "I have a property I'd like to represent mathematically" to "sets should have the following structure and the following properties should be preserved by the homomorphisms", but I can give an example to illustrate the recipe.

  1. Let's suppose we want to control the memory usage of programs.

  2. Think of a type as a set, together with some operations talking about the property in question. In this case, let's say that a type is a pair \((X \in \mathrm{Set}, w_X : X \to \mathbb{N})\), where \(X\) are the elements of the type and the "weight function" \(w_X\) gives you a number for each value telling you how much memory gets used.

  3. Now, let's define what the functions \(f : (X, w_X) \to (Y, w_Y)\) should be. Since the values of a type form a set, obviously a function between two types should be a function \(f : X \to Y\) on the underlying sets. However, we have a weight function for each type, and so we should do something sensible with them. The first, most obvious idea is to say that maps should preserve the weight exactly -- i.e., that \(w_Y(f(x)) = w_X(x)\). (A small code sketch of this recipe appears below.)

    This means that memory is never allocated, but it also means that memory is never de-allocated. So since we probably do want to allow deallocation, it makes sense to weaken this condition to something more like \(w_Y(f(x)) \leq w_X(x)\).

  4. Now, we go off and prove things about this language. In short order we will discover that this is a monoidal closed category, and so we can give a linearly-typed language for memory-controlled programs.

This is basically the idea in the late Martin Hofmann's 2003 paper, Linear types and non-size-increasing polynomial time computation. He called this category the category of "length spaces". This paper was a revelation to me, because it was the first time that I really understood how plain old ordinary mathematics could be used to model genuinely computational phenomena.
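
To make the recipe concrete, here is a minimal OCaml sketch of the length-space idea. It is my own illustration rather than Hofmann's construction, the names are invented, and the non-size-increasing condition is merely tested on sample inputs rather than proved:

(* An object is a set together with a weight function; a morphism is a
   function on the underlying sets that does not increase weight. *)
type 'a space = { weight : 'a -> int }

(* In the semantics this is a proof obligation; here we only test it. *)
let non_size_increasing (a : 'a space) (b : 'b space)
    (f : 'a -> 'b) (samples : 'a list) : bool =
  List.for_all (fun x -> b.weight (f x) <= a.weight x) samples

(* Identity and composition preserve the condition, which is what makes
   these objects and maps form a category. *)
let identity x = x
let compose g f x = g (f x)

(* Example: lists weighted by their length; List.tl never increases weight. *)
let list_space : 'a list space = { weight = List.length }
let _ = non_size_increasing list_space list_space List.tl [ [ 1; 2; 3 ]; [ 4 ] ]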

There are a lot of examples like this -- here are four more I like:

  • Domains and continuous functions

    The great grand-parent of using structured sets to model computations is, of course, domain theory. In this setting, a type is a set along with a complete partial order on it, and functions must be continuous with respect to this order. The intuition is that an always-nonterminating program is the least informative program, and that as programs loop on fewer and fewer inputs, they get more and more defined. Continuity basically means that giving a program a more-defined input will never make it loop more often.

    Steve Vickers's book Topology via Logic was how I learned to think about domains in a way that brought the computational intuition to the forefront.

  • Nominal sets and equivariant maps

    In languages like ML, Haskell and Java, it is permissible to create pointers and test them for pointer equality, but it is not permissible to compare pointers for order (you can use == but you can't use <). This means that the order in which pointers are created is not observable, a fact which is very important for both the equational theory and optimization of programs.

    To model this, we can introduce a set of locations (aka names, or atoms). To model the fact that we can't observe the order, we can say that we want the meanings of programs to be independent of permutations of the names -- as long as we don't identify two locations, the program can't tell if we reorder the locations in the heap.

    So, we can abstractly model the idea of reordering locations by saying that a type is a pair \((X, a : Perm \times X \to X)\), where \(X\) is a set of values, and the "action" \(a\) tells you how to rename all the values in the set when you are given a permutation. Then, a morphism \(f : (X, a) \to (Y, b)\) will be a function such that \(b(\pi, f(x)) = f(a(\pi, x))\) -- that is, we get the same answer whether we rename the argument to \(f\) or apply the function and then rename.

    I learned this from Andy Pitts's work on nominal logic -- here is a nice introduction for SIGLOG news which he wrote.

  • Partial orders and monotone maps

    My student Michael Arntzenius has been looking at higher-order database query languages, and one of the ideas we have been working with (in the language Datafun) has been inspired by database query languages and dataflow analysis, where recursive queries are formulated as fixed points of monotone functions on partial orders.

  • Ultrametric spaces and non-expansive maps

    In joint work with Nick Benton, we used the category of (wait for it) complete 1-bounded bisected ultrametric spaces and non-expansive maps to model reactive computations.

    Our goal was to model reactive programs, which have the property of causality. That is, an interactive system can only produce an output at time \(n\) from the first \(n\) inputs -- it can't look into the future to make an action.

    We began with this standard definition of causality, which was defined on streams, and then -- after reading the classic paper of Alpern and Schneider Defining Liveness -- we noticed that it corresponded to nonexpansiveness when streams are given the Cantor metric (i.e., the distance between two streams is \(2^{-n}\) when they first differ at the \(n\)-th position). This led us to look at categories of metric spaces, and we were able to adapt some work of Lars Birkedal and his collaborators to devise a new language for reactive programming.
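
To make the last example a bit more concrete, here is a minimal OCaml sketch of the Cantor metric on streams. This is only an illustration (the names are invented, and the comparison is cut off at a finite depth):

(* Streams as a head plus a delayed tail. *)
type 'a stream = Cons of 'a * (unit -> 'a stream)

(* The distance is 2^{-n}, where n is the first position at which the two
   streams differ; streams that agree up to the cut-off count as distance 0. *)
let distance ?(depth = 30) (s : 'a stream) (t : 'a stream) : float =
  let rec go n (Cons (x, s')) (Cons (y, t')) =
    if x <> y then 2.0 ** float_of_int (-n)
    else if n >= depth then 0.0
    else go (n + 1) (s' ()) (t' ())
  in
  go 0 s t

(* Example: two constant streams that differ at position 0 are at distance 1. *)
let rec zeros = Cons (0, fun () -> zeros)
let rec ones = Cons (1, fun () -> ones)
let _ = distance zeros ones   (* = 1.0 *)

(* Causality corresponds to non-expansiveness: a stream function f is causal
   exactly when distance (f s) (f t) <= distance s t for all s and t. *)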

One point I want to make is that while this is not a mechanical process, it is also "ordinary mathematics". In the last two cases, we had to try very hard to understand the problems we were trying to solve, but we didn't need vast amounts of categorical machinery. We really only needed elementary definitions from category theory, and we needed it because it was the technology that let us recognise when we had actually found a nice algebraic solution to our problem.

Friday, August 3, 2018

The Worm Ouroboros

Here's a quick question: can the behaviour of the following bit of C code depend upon the semantics of the Python programming language?

int sum(size_t len, char *array) { 
  int total = 0;
  for (int i = 0; i < len; i++) { 
    total += array[i];
  }
  return total;
}

Now, the very fact that I am asking this question probably makes you suspicious. (This suspicion is correct and natural, and incidentally also indicates that you basically understand Derrida's critique of the metaphysics of presence.) Luckily, as lecturers go, I'm a very generous grader: I'll accept both no and yes as answers!

The "no" answer is easy enough to explain: the C language specification defines an (informal) abstract machine, and the behaviour of the sum function can be explained entirely in terms of that abstract machine. Python never shows up in this semantics, so how on Earth could it possibly be relevant to the behaviour of this code?

To get to "yes", let's look at another bit of C code (please forgive the total lack of error handling, as well as the absence of all includes):

int foo(char *filename) { 
   int fd = open("foo.txt", O_RDONLY);  // open a file
   struct stat s;                        
   fstat(fd, &s);                      // we only want s.st_size
   //
   // memory-mapped IO (every time, I remember Perlis epigram 11)
   //
   char *array = 
    (char *) mmap(NULL, s.st_size, PROT_READ, MAP_SHARED, fd, 0);  
   //
   // Now we can add up the byte values:
   int total = sum(s.st_size, array);
   //
   // We're done with the file now.
   munmap(array, s.st_size);
   close(fd);

   return total;
}

This function will open a file "foo.txt", and then will get the file's size using fstat, and then use that size information to call mmap, mapping that file into memory, storing the start pointer at array. Then we'll call sum to total the bytes of the array. But note that:

  1. The semantics of array depend on the semantics of the file system.
  2. The file system might be a user-mode file system, implemented using something like FUSE or WinFsp.
  3. You could write (and various people have written) user-mode filesystems that expose a Mercurial repository as a file system.
  4. Mercurial is implemented in Python.

Let me emphasize: on a modern computer, a pointer dereference could very well lead to the execution of Python code. Indeed, seeing an assembly listing like this one:

sum:
        test    edi, edi
        jle     .L4
        mov     rdx, rsi
        lea     eax, [rdi-1]
        lea     rsi, [rsi+1+rax]
        mov     eax, 0
.L3:
        movsx   ecx, BYTE PTR [rdx]
        add     eax, ecx
        add     rdx, 1
        cmp     rdx, rsi
        jne     .L3
        rep ret
.L4:
        mov     eax, 0
        ret

offers no guarantee that anything "close to the metal" will actually happen. This is the normal state of affairs, which is almost certainly true about every computer upon which this article is being read. (If it's not, you've probably somehow compiled your browser into a kernel module, which is security-wise an incredibly reckless thing to do.)

You might wonder, what's the moral of this story? Actually, there isn't a moral -- instead, there's a question. Namely, how can we prove that this kind of thing is a sensible thing to do, ever? The key intellectual difficulty is that the rationale for why this kind of program can work appears to be circular.

In general, the easy case of abstraction is when things build up in layers. We start with a piece of hardware, and on this hardware we run an operating system, which (by running in kernel mode to access various hardware primitives) implements address space virtualization. Since user-mode code can never touch the hardware primitives to manipulate things like page tables, we can conclude (assuming the implementation of virtual memory is correct) that the virtual memory abstraction is seamless -- user code can't tell (without relying on OS calls or side-channels like timing) how memory is implemented.

However, what memory-mapped IO and user-mode file systems let you do is to move part of the implementation of the memory abstraction into user space. Now the correctness of the Python implementation depends upon the correctness of the virtual memory, but the correctness of the virtual memory abstraction depends upon the correctness of the Python implementation!

If A implies B, and B implies A, then we are not licensed to conclude that A and B hold. (This is because it's bad reasoning! Since false implies itself and false implies everything, if this principle were true in general then everything would be true.) And yet: something very close to this kind of reasoning is essential for building systems from modular components, or even writing concurrent algorithms. (We call it rely-guarantee, usually.)

Wednesday, July 25, 2018

A Typed, Algebraic Approach to Parsing

(Note: If you're on the POPL PC, please don't read this post until after your reviewing is done.)


It's been over 20 years since Graham Hutton and Erik Meijer wrote their wonderful JFP paper on monadic combinator parsing. Do we need another paper about it?

Jeremy Yallop and I think the answer is yes! We have a new draft paper out, A Typed, Algebraic Approach to Parsing.

The fundamental issue is that parser combinators are very pretty, but suffer from two serious problems.

  1. First, parser combinator libraries tend to be implementation-defined.

    Context-free grammars have the very nice property that it's very easy to exactly define the set of strings that a parser for that grammar should match. This, in turn, makes it easy to prove that there are many grammar transformations which do not change that set. For example, if the BNF production A recognizes a language, and B recognizes a different language, then the production A | B will recognize the union of the two languages recognized by A and B. As a result, swapping the order of an alternative from A | B to B | A will not change the language recognized.

    However, most implementations of parser combinators do not satisfy these kinds of properties! Swapping the order of the alternatives usually can change the parse tree returned from a parser built from parser combinators. This mismatch between the high-level explanation and what the code actually does makes debugging and refactoring harder, and also makes it hard to understand what a combinator parser does without knowing the actual implementation of the library.

  2. Next, combinator parsing tends to be inefficient.

    The standard parser combinator algorithm is usually some variant of backtracking recursive descent, and that means that parse times can become exponential in the length of the input. Basically, in the worst case you have to save a backtrack point after seeing each character, which gives you exponentially many branches you may have to explore. (The sketch after this list shows the shape of such combinators.)

    There are two general classes of reaction to this problem. First, the combinator library can explicitly expose backtracking primitives to the user of the library, so that programmers can control when backtracking can occur. This does resolve the problem, but at a drastic price: there is basically no specification of the accepted language beyond the actual code of the parser implementation.

    The second (and somewhat better) reaction is to switch to another parsing technology. The most notable such alternative is PEG (aka packrat) parsing. Now, the good thing about packrat parsing is that it actually does have predictable linear-time parsing, but that comes at a pretty high cost: the memory usage of PEG parsers is linear in the input, and choice is not commutative.
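
To illustrate where the backtracking cost comes from, here is a minimal sketch of naive list-of-successes combinators in OCaml. The names are invented and this is not any particular library; the point is just that alternation keeps every branch alive, which is what can blow up in the worst case:

(* A parser returns every way of consuming a prefix of the input. *)
type 'a parser = char list -> ('a * char list) list

let return (x : 'a) : 'a parser = fun input -> [ (x, input) ]

let chr (c : char) : char parser = fun input ->
  match input with
  | c' :: rest when c' = c -> [ (c, rest) ]
  | _ -> []

(* Alternation keeps both branches: a backtrack point at every use. *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser =
  fun input -> p input @ q input

(* Sequencing explores every result of p. *)
let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun input -> List.concat_map (fun (x, rest) -> f x rest) (p input)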

Now, this all sounds like a pretty strong argument for sticking with traditional parser generators, whether LR (eg, yacc) or LL (eg, ANTLR). A BNF spec has a clear declarative reading, and the generated parsers are efficient, so what's not to like? Two things:

  1. Honestly, the BNF notation for context-free grammars has its own weaknesses. For example, BNF has no notion of binding or variable, and this makes it quite hard to build grammars in a modular fashion. For example, it's a good idea for all the sequence constructions in a language to uniformly treat delimiters as separators or terminators, but with BNF you have to manually define each sequence type, making it easy to screw this up. If you had the ability to build abstractions, you could define a single sequence construction primitive which would let you make the choice once and for all. This is easy to do with parser combinators, but missing from most parser generators (Menhir being a notable exception).

  2. Furthermore, a problem common to all table-generating parser generators is that their error reporting is horrible -- when there is a bug in the grammar taking you out of the handled language class, the parser generator vomits its internal data structures over the programmer, who has to then pick through the chunks to figure out what the problems were. It would be much better if errors could be reported in terms of the actual grammar the programmer wrote!

So Jeremy and I basically set out to fix all these problems. Our paper proceeds via the following steps:

  1. First, we found a better notation for context-free languages than BNF. Basically, if you take regular expressions and add a least fixed point operator to them, then you can recognize exactly the context-free languages. This is not a new observation; it's been known since at least the early 1970s. But the payoff of replacing nonterminals with fixed point operators is that there is now a very clear story on variable scope and binding. (A small sketch of such expressions appears after this list.)

  2. As most of you know, we don't actually want to allow arbitrary context-free languages, since some of them are inherently ambiguous. So the next thing we do is we define a type system for our context-free expressions, which can statically identify unambiguous grammars which can be parsed efficiently (ie, by non-backtracking recursive descent with a single token of lookahead).

    The benefit of this is that type checking is local and syntactic, and so all our errors can be reported in terms of the grammar that the programmer wrote.

  3. Next, we define a family of parser combinators which operate on typed grammars. Our parser combinators have a very simple implementation story -- there's no backtracking and no fancy lookahead, so the type of parsers is as simple as can be. Moreover, we can exploit the type information when parsing alternatives, by using a bit of lookahead to decide which branch to take.

  4. Our parser combinators have very predictable performance, but are still fairly slow, due to all the overhead of indirecting through function pointers (due to all the higher-order functions involved). So we use staged programming to eliminate this overhead. Basically, staging lets us eliminate all the overhead, leading to generated code which looks more or less exactly like the kinds of hand-coded recursive descent parsers that experts write by hand.
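
To give a feel for the first step, here is a minimal OCaml sketch of context-free expressions as regular expressions extended with variables and a least fixed point operator. The constructor names are invented, and this is not the paper's actual interface:

(* Context-free expressions: regular expressions plus Var and Fix. *)
type cfe =
  | Eps                        (* the empty string *)
  | Chr of char                (* a single character *)
  | Seq of cfe * cfe           (* concatenation *)
  | Alt of cfe * cfe           (* alternation *)
  | Star of cfe                (* Kleene star *)
  | Var of string              (* a grammar variable bound by Fix *)
  | Fix of string * cfe        (* least fixed point, binding a variable *)

(* Balanced parentheses:  mu p. eps | '(' p ')' p  *)
let parens : cfe =
  Fix ("p", Alt (Eps, Seq (Chr '(', Seq (Var "p", Seq (Chr ')', Var "p")))))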

The resulting parsers have excellent performance -- in our benchmarks, we outperform ocamlyacc-generated code.

So we get the abstraction benefits of parser combinators, good static error reporting about ill-formed grammars, and excellent performance to boot!

Tuesday, April 24, 2018

Are functional programs easier to verify than imperative programs?

Via Michael Arntzenius (@arntzenius) on Twitter, I saw that Hillel Wayne posted the following remark:

Lots of people say "FP is easier to analyze than imperative code because of purity" but whenever I ask for evidence people look at me like I'm crazy. So I'd like to make a challenge: I'll provide three imperative functions, and your job is to convert them into pure functions.

If you follow the link, you'll see his three programs. Each of them is basically a small imperative program, of the kind you might see a Python programmer write. I'm not going to post any correctness proofs of them, since my interest is rather to comment upon the difficulty of verification. Since I am too bad at concision to fit thoughts into 280 characters, I'm doing this via a blog post. :)

The high-order bit is that the three programs he suggested are equally easy/hard to verify in imperative and functional styles. This is because none of the programs make any significant use of procedural abstraction.

The difficulty of imperative programming arises from the combination of state, aliasing and procedure calls. Any two of these features can be handled without that much difficulty, but the combination of all three effectively makes reasoning (nearly) as difficult as correct concurrent programming.

At a high level, reasoning is easy (more honestly, feasible) when each construct in your language has a compositional specification -- when you can reason about the behaviour of a subprogram just in terms of its specification, rather than looking at its full source code each time.

Hoare logic gives you a compositional reasoning technique for reasoning about imperative updates. When data can be aliased, a specification must also say something about the aliasing expected by your program. The current best technique for these specifications is a generalization of Hoare logic called separation logic. (I should say something about how it works, but this post is already too long!)

The addition of full procedures makes the problem much more challenging. This is because the specification of a procedure call must state its effect, without saying anything about the specific program points from which it will be called. Otherwise you might as well have just inlined the function, and proved the correctness of a fully procedure-free program. Indeed, Leslie Lamport advocates doing just this, in his paper Composition: A Way to Make Proofs Harder. This is actually a reasonable position, if you are interested in verifying algorithms (which exist in isolation) rather than verifying programs (which always exist as parts of a larger program).

Intuitively, the difficulty is a lot like reasoning about cooperative concurrency -- when you call an arbitrary procedure, it's a bit like yielding to the scheduler and having it execute some other piece of code for a while before returning control to you. For this logical/conceptual concurrency to actually accomplish anything, the unknown code hidden in the procedure call has to do something to your data, but at the same time can't break any of your invariants. So you need to reason about data invariants, object ownership, and temporal protocols, all at the same time.

Matters get more complicated yet if you want to specify programs which pass around genuinely unknown functions (for instance, storing pointers to functions in the heap, a device which many programmers call "objects"). In this case, you can't inline the functions, because you don't know what they will be!

But simply rejecting OO (or higher-order state) doesn't work, because plenty of important programs rely upon it. For example, if you want to prove that your scheduler implementation implements a fair strategy, you have to come up with a protocol between the scheduler and the code that compiler writers generate. Similar difficulties arise if you want to prove that if your compiler generates a bunch of provably correct .o files, the whole program actually does work when they get linked together. (For example, when will a C program work, if you link it to a new malloc/free implementation?)

Removing any one of the three difficulties makes things easier, but still leads to interesting systems:

  • a) State + procedure calls ==> Idealized Algol. Idealized Algol is a language with higher-order functions and mutable variables, but lacking pointers. John Reynolds invented a beautiful generalization of Hoare logic (called specification logic) to reason about it in the early 80s. It's not well-known any more, but I really like it.
  • b) State + aliasing ==> basically, pointer programs without procedures. John Reynolds also invented the state-of-the-art technique for reasoning about these programs (called separation logic) in the early 2000s.
  • c) Aliasing + procedures ==> purely functional programming. Basically, we don't have to care who else can see our data if they can never change it. In what you might see as a running theme, John Reynolds also devised many of the techniques we use for reasoning about purely functional programs (such as parametricity) in the late 1970s and early 1980s.

Some related observations:

  1. This is, incidentally, why reasoning about performance of Haskell code is much harder than reasoning about its functional correctness. The value of a Haskell program never depends on any other parts of the program, which gives you a simple compositional model for functional correctness.

    But whether forcing a thunk takes time or not depends on whether other parts of your program have forced it already. So reasoning about the performance of Haskell code requires complex reasoning about aliasing. This is one reason why newer purely functional languages, such as Idris, Purescript, and Koka, are all call-by-value rather than call-by-need -- it makes their performance model compositional.

  2. Note that for building a real program, procedural decomposition is not optional. You just can't do without being able to break your program into pieces. Since this feature is basically non-negotiable, aliasable state becomes very dangerous. This is why people say that "imperative programming is harder than functional programming" -- the unstated assumption is that you have functions, and that state is aliasable pointer-like state.

    However, by making the treatment of state a little less lax, you can retain the ease of reasoning while still permitting the controlled use of state.

  3. The safe fragment of Rust is a mix of a) and c) -- it permits you to create aliases of pointers to data only when you don't use them to mutate. (Even though I doubt this was a direct inspiration, John Reynolds also pioneered this approach with his 1978 paper Syntactic Control of Interference. (ACM link, sorry.) Peter O'Hearn wrote a paper Syntactic Control of Interference, Revisited with a modernized approach to the semantics and typing rules.)

    If you add unsafe, then you need very fancy modern variants of separation logic to reason about the result. See RustBelt: Securing the Foundations of the Rust Programming Language, by Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, Derek Dreyer. (Don't feel obliged to look at this unless you are very interested -- the work is excellent but unavoidably very technical.)

    Indeed, semantically Haskell is very similar to Rust+unsafe in this regard -- the runST operator lets you create computations that use highly aliasable state, as long as it doesn't escape the scope of a computation. And the same fancy separation logic that works for proving the safety of Rust is needed to show the safety of Haskell! See A Logical Relation for Monadic Encapsulation of State: Proving contextual equivalences in the presence of runST by Amin Timany, Leo Stefanesco, Morten Krogh-Jespersen, and Lars Birkedal. (Both of these papers appeared at the same conference!)

Sunday, September 18, 2016

HOPE 2016 Talk Summaries

I'm attending ICFP 2016 in Nara, Japan (near Osaka) this year. In previous years, I've really enjoyed Edward Yang's conference talk summaries, and decided that this year, I would take notes about the talks I attended and put them online myself.

Today, I attended the HOPE 2016 workshop on higher-order programming with effects, and took the following notes. These are all rough notes, taken in haste and based on a highly-jetlagged listener's poor listening comprehension. So assume all errors and misunderstandings are my fault!

Keynote (Leo White)

Due to jetlag, I overslept and missed it! Alas.

Effects

  • Effects via capabilities, Fengyin Liu (speaker)

    The idea behind this work is to introduce explicit capability values that enable or disable effects. So you might have a capability c : IO, which must be passed to all IO operations. If you don't have the capability to pass to the IO operation, you can't invoke them.

    print : IO -> string -> unit

    Naively, though, there is an issue arising from the ability of closures to capture effect capabilities.

    So two function spaces are introduced A => B, which is the usual ML-style function space, and A -> B, which is a function space which cannot capture capabilities or ML-style functions.

    Interestingly, the full substitution lemma doesn't hold: you have to prove a value substitution theorem instead. This suggests to me that there might be a modal variant of this system, Davies-Pfenning style.

    Liu suggested some type equivalences like A -> (B => C) == A -> B -> C, because the outer -> prevents the capture of capabilities. They can't prove the soundness of this by induction, of course, so I suspect they'll eventually want a logical relation to establish soundness. (A logical relation for Dotty would be exciting -- maybe chat with Amal Ahmed & Nada Amin about this?)

    Polymorphism introduces some subtleties (ie, you might not know whether a type variable will be instantiated to a pure type), and they also worry about verbosity in practice (which might hinder adoption by Scala programmers).

  • A Logical Account of a Type-and-Effect-System, Kasper Svendsen (speaker)

    They have a parallel ML-style calculus with reads, writes and CAS. Locations are approximated in the type system with regions.

    • The region context is split into a public and private part, to control which locations you can get interference on.

    • Function types are annotated with which public, which private regions, and which read/write effects they have.

    • New regions can be introduced with dynamic extent. Private regions can be shared temporarily during parallel composition.

    • The type system makes region variables implicit -- this info is only in the typing derivation. So all terms look like ML terms.

    The goal is to prove the correctness of program transformations which rely on the semantics of effects. For example, a sequential composition can be equated to a parallel composition

    e; e' == e || e'

    when there is no interference between these two terms.

    He explains the intuition with a binary logical relation, which is formulated within the Iris program logic (paper at POPL 2015). Here, the logical relation is a per-type family of simulation relations, encoded as Hoare triples in a unary program logic.

    Here, they use a really neat feature of Iris. Since Iris has any monoidal resource, they use simulation relations as resources! Then they express the parallel composition via a separating conjunction of simulations, one for e and one for e'. The fact that e and e' don't interfere means that e' steps are stuttering steps for e, and vice-versa. They even let simulation relations share state, when the state is read-only.

  • Simple Dependent Polymorphic IO Effects, Amin Timany (speaker)

    They want a logical relation for a type-and-effect system for IO. The IO effects are like regexps, but have effect variables, channels, and quantification over what is written -- eg, cat would have the effect:

    (forall x. Rda; Wrb)*

    to indicate that it writes what it reads from a to b. They want the relation to show (say) that a buffered implementation of cat refines the unbuffered version, and also to validate the parallel composition transformation for non-interfering operations.

    The language of effects is very rich, so I don't have a good feel for what it can and can't express. Eg, the reasoning for the buffering optimization was entirely type-based, based on the effect subtyping order!

Verification

  • Concurrent Data Structures Linked in Time, German Delbianco (speaker)

    This talk was about Hoare-style verification of highly concurrent data structures, such as "concurrent snapshots".

    Eg, you have a shared array a, and a function write(i,x) which makes an atomic update to the array, and a function scan(), which returns a snapshot of the state of the array at a particular instant.

    He then presented Jayanti's algorithm for this, and gave a nice intuition for why the right notion of correctness for the snapshots was linearizability. What made the algorithm tricky is that the linearization points for the algorithm are not fixed (they can vary with the runtime inputs).

    Then he introduced the notion of events in the ghost state, which are organized for each thread into subject and environment events, and which are each connected to an instant of time. The variation in where linearization points occur could then arise via modifications of when each event happened -- which is an update on the "linked list" of time, which can be modelled as ghost state!

    This has all been verified in FCSL, a Coq-based program logic that Ilya Sergey and Aleks Nanevski are some of the main forces behind.

    Ralf Jung asked if they had verified any clients, and German said they had used it for a few small clients, but nothing big. Ilya said that people just assumed snapshots were "obviously" good, but no one had any really good/complex examples of them. He said that they hoped to figure some out now that they had a declarative spec of them.

  • Growing a Proof Assistant, William Bowman (speaker)

    WB began with an example of some notations for an informal type soundness proof. He pointed out that some of the notations (like BNF) are formal, and others (like defining let-binding in terms of lambdas) are informal.

    Then he pointed out that there are tools like Ott which compile BNF to Coq/Agda/LaTeX, and that most proof assistants (like Coq) have APIs to define notations, tactics, and plugins. Then he said that this wasn't enough, and that he wanted to build a proof assistant focused on extensibility.

    The core type theory was TT (basically MLTT with a countable universe hierarchy), and the extensibility framework was Racket (to reuse as much of the macro machinery as possible).

    Then he began explaining how cur could be used to implement pattern matching. Things like recursive functions required compile-time state to communicate information from outside the local syntax tree.

    Then he moved on to defining tactics, and gave a small language for this. He needed to support fully elaborating terms (so that user tactics could be defined solely on kernel syntax), and also a way to run cur code at expansion-time to support Mtac style tactics.

    This made me wonder about phase separation, but then he immediately set this aside, along with several other issues (such as how to preserve properties like parametricity while supporting reflection on syntax).

    Someone asked about how to combine extensions. WB replied that macros support this because of the architecture of Racket macros, which essentially support open recursion to permit multiple macros to run at once. He also remarked that this creates classical problems from OO like diamond inheritance.

Compilation

  • Algebraic Effects for Functional Programming, Daan Leijen (speaker)

    Daan Leijen gave a talk about adding algebraic effects to Koka. Koka had effect types based on monads, but (a) he wanted a better story for composition, and (b) he wanted to be able to handle exotic control operators (like yield, exceptions, and async/await) in a uniform way.

    He has a full compiler for effect handlers in Koka, with row-polymorphic effect typing. The compilation scheme uses a type-directed CPS translation to implement effect handlers targeting JS, .NET, JVM (which are rather direct-style). Then he talked a bit about the semantics and syntax of Koka to set the stage.

    Next, he talked about examples, starting with exceptions. He mentioned that row types in Koka permit duplicated labels. I wonder how Koka picks?

    I also wondered about using algebraic effects for the "mock object" pattern from OO testing.

    Then he showed iteration, and remarked that handlers permit the handler to control whether iteration stops or not. Then he showed asynchronous callbacks (Node-style), and showed that you could pass the handler's resume continuation as an argument.

    Then he talked about the compilation to managed code where first-class continuations don't exist. This uses a type-directed style to avoid using CPS when it doesn't need it. The main issue is with effect polymorphism, which he resolves by compiling it twice in both ways, and then using a worker-wrapper transformation to pick the right representation.

    SPJ asked if he depended upon a defined order of evaluation. Daan said this was about effects for Haskell, and handed off to Sam Lindley, who said that the situation in Haskell was the same as in Koka -- you have to say what the evaluation order is (?).

  • Administrative Normal Form, Continued, Luke Maurer

    CPS is good for optimization, but due to the costs, not everyone uses it. Flanagan pointed out that some CPS optimizations work in direct style, but it's not all of them.

    Eg, case-of-case often creates values that go to known call sites, which shouldn't be allocated. This work extends ANF to support these, and to some significant effect in heap allocation rates.

    Case-of-case can be solved by pushing in the context, and while there are engineering issues of code duplication, it can often be helpful. Many compilers do this.

    However, it doesn't work for recursive functions. In CPS this transformation happens "almost for free". That is, you can do the static argument transformation, and then inline the static continuation to get the beta-reductions to fire. This is "contification".

    So how does contification work in ANF? He notes that the static argument transformation only worked because the program was a tail call. So they tag functions which are always tail called (they are called "labels"). Then they add a new rewrite which moves outer cases into a label.

    He gave the operational semantics for join points, which are pretty sensible.

    Then he pointed out that it helps with list fusion. An unfold uses a step function state -> option (a x state), which lets you eliminate intermediate lists, but introduces useless Just/Nothing cruft. But the new label/case-of-case machinery handles this and reduces many of them, and some benchmarks kill all of them.

    This is in a new branch of GHC.

    I asked how they would know if they had all the new constructs they needed. SPJ said he didn't know any CPS optimizations that could not be handled now, but that he didn't have a proof. Someone asked about one benchmark which allocated less but ran more slowly. I didn't understand the answer, though. :( Sam Lindley asked about the connection to SSA and phi-functions. Luke said that labels are much like phi-functions in SSA.

Semantics

  • Functional models of full ground, and general, reference cells, Ohad Kammar

    How can we understand reference cells? They have multiple axes:

    • Locality -- freshness of newly allocated cells. (ie, ref 0 == ref 0 is always false)
    • Ground vs Non-ground -- can we point to computations which can modify the heap (ref (int -> int))?
    • Full storage -- semantics can depend on shape of the heap: (ref ref bool)

    This talk is about full ground state. Approach of the talk is denotational semantics for full ground state, to get an equational theory for programs. He wants to use it to analyze the value restriction, and relax it if possible. He also wants to validate runST.

    He found a bug in his semantics last night, so he will discuss the general techniques used rather than a particular model.

    • Levy introduced a trick to handle circular ground state, using names for reference types. (A la class names in OO).

    • A Kripke world is a finite set of locations, with a name for each location. A world morphism is a name-preserving injection. Worlds have a monoidal structure. The language lives in the functor category over worlds, and so can interpret the CCC structure; references at a world w are interpreted as the set of locations of that type in w.

    • To model computations we need a monad

    To decide which equations we want, we have the usual dereference/assignment/allocation equations. We don't have a complete axiomatization of state, so we have to just pick some; they picked a nontrivial one, the equation for runST.
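    (I'm guessing at what the "usual" equations are here, but I'd expect the standard interaction laws for lookup and update in the Plotkin-Power tradition, something like:)

    $$\begin{array}{lcl}
    \mathsf{let}\;x = \,!\ell\;\mathsf{in}\;\mathsf{let}\;y = \,!\ell\;\mathsf{in}\;M &\equiv& \mathsf{let}\;x = \,!\ell\;\mathsf{in}\;M[x/y] \\[0.3em]
    \ell := v;\; !\ell &\equiv& \ell := v;\; \mathsf{return}\;v \\[0.3em]
    \ell := v;\; \ell := w &\equiv& \ell := w
    \end{array}$$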

    Then he gave the first monad -- the wrong one -- which used ends to model stores (I don't have a good feel for ends). It validated masking, but not dereference. Then he gave a second monad which had too little structure, in that it modelled dereference but not masking.

    Then he gave a small type system for runST. It interpreted each region label in runST as a world (as above).

    Lars Birkedal asked about which equations he was interested in, and Ohad named some of them. Daan Leijen said he had done an operational proof of runST and didn't like it, and encouraged Ohad to keep going.

Monday, May 9, 2016

Löb's theorem is (almost) the Y combinator

As a computer scientist, one of the stranger sociological facts you'll encounter is that regular mathematicians tend to dislike computer science (and its cousin, formal logic). The reason for that is nicely encapsulated in the following anecdote that Danny Hillis tells about Richard Feynman and the design of the Connection Machine computer:

[...] Richard had completed his analysis of the behavior of the router, and much to our surprise and amusement, he presented his answer in the form of a set of partial differential equations. To a physicist this may seem natural, but to a computer designer, treating a set of boolean circuits as a continuous, differentiable system is a bit strange. Feynman's router equations were in terms of variables representing continuous quantities such as "the average number of 1 bits in a message address." I was much more accustomed to seeing analysis in terms of inductive proof and case analysis than taking the derivative of "the number of 1's" with respect to time. Our discrete analysis said we needed seven buffers per chip; Feynman's equations suggested that we only needed five. We decided to play it safe and ignore Feynman.

The decision to ignore Feynman's analysis was made in September, but by next spring we were up against a wall. The chips that we had designed were slightly too big to manufacture and the only way to solve the problem was to cut the number of buffers per chip back to five. Since Feynman's equations claimed we could do this safely, his unconventional methods of analysis started looking better and better to us. We decided to go ahead and make the chips with the smaller number of buffers.

Fortunately, he was right. When we put together the chips the machine worked.

I've put the relevant bit in bold. To most mathematicians, inductive proofs are the tool of last resort. You always try to weave the known properties of your object together in a creative way, and do an exhaustive case analysis only if all else fails. However, to logicians and computer scientists, induction is the tool of first resort – as soon as we have an object, we grind it to dust with induction and reassemble it bit by bit with the properties we want.

Regular mathematicians find this process roughly as appealing as filling in tax forms, but it does come with its compensations. One of those compensations is that we get to collect a wide variety of fixed point theorems, to justify an equally wide variety of recursive constructions.

The modal version of Löb's theorem is one of the more exotic fixed point theorems in logic. It is often glossed as representing the "abstract content" of Gödel's incompleteness theorem, which is the sort of explanation that conceals as much as it reveals. If you look at the proof, though, you'll see that its structure is a very close match for something very familiar to computer scientists – the proof of Löb's theorem is (almost) a derivation of the Y combinator!

To understand this, let's try formulating the fragment of provability logic we need as a natural deduction system. The types we will use are the following: $$ \begin{array}{lcl} A & ::= & b \bnfalt A \to B \bnfalt \mu a.\,A(a) \bnfalt \Box A \end{array} $$

This looks like the simply-typed lambda calculus, extended with recursive types and a special $\Box A$ type former. This is almost, but not quite, correct! $\Box A$ is indeed a modal operator -- in particular, it satisfies the axioms of the K4 modality. However, the recursive type $\mu a.\,A$ is not an ordinary recursive type, because unrestricted recursive types lead to logical inconsistency. Instead, $\mu a.\,A$ is a modal fixed point. That is, it satisfies the isomorphism: $$ \mu a.\,A \simeq A{[\Box(\mu a.\,A)/a]} $$

Hopefully, this will become a bit clearer if we present typing and inference rules for this logic. We'll use a variant of Davies and Pfenning's judgemental formulation of modal logic, modified to support the K4 modality rather than the S4 modality. The basic judgement will be $\judge{\Delta; \Gamma}{e}{A}$, where $\Delta$ are modal variables and $\Gamma$ are ordinary variables. $$ \begin{array}{ll} \rule{ x:A \in \Gamma} { \judge{\Delta; \Gamma}{x}{A} } & \\[1em] \rule{ \judge{\Delta; \Gamma, x:A}{e}{B} } { \judge{\Delta; \Gamma}{\fun{x}{e}}{A \to B} } & \rule{ \judge{\Delta; \Gamma}{e}{A \to B} & \judge{\Delta; \Gamma}{e'}{A} } { \judge{\Delta; \Gamma}{e\;e'}{B} } \\[1em] \rule{ \judge{\Delta; \Gamma}{e}{A{[\Box(\mu a.\,A)/a]}} } { \judge{\Delta; \Gamma}{\fold(e)}{\mu a.\,A} } & \rule{ \judge{\Delta; \Gamma}{e}{\mu a.\,A} } { \judge{\Delta; \Gamma}{\unfold(e)}{A{[\Box(\mu a.\,A)/a]}} } \\[1em] \rule{ \judge{\Delta; \Delta}{e}{A} } { \judge{\Delta; \Gamma}{\box(e)}{\Box A} } & \rule{ \judge{\Delta; \Gamma}{e}{\Box A} & \judge{\Delta, x:A; \Gamma}{e'}{C} } { \judge{\Delta; \Gamma}{\letv{\box(x)}{e}{e'}}{C} } \end{array} $$

The two key rules are the introduction and elimination rules for $\Box A$.

The elimination rule is the standard Davies-Pfenning rule. It says that if you can prove $\Box A$, you can add an $A$ to the modal context $\Delta$ and carry on with your proof.

The introduction rule is different, and weaker than the DP S4 rule. It says you can introduce a $\Box A$ if you can prove $A$ using only the modal hypotheses in $\Delta$: the variables in $\Gamma$ are deleted, and the ones in $\Delta$ are copied into the ordinary hypothetical context. This weakening captures the fact that in K4 you can't derive the comonadic counit $\Box A \to A$.

Once we have this, it becomes possible to derive Löb's theorem. Before we do that, let's look at the derivation of the Y combinator. To derive a fixed point at a type $P$, we introduce a recursive type $\mu a.\, a \to P$, and then we can add folds and unfolds to the standard definition of the Y combinator:

let y : (P → P) → P = 
  λf. let f' = f in                   
      let g =                         
         (λr. let r' = r in          
              f' (unfold r' (r')))
      in 
      (g (fold g))     

You could also write this more compactly as λf. let g = λx. f (unfold x x) in g (fold g), but I have introduced some spurious let-bindings for a reason which will become apparent in a second.

Namely, virtually the same construction works with Löb's theorem! We will take the modal recursive type $\mu a.\,a \to P$, and then add some boxing and unboxing to manage the boxes (I'll annotate the variables with their types so that you can see what's going on):

let lob : □(□P → P) → □P = 
  λf. let box(f') = f in                      // f' modal, □P → P
      let box(g) =                            // g modal, □(μa. a → P) → P
         box(λr. let box(r') = r in           // r' modal, μa. a → P
                 f' box(unfold r' box(r')))   // application type P
      in 
      box(g box(fold g))

Note that if you took $\Box$ to be the identity constructor, and erased all the $\box$'s from this term, then this proof of Löb's theorem is actually exactly the same as Curry's Y combinator!
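In fact, the fold/unfold version of the Y combinator above transliterates directly into Haskell, which may make the claim more concrete (my own rendering, with a newtype standing in for $\mu a.\, a \to P$):

-- Rec p plays the role of μa. a → P: Fold wraps a function Rec p -> p,
-- and unfold exposes it again.
newtype Rec p = Fold { unfold :: Rec p -> p }

y :: (p -> p) -> p
y f = g (Fold g)
  where g r = f (unfold r r)

-- Thanks to laziness this really does give recursion, e.g.
--   y (\rec n -> if n == 0 then 1 else n * rec (n - 1)) 5  ==  120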

However, people more typically take $\Box$ to mean something like "quotation" or "provability". What the proof shows is that there's nothing mysterious going on with self-reference and quotation -- or rather, it is precisely the same mystery as the universality of the lambda-calculus.
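As an aside: reading $\Box$ as an arbitrary Haskell Functor instead of the identity gives the type of the well-known loeb function (due, I believe, to Dan Piponi, and the jumping-off point for the Foner paper mentioned in the postscript below):

loeb :: Functor f => f (f a -> a) -> f a
loeb fs = xs where xs = fmap ($ xs) fs

-- The classic "spreadsheet" example: each cell may refer to the whole result.
--   loeb [const 1, \cells -> cells !! 0 + 1, \cells -> cells !! 1 * 2]
--     == [1, 2, 4]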

Postscript: One of the nice things about blogging is that you can be more informal than in a paper, but that informality is also one of its pitfalls. I got an email asking about the history of this system, and of course a blog post doesn't have a related work section! Basically, a few years ago I worked out a calculus giving a syntax for applicative functors. An applicative is basically a monoidal functor with a strength, and I noticed that the modal logic K is basically a monoidal functor without a strength. There matters stood until last year, when I read Kenneth Foner's Haskell Symposium paper Getting a Quick Fix on Comonads, which is about programming with Löb's theorem. This was interesting, and after reading Wikipedia's proof of Löb's theorem, I saw that (a) the proof was basically Curry's paradox, and (b) I could adapt the rules for applicatives to model the modal logic K4, which could (with a small bit of tinkering) be given a calculus very similar to the one for applicatives. (In fact, the proof I give of Löb's theorem here is a transliteration of the proof on Wikipedia into the system in this blog post!)

I should emphasize that I haven't really proved anything about these systems: I think the proofs should be straightforward, but I haven't done them. Luckily, at least for K, I don't have to -- recently, Alex Kavvos has put a draft paper on the arXiv, System K: a simple modal λ-calculus, which works out the metatheory of K with a dual-zone Davies/Pfenning-style calculus. If you need syntax for applicative functors minus strength, do take a look!