Monday, February 15, 2021

Five (and a Half) Derivatives in Language Theory

Thanks to the influence of machine learning, differentiating programs has become a big business. However, there are a lot of other things in programming language theory which are also derivatives, and people sometimes get confused about the inter-relationships. So I decided to lay out some of them and some of their interconnections.

1. Brzozowski Derivatives

The first kind of derivative is the humble, yet beloved, Brzozowski derivative. Given a character set \(\Sigma\) and a regular language \(L \subseteq \Sigma^\ast\), the Brzozowski derivative of \(L\) with respect to the character \(c \in \Sigma\) is

\[ \delta_c(L) = \left\{ w \in \Sigma^\ast \mid c \cdot w \in L \right\} \]

That is, it consists of the words of \(L\) that begin with \(c\), with that leading character removed. It can be defined as a transformation on regular expressions \(r\) as follows:

\[ \begin{array}{lcll} \delta_c(0) & = & 0 & \\ \delta_c(r_1 + r_2) & = & \delta_c(r_1) + \delta_c(r_2) & \\ \delta_c(c') & = & \epsilon & \mbox{when } c = c'\\ \delta_c(c') & = & 0 & \mbox{when } c \neq c' \\ \delta_c(\epsilon) & = & 0 & \\ \delta_c(r_1 \cdot r_2) & = & \delta_c(r_1) \cdot r_2 & \mbox{when $r_1$ not nullable} \\ \delta_c(r_1 \cdot r_2) & = & \delta_c(r_1) \cdot r_2 + \delta_c(r_2) & \mbox{when $r_1$ nullable} \\ \delta_c(r \ast) & = & \delta_c(r) \cdot r\ast & \\ \end{array} \]
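These clauses transcribe directly into a small matcher. Here is a minimal sketch in Python (the tuple encoding and all names are my own, not from any standard library):

```python
# Regular expressions as tuples: ("empty",) is 0, ("eps",) is epsilon,
# plus ("chr", c), ("alt", r1, r2), ("seq", r1, r2), and ("star", r).
EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    """Does r accept the empty string?"""
    tag = r[0]
    if tag in ("empty", "chr"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def deriv(c, r):
    """The Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return ("alt", deriv(c, r[1]), deriv(c, r[2]))
    if tag == "seq":
        head = ("seq", deriv(c, r[1]), r[2])
        return ("alt", head, deriv(c, r[2])) if nullable(r[1]) else head
    return ("seq", deriv(c, r[1]), r)  # star: d(r*) = d(r) . r*

def matches(r, w):
    """w is in L(r) iff the iterated derivative is nullable."""
    for c in w:
        r = deriv(c, r)
    return nullable(r)
```

Without simplification the intermediate regexes can grow, so practical implementations memoise and simplify, but the clauses themselves are exactly the table above.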

The Brzozowski derivative is called a derivative for two reasons. First, it is linear with respect to \(+\) on regular expressions -- i.e., \(\delta_c(r_1 + r_2) = \delta_c(r_1) + \delta_c(r_2)\). Second, it satisfies the derivative rule for monomials \(\delta_c(c^{n+1}) = c^n\). You may recall that the high-school rule for derivatives says \(\delta_c(c^{n+1}) = (n+1) \times c^n\), but recall that in regular expressions, \(r + r = r\), so \((n+1) \times r = r\).

However, this is a weak notion of derivative, which shares few properties with other derivatives. The reason is that the rule for products does not match the product rule for high-school derivatives -- if the Brzozowski derivative satisfied the high-school rule, it would be

\[ \delta_c(r_1 \cdot r_2) = \delta_c(r_1) \cdot r_2 + r_1 \cdot \delta_c(r_2) \]

2. Differential algebras

If we try to change the definition of Brzozowski derivative to match the high school rule, we get what is called a differential algebra. A differential algebra is a semiring \(S\), plus a collection of unary functions \(\delta_i : S \to S\) which are called the derivations. The axioms for derivations mirror the rules for differentiating polynomials -- the motivating model for differential algebras is a semiring of polynomials with addition and multiplication of polynomials as the semiring structure, and the derivation is the derivatives with respect to the polynomial's variable. The rules we get are:

\[ \begin{array}{lcll} \delta_c(0) & = & 0 & \\ \delta_c(r_1 + r_2) & = & \delta_c(r_1) + \delta_c(r_2) & \\ \delta_c(c') & = & \epsilon & \mbox{when } c = c'\\ \delta_c(c') & = & 0 & \mbox{when } c \neq c' \\ \delta_c(\epsilon) & = & 0 & \\ \delta_c(r_1 \cdot r_2) & = & \delta_c(r_1) \cdot r_2 + r_1 \cdot \delta_c(r_2) & \\ \delta_c(r \ast) & = & r\ast \cdot \delta_c(r) \cdot r\ast & \\ \end{array} \]

Note that the only things that changed are the clauses for concatenation and the Kleene star. The axioms of a differential algebra actually say nothing about the Kleene star, because they are specified for semirings, but to handle full regular languages you need to extend the definition so that the derivation respects the equation \(r \ast = \epsilon + r \cdot r\ast\). If memory serves, Dexter Kozen has studied this extension. (Even if memory doesn't serve, he's a safe bet...!)

If you look at this definition for a while, you realize that the original Brzozowski derivative removes a single character \(c\) from the front of a string, and the differential removes a single \(c\) from anywhere in the string.
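To make this concrete, here is a self-contained Python sketch (the tuple encoding and all names are my own) implementing both the Brzozowski clauses and the differential clauses. It can check that the differential of \((ab)\ast\) with respect to \(a\) matches exactly the strings obtained by deleting one \(a\) from somewhere in a word of \((ab)\ast\):

```python
# Regexes as tuples: ("empty",), ("eps",), ("chr", c),
# ("alt", r1, r2), ("seq", r1, r2), ("star", r).
EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    tag = r[0]
    if tag in ("empty", "chr"):
        return False
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def deriv(c, r):
    """Brzozowski derivative: strips c from the front of a word."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return ("alt", deriv(c, r[1]), deriv(c, r[2]))
    if tag == "seq":
        head = ("seq", deriv(c, r[1]), r[2])
        return ("alt", head, deriv(c, r[2])) if nullable(r[1]) else head
    return ("seq", deriv(c, r[1]), r)  # star

def matches(r, w):
    for c in w:
        r = deriv(c, r)
    return nullable(r)

def differential(c, r):
    """Differential-algebra derivative: deletes a c from anywhere."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return ("alt", differential(c, r[1]), differential(c, r[2]))
    if tag == "seq":  # the high-school product rule
        return ("alt", ("seq", differential(c, r[1]), r[2]),
                       ("seq", r[1], differential(c, r[2])))
    return ("seq", r, ("seq", differential(c, r[1]), r))  # r* . d(r) . r*
```

For example, the differential of \((ab)\ast\) with respect to \(a\) works out to \((ab)\ast \cdot b \cdot (ab)\ast\), which accepts "bab" and "abb" (one \(a\) deleted from "abab") but not "ab" itself.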

For regular languages, this is a peculiar thing to want to do, but it suddenly becomes a lot more interesting when we move from semirings of polynomials to semiring categories of polynomial functors.

3. Derivatives of Data Types

One interesting factoid is that sets "look like" a semiring. The Cartesian product \(A \times B\) and the disjoint union \(A + B\) obviously don't satisfy the semiring equations on the nose, but if you replace the equality symbol with isomorphisms, then they do. If you take this analogy seriously and try to categorify the notion of a semiring, you get what is variously called a rig category, semiring category, or bimonoidal category.

Instead of a set plus two binary operations, we have a category with two monoidal structures with the semiring equations becoming isomorphisms. The exact commuting diagrams needed were worked out in the early 70s, independently by Laplaza and Kelly.

One semiring category that is very useful for programming is the category of polynomial functors. Most inductive datatypes, like lists, binary trees, and so on, can be understood as least fixed points of sums and products. We can formalise this idea by giving a grammar of sums, products, constants, and the identity:

\[ \begin{array}{lcll} F,G & ::= & \mathrm{Id} & \\ & | & \underline{0} \\ & | & F \oplus G & \\ & | & \underline{1} \\ & | & F \otimes G & \\ & | & K(A) & \mbox{where } A \in \mathrm{Set} \\ \end{array} \]

Every term of this grammar can be interpreted as a functor:

\[ \newcommand{\interp}[1]{[\![ #1 ]\!]} \begin{array}{lcl} \interp{F} & : & \mathrm{Set} \to \mathrm{Set} \\ \interp{\mathrm{Id}} & = & X \mapsto X \\ \interp{\underline{0}} & = & X \mapsto 0 \\ \interp{F \oplus G} & = & X \mapsto \interp{F}\,X + \interp{G}\,X \\ \interp{\underline{1}} & = & X \mapsto 1 \\ \interp{F \otimes G} & = & X \mapsto \interp{F}\,X \times \interp{G}\,X \\ \interp{K(A)} & = & X \mapsto A \end{array} \]

So you can see that \(\oplus\) is interpreted as the coproduct functor, \(\otimes\) is the product functor, \(K(A)\) is the constant functor, and \(\mathrm{Id}\) is interpreted as the identity functor.

The least fixed point of one of these polynomials is an inductive type. So to model lists of type X, we would define the list functor

\[ \mathrm{ListF} \triangleq K(1) \oplus (K(X) \otimes \mathrm{Id}) \]

and then take its least fixed point.

Since we have a semiring category, we can ask if it has a categorical analogue of a derivation, to give it differential structure. And these polynomials do! The derivative with respect to the variable \(\mathrm{Id}\) can be defined as

\[ \begin{array}{lcll} \delta(K(A)) & = & 0 & \\ \delta(\underline{0}) & = & 0 & \\ \delta(F \oplus G) & = & \delta(F) \oplus \delta(G) & \\ \delta(\mathrm{Id}) & = & 1 & \\ \delta(\underline{1}) & = & 0 & \\ \delta(F \otimes G) & = & \delta(F) \otimes G \oplus F \otimes \delta(G) & \\ \end{array} \]

If you have a functor for a binary tree:

\[ \mathrm{TreeF} \triangleq \underline{1} \oplus (\mathrm{Id} \otimes \mathrm{Id}) \]

Then the differential is: \[ \delta(\mathrm{TreeF}) = (\underline{1} \otimes \mathrm{Id}) \oplus (\mathrm{Id} \otimes \underline{1}) \]

This tells you whether the left or the right position was chosen. The utility of this is that given an element \(x \in X\) and an element of \(\delta(F)\,X\), you can "fill in the gap" to get an element of \(F\,X\). That is, you can define:

\[ \mathit{fill}_F : \delta(F)\,X \times X \to F\,X \]

If you take \(X\) to be the type of trees \(\mu(\mathrm{TreeF})\), this gives you an indication of whether to take the left or right subtree, plus the tree you didn't follow. Then a list of these things gives you a path into a tree -- which is just Huet's famous zipper.
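Concretely, for binary trees this all fits in a few lines. In this Python sketch (the encoding and all names are my own), a derivative layer records which child the hole is in plus the sibling subtree, fill plugs a subtree into one layer, and a list of layers (a zipper context) rebuilds the whole tree:

```python
# Trees: "leaf", or a pair (left, right) for a node.
# A derivative layer of TreeF is ("L", sibling) when the hole is in
# the left child (sibling is the right subtree), or ("R", sibling)
# when the hole is in the right child.

def fill(layer, t):
    """fill : d(F) X * X -> F X, specialised to binary trees."""
    side, sibling = layer
    return (t, sibling) if side == "L" else (sibling, t)

def plug(context, t):
    """A context is a list of layers, innermost first; plugging the
    subtree t into each layer in turn rebuilds the whole tree."""
    for layer in context:
        t = fill(layer, t)
    return t
```

Navigating down a tree while accumulating layers, and calling plug to go back up, gives exactly the zipper operations.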

The exposition so far largely glosses Conor McBride's famous paper The Derivative of a Regular Type is the Type of its One-Hole Contexts.

The precise connection to linearity and derivatives was explored by Abbott, Altenkirch, Ghani and McBride in their paper Derivatives of Containers, using significantly more categorical machinery. In their paper, Abbott et al explain how the \(\mathit{fill}\) operation is really a linear function

\[ \mathit{fill}_F : \delta(F)\,X \otimes X \multimap F\,X \]

and how \(\delta(F)\) can be viewed as giving a kind of tangent for \(F\) at \(X\).

4. Differential Linear Logic

The appearance of linear types here should be delightful rather than shocking. We think of the usual derivative as associating a tangent vector space to each point of an original space, and some of the primordial models of linear type theory are built from vector spaces: finite-dimensional vector spaces and linear transformations form a model of the multiplicative-additive fragment of classical linear logic (and even of the exponentials, when the underlying field is finite), while Banach spaces and short maps give a simple model of intuitionistic linear logic including the full exponential.

As a result, we should expect to be able to make differentiation make sense within linear type theory, and since a type theory is the initial object in its class of models, this means that derivative-like constructions are something we should look for any time linearity arises.

The pure form of this recipe can be found in Ehrhard and Regnier's differential linear logic. One of the really cool features of this line of work is that derivatives arise as an extension of the usual structural rules -- you just adjust how variables are treated, and derivatives appear like magic. One fact about this line of work is that it has tended to focus on proof nets rather than the sequent calculus, which has made it less accessible than one would like. What solved this issue for me was Marie Kerjean's A Logical Account for Linear Partial Differential Equations.

I do want to admit that despite starting off this section saying you shouldn't be shocked, actually I am -- this stuff really seems like sorcery to me. So even though I believe intellectually that everything should work out like this, I'm still astounded that it does. (It's rather like the feeling I get whenever I run a program I did a machine-checked correctness proof of.)

5. Automatic Differentiation

These days, the most famous kind of differentiation in language theory is, of course, automatic differentiation, which takes a program and rewrites it to produce a new program computing the derivative of the original program. Automatic differentiation comes in two main varieties, called forward mode and reverse mode.

The idea of forward mode is incredibly simple -- you just treat real numbers as an abstract datatype, and replace your original real number implementation with a new one based on dual numbers. Intuitively, a dual number is a pair \(a + b \epsilon\), where \(\epsilon\) is just a formal infinitesimal which is "nilpotent" (i.e., \(\epsilon^2 = 0\)).

Then the data abstraction properties of the typed lambda calculus ensure that when you run the program with the dual number implementation of reals, you are returned a pair of the original value and its partial derivatives (i.e. the Jacobian matrix). Dan Piponi blogged about this over 15 years ago(!), but even though I read it at the time, the algorithm was so simple I failed to recognise its importance!
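Here is a minimal sketch of the dual-number trick in Python (the Dual class and the diff function are my own names, not from any particular library): overloading arithmetic on (value, epsilon-coefficient) pairs makes the derivative fall out of ordinary evaluation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A dual number a + b*eps, with eps^2 = 0."""
    val: float
    eps: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def diff(f, x):
    """Derivative of f at x: seed the infinitesimal part with 1."""
    return f(Dual(x, 1.0)).eps
```

For example, diff(lambda x: x * x * x + 2 * x, 3.0) evaluates the polynomial once over dual numbers and yields 29.0, which is 3x² + 2 at x = 3.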

Reverse mode is more complicated -- it comes from the observation that in machine learning, you are generally optimising a function \(R^n \to R\), where the result is some scoring function. Since \(n\) might be very large and the dimensionality of the result is just 1, you can achieve superior computational complexity by computing the transpose of the Jacobian rather than the Jacobian itself.

When you do so, the resulting syntactic program transformations all look like they are doing fancy continuation manipulations, and their semantic counterparts all look like manipulations of the dual space. So the syntactic continuation/double-negation operations reflect the fact that finite-dimensional vector spaces are isomorphic to their double duals (i.e., \(V^{\ast\ast} \simeq V\)).
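A toy version makes the reversed flow visible. This is not how production reverse-mode systems work (they record a tape during the forward pass rather than re-evaluating subtrees), but the backward distribution of "adjoints" below is the shape that the continuation-flavoured transformations elaborate; the expression encoding and names are my own.

```python
# Expressions: ("var", i), ("lit", c), ("add", a, b), ("mul", a, b).
def eval_expr(env, e):
    tag = e[0]
    if tag == "var":
        return env[e[1]]
    if tag == "lit":
        return e[1]
    x, y = eval_expr(env, e[1]), eval_expr(env, e[2])
    return x + y if tag == "add" else x * y

def grad(env, e):
    """Gradient of e : R^n -> R with respect to every input variable."""
    acc = [0.0] * len(env)

    def go(e, adj):  # push the adjoint backwards through e
        tag = e[0]
        if tag == "var":
            acc[e[1]] += adj
        elif tag == "add":
            go(e[1], adj)
            go(e[2], adj)
        elif tag == "mul":  # product rule, transposed
            go(e[1], adj * eval_expr(env, e[2]))
            go(e[2], adj * eval_expr(env, e[1]))

    go(e, 1.0)  # seed the output adjoint with 1
    return acc
```

One backward pass accumulates every partial derivative at once, which is exactly the complexity win over forward mode when \(n\) is large.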

In addition to being really neat, this is actually a really nice problem to look at. People basically care about reverse mode AD because it goes fast, which means you get this lovely interplay between operational concerns and semantic models. A second bonus is that AD offers computer scientists a good reason to learn some differential geometry, which in turn is a good excuse to learn general relativity.

6? Incremental Computation

After all this talk of differentiation, you might wonder -- what about finite differences?

It turns out that they have a nice interpretation in lambda calculus, too! In 2015, Yufei Cai, Paolo Giarrusso, Tillmann Rendel and Klaus Ostermann introduced the incremental lambda calculus.

The basic idea behind this is that for each type \(X\), you introduce a type of "changes" \(\Delta X\) for it too. A morphism between \((X, \Delta X)\) and \((Y, \Delta Y)\) is a pair of functions \(f : X \to Y\) and \(\mathit{df} : X \to \Delta X \to \Delta Y\), where \(\mathit{df}\) is the incrementalization -- \(\mathit{df}\,x\,\mathit{dx}\) tells you how to change the output of \(f\,x\) to account for the change \(\mathit{dx}\) to the input.

This looks a lot like what you get out of the tangent space functor, but it turns out that the incremental lambda calculus is not simply a derivative a la the differential lambda calculus, because the derivative of a function is unique, and these incrementalisations don't have to be. Michael Arntzenius and I used the ILC in our paper on extending seminaive evaluation from Datalog to a higher-order language, and it was critical for the efficiency of our result that we had this freedom.
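As a concrete sketch in Python (the integer change structure and all names are my own): a change structure pairs a type with a type of changes and an update operation, and an incrementalisation df must satisfy f(update(x, dx)) = update(f(x), df(x, dx)).

```python
# Changes to an integer are just deltas, with update = addition.
def update(x, dx):
    return x + dx

def f(x):
    return x * x

def df(x, dx):
    # the change to f's output: (x + dx)**2 - x**2, computed without
    # re-running f on the updated input from scratch
    return 2 * x * dx + dx * dx

# Correctness condition: f(update(x, dx)) == update(f(x), df(x, dx))
```

For this simple change structure the incrementalisation is essentially forced, but for richer types (changes to sets or relations, say) many valid choices of df exist, which is the freedom the seminaive-evaluation work exploits.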

Mario Alvarez-Picallo worked out answers to a lot of the semantic questions underlying this in his PhD thesis, Change actions: from incremental computation to discrete derivatives, showing how the incremental lambda calculus arises as a generalisation of the models of the differential lambda calculus.

He noted that the "real" semantics for change types should look like a category, in which a type is a category where the objects are values, and a morphism \(\delta : v_1 \to v_2\) is evidence that \(v_1\) can be changed into \(v_2\). On the grounds that 2-categories are painful, Mario worked with change monoids (i.e., no typing, but yes to identities and composition), and the Giarrusso-style change structures that Michael and I used retained the typed changes, but discarded all the categorical axioms. So there's more to be done here!

Monday, December 14, 2020

TypeFoundry: new ERC Consolidator Grant

I am very pleased to have received an ERC Consolidator Grant for my TypeFoundry proposal.

This will be a five year project to develop the foundations of bidirectional type inference. If you are interested in pursuing a PhD in this area, or conversely, are finishing a PhD in this area, please get in touch!

Many modern programming languages, whether developed in industry, like Rust or Java, or in academia, like Haskell or Scala, are typed. All the data in a program is classified by its type (e.g., as strings or integers), and at compile-time programs are checked for consistent usage of types, in a process called type checking. Thus, the expression 3 + 4 will be accepted, since the + operator takes two numbers as arguments, but the expression 3 + "hello" will be rejected, as it makes no sense to add a number and a string. Though this is a simple idea, sophisticated type systems can track properties like algorithmic complexity, data-race freedom, differential privacy, and data abstraction.

In general, programmers must annotate programs to tell compilers the types to check. In theoretical calculi, it is easy to demand enough annotations to trivialize typechecking, but this can make the annotation burden unbearable: often larger than the program itself! So, to transfer results from formal calculi to new programming languages, we need type inference algorithms, which reconstruct missing data from partially-annotated programs.

However, the practice of type inference has outpaced its theory. Compiler authors have implemented many type inference systems, but the algorithms are often ad-hoc or folklore, and the specifications they are meant to meet are informal or nonexistent. This makes it hard to learn how to implement type inference, hard to build alternative implementations (whether for new compilers or analysis engines for IDEs), and hard for programmers to predict if refactorings will preserve typability.

In TypeFoundry, we will use recent developments in proof theory and semantics (like polarized type theory and call-by-push-value) to identify the theoretical structure underpinning type inference, and use this theory to build a collection of techniques for type inference capable of scaling up to the advanced type system features in both modern and future languages.

One of the things that makes me happy about this (beyond the obvious benefits of research funding and the recognition from my peers) is that it shows off the international character of science. I'm an Indian-American researcher, working in the UK, and being judged and funded by researchers in Europe. There has been a truly worrying amount of chauvinism and jingoism in public life recently, and the reminder that cosmopolitanism and universalism are real things too is very welcome.

Also, if you are planning on submitting a Starting Grant or Consolidator proposal to the ERC in the coming year about programming languages, verification or the like, please feel free to get in touch, and I'll be happy to share advice.

Wednesday, December 2, 2020

Church Encodings, Inductive Types, and Relational Parametricity

My blogging has been limited this past year due to RSI, but I do not want to leave things entirely fallow, and last year I wrote an email which can be edited into a decent enough blog post.

Quite often, people will hear that System F, the polymorphic lambda calculus, satisfies a property called relational parametricity. We also often hear people say that the parametricity property of System F lets you prove that a Church encoding of an inductive datatype actually satisfies all the equational properties we expect of inductive types.

But it's surprisingly hard to find an explicit account of how you go from the basic properties of parametricity (the relational interpretation of System F and the identity extension principle) to the proof that inductive types are definable. I learned how this works from Bob Atkey's paper Relational Parametricity for Higher Kinds, but the proofs are kind of terse, since his page count was mostly focused on the new stuff he had invented.

In the sequel, I'm going to assume that you know what the relational model of System F looks like, that the term calculus satisfies the abstraction theorem (i.e., the fundamental theorem of logical relations), and also that the model satisfies the identity extension property -- if you take the relational interpretation of a type B(α), and fill in the type variable α with the equality relation for the type A, then the relational interpretation of B[A/α] will be the equality relation for the type B[A/α]. In what follows I will often write Id[X] to mean the equality relation for the type X.

For completeness' sake, I also give a quick summary of the model at the end of the post.

Haskell-style functors in System F

Given all this preface, we now define a functor F as a type expression F(-) with a hole, along with an operation fmap : ∀a,b. (a → b) → F(a) → F(b), such that

fmap _ _ id = id
fmap _ _ f ∘ fmap _ _ g = fmap _ _ (f ∘ g)

I've written _ to indicate System F type arguments I'm suppressing. Basically think of these as Haskell-style functors.

Next, we can define the inductive type μF using the usual Church encoding as:

μF = ∀a. (F(a) → a) → a

foldF : ∀a. (F(a) → a) → μF → a
foldF = Λa. λf : F(a) → a. λx:μF. x [a] f

inF : F(μF) → μF
inF x = Λa. λf : F(a) → a. 
          let g : μF → a       = foldF [a] f in
          let h : F(μF) → F(a) = fmap [μF] [a] g in
          let v : F(a)          = h x in
          f v

I've written out inF with type-annotated local bindings to make it easier to read. If you inlined all the local bindings and suppressed type annotations, then it would read:

inF : F(μF) → μF
inF x = Λa. λf : F(a) → a. 
          f (fmap (foldF f) x)
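For concreteness, the same construction can be transcribed into Python, specialised to the list functor (the tuple encoding and all names are my own). An element of μF is just a function awaiting an F-algebra:

```python
# ListF X A = 1 + X * A, encoded as ("nil",) and ("cons", x, a).
def fmap_listf(g, v):
    """The functorial action of ListF on its second argument."""
    return v if v[0] == "nil" else ("cons", v[1], g(v[2]))

def fold(alg, m):
    # foldF = /\a. \f. \x. x [a] f
    return m(alg)

def in_f(v):
    # inF x = /\a. \f. f (fmap (foldF f) x), as in the inlined version
    return lambda alg: alg(fmap_listf(lambda m: fold(alg, m), v))

def to_list(m):
    """Observe the encoding by folding with the ordinary list algebra."""
    return fold(lambda v: [] if v[0] == "nil" else [v[1]] + v[2], m)
```

Python has no rank-2 types, so nothing here enforces that a Church-encoded value uses its algebra parametrically; that enforcement is exactly what the System F proofs below rely on.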

F-algebra Homomorphisms and the Graph Lemma

Now, we can prove a lemma about F-algebra homomorphisms, which is the tool we will use to prove initiality. But what are F-algebra homomorphisms?

An F-algebra is a pair (X, g : F(X) → X). An F-algebra homomorphism between two F-algebras (X, k : F(X) → X) and (Y, k' : F(Y) → Y) is a function f : X → Y such that the following ASCII-art diagram commutes:

 F(X) — k  → X
  |           |
 fmap X Y f   f
  ↓           ↓
 F(Y) — k' → Y

That is, for all u : F(X), we want f(k u) = k'(fmap [X] [Y] f u)

Before we prove something about homomorphisms, we need to prove a technical lemma called the graph lemma, which will let us connect parametricity with our functorial definitions.

Graph Lemma: Let (F, fmap) be a functor, and h : A → B be a function. Define the graph relation <h> to be the relation {(a,b) | b = h(a) }. Then F<h> ⊆ <fmap [A] [B] h>.

Proof:

  1. First, note that fmap has the type ∀a b. (a → b) → F(a) → F(b).
  2. Since fmap is parametric, we can instantiate it at the relations <h> and Id[B], giving us:

    (fmap [A] [B], fmap [B] [B]) ∈ (<h> → Id[B]) → (F<h> → F Id[B])
  3. By identity extension, F Id[B] = Id[F(B)].
  4. Note that (h, id) ∈ <h> → Id[B].
  5. Hence (fmap [A] [B] h, fmap [B] [B] id) ∈ F<h> → Id[F(B)]
  6. By functoriality, (fmap [A] [B] h, id) ∈ F<h> → Id[F(B)]
  7. Assume (x, y) ∈ F<h>.
    1. Hence (fmap [A] [B] h x, id y) ∈ Id[F(B)]
    2. So fmap [A] [B] h x = y.
    3. Therefore (x, y) ∈ <fmap [A] [B] h>.
  8. Therefore F<h> ⊆ <fmap [A] [B] h>.

Graph relations are just functions viewed as relations, and the graph lemma tells us that the relational semantics of a type constructor applied to a graph relation <h> will behave like the implementation of the fmap term. In other words, it connects the relational semantics to the code implementing functoriality. (As an aside, this feels like it should be an equality, but I only see how to prove the inclusion.)

We use this lemma in the proof of the homomorphism lemma, which we state below:

Homomorphism Lemma: Let (F, fmap) be a functor. Given two F-algebras (A, k : F(A) → A) and (B, k' : F(B) → B), and an F-algebra homomorphism h : (A,k) → (B,k'), then for all e : μF, we have

e [B] k' = h (e [A] k)

Proof:

  1. First, by the parametricity of e, we know that

     (e [A], e [B]) ∈ (F<h> → <h>) → <h>
  2. We want to apply the arguments (k, k') to (e [A], e [B]), so we have to show that (k, k') ∈ F<h> → <h>.
  3. To show this, assume that (x, y) ∈ F<h>.

    1. Now we have to show that (k x, k' y) ∈ <h>.
    2. Unfolding the definition of <h>, we need (h (k x), k' y) ∈ Id[B].
    3. Since h is an F-algebra homomorphism, we have that h (k x) = k' (fmap [A] [B] h x).
    4. So we need to show (k' (fmap [A] [B] h x), k' y) ∈ Id[B].
    5. Now, we know that (x, y) ∈ F<h>.
    6. By the graph lemma, we know F<h> ⊆ <fmap [A] [B] h>.
    7. So (x, y) ∈ <fmap [A] [B] h>.
    8. Unfolding the definition, we know y = fmap [A] [B] h x.
    9. So we want (k' (fmap [A] [B] h x), k' (fmap [A] [B] h x)) ∈ Id[B].
    10. Since Id[B] is an equality relation, this is immediate.
  4. Hence (k, k') ∈ F<h> → <h>.
  5. Therefore (e [A] k, e [B] k') ∈ <h>.
  6. Unfolding the definition of <h>, we know e [B] k' = h(e [A] k). This is what we wanted to show.

Discussion:

The whole machinery of F-algebra homomorphisms basically exists to phrase the commuting conversions in a nice way. We just proved that for e : μF, we have

 e [B] k' = h (e [A] k)

Recall that for Church encoded inductive types, the fold is basically the identity, so this result is equivalent (up to beta) to:

 foldF [B] k' e = h (foldF [A] k e)

So this lets us shift contexts out of iterators when they are F-algebra homomorphisms. Note that this proof was also the one where the graph lemma is actually used.

The Beta and Eta Rules for Inductive Types

Next, we'll prove the beta- and eta-rules for inductive types. I'll do it in three steps:

  • The beta rule (eg, let (x,y) = (e1, e2) in e' == [e1/x, e2/y]e'),
  • The basic eta rule (eg, let (x,y) = e in (x,y) == e)
  • The commuting conversions (eg, C[e] = let (x,y) = e in C[(x,y)])

Theorem (beta rule): If k : F(A) → A and e : F(μF), then

 foldF [A] k (inF e) = k (fmap [μF] [A] (foldF [A] k) e)

Proof: This follows by unfolding the definitions and beta-reducing. (You don't even need parametricity for this part!)

This shows that for any F-algebra (A, k), foldF [A] k is an F-algebra homomorphism from (μF, inF) to (A, k).

Theorem (basic eta): For all e : μF, we have e = e [μF] inF

Proof:

  1. Assume an arbitrary B and k : F(B) → B.
    1. Note h = foldF [B] k is an F-algebra homomorphism from (μF, inF) to (B, k) by the beta rule.
    2. Then by our lemma, e [B] k = h (e [μF] inF).
    3. Unfolding, h (e [μF] inF) = foldF [B] k (e [μF] inF).
    4. Unfolding, foldF [B] k (e [μF] inF) = e [μF] inF [B] k.
    5. So e [B] k = e [μF] inF [B] k.
  2. So for all B and k : F(B) → B, we have e [B] k = e [μF] inF [B] k.
  3. By extensionality, e = e [μF] inF.

Theorem (commuting conversions): If k : F(A) → A and f : μF → A and f is an F-algebra homomorphism from (μF, inF) to (A, k), then f = foldF [A] k.

Proof:

  1. We want to show f = foldF [A] k.
  2. Unfolding foldF, we want to show f = λx:μF. x [A] k
  3. Assume e : μF.
    1. Now we will show f e = e [A] k.
    2. By the homomorphism lemma, and the fact that f : (μF, inF) → (A, k) we know that e [A] k = f (e [μF] inF).
    3. By the basic eta rule, e [A] k = f e
  4. So for all e, we have f e = e [A] k.
  5. Hence by extensionality f = λx:μF. x [A] k.

Note that what I (as a type theorist) called "commuting conversions" is exactly the same as "initiality" (to a category theorist), so now we have shown that the Church encoding of a functor actually lets us define the initial algebra.

Thus, we know that inductive types are definable!

Appendix: the Relational Model of System F

In this appendix, I'll quickly sketch one particular model of System F, basically inspired by the PER model of Longo and Moggi. What we will do is to define types as partial equivalence relations (PERs for short) on terms. Recall that a partial equivalence relation A on a set X is a relation on X which is (a) symmetric (i.e., if (x, x') ∈ A then (x', x) ∈ A), and (b) transitive (if (x, y) ∈ A and (y, z) ∈ A then (x, z) ∈ A).

We take the set of semantic types to be the set of PERs on terms which are closed under evaluation in either direction:

Type = { X ∈ PER(Term) | ∀x, x', y, y'. 
                           if x ↔∗ x' and y ↔∗ y' 
                           then (x,y) ∈ X ⇔ (x',y') ∈ X}

PERs form a complete lattice, with the meet given by intersection, which is the property which lets us interpret the universal quantifier of System F. We can then take a type environment θ to be a map from variables to semantic types, and interpret types as follows:

〚a〛θ ≜ θ(a)
〚A → B〛θ ≜ { (e, e') ∈ Term × Term | ∀ (t,t') ∈ 〚A〛θ. 
                                          (e t,e' t') ∈ 〚B〛θ }
〚∀a. A〛θ ≜ ⋂_{X ∈ Type} 〚A〛(θ, X/a) 

This is the PER model of System F, which is quite easy to work with but not quite strong enough to model parametricity. To model parametricity, we need a second semantics for System F, the so-called relational semantics.

A relation R between two PERs A and B over a set X is a relation on X which respects A and B: whenever (a, a') ∈ A and (b, b') ∈ B, we have (a, b) ∈ R if and only if (a', b') ∈ R.

Now, suppose we have two environments θ₁ and θ₂ sending type variables to types, and a relation environment ρ that sends each type variable a to a relation respecting θ₁(a) and θ₂(a). Then we can define the relational interpretation of System F types as follows:

R〚a〛θ₁ θ₂ ρ ≜ ρ(a)
R〚A → B〛θ₁ θ₂ ρ ≜ { (e, e') ∈ Term × Term | ∀ (t,t') ∈ R〚A〛θ₁ θ₂ ρ. 
                                          (e t,e' t') ∈ R〚B〛θ₁ θ₂ ρ }
R〚∀a. A〛θ₁ θ₂ ρ ≜ ⋂_{X, Y ∈ Type, R ∈ Rel(X,Y)} 
                  R〚A〛(θ₁, X/a) (θ₂, Y/a) (ρ, R/a)

Additionally, we need to redefine the PER model's interpretation of the ∀a.A case, to use the relational model:

〚∀a. A〛θ ≜ ⋂_{X, Y ∈ Type, R ∈ Rel(X,Y)} 
                 R〚A〛(θ, X/a) (θ, Y/a) (Id[θ], R/a)

Here, to give a PER interpretation for the forall type, we use the relational interpretation, duplicating the type environment and using the identity relation for each variable (written Id[θ]). By the identity relation, we mean that any PER A can itself be viewed as a relation between A and A, and this relation respects A.

This modified definition ensures the model satisfies the identity extension property -- if you take the relational interpretation of a type B(α), and fill in the type variable α with the equality relation for the type A, then the relational interpretation of B[A/α] will be the equality relation for the type B[A/α].

The term calculus for System F also satisfies the abstraction property (a.k.a. the fundamental property of logical relations), which says that given a well-typed term Θ; · ⊢ e : A, two type environments θ₁ and θ₂, and a relation environment ρ between them, e is related to itself in R〚A〛θ₁ θ₂ ρ.

Friday, June 19, 2020

PLDI 2020 Conference Report

I just finished "attending" PLDI 2020, a programming languages conference. Like many conferences in computer science, due to COVID-19, on short notice the physical conference had to be cancelled and replaced with an online virtual conference. Talks were streamed to Youtube, questions were posted to a custom Slack channel, there was a startling variety of online chat applications to hold discussions with, and so on.

It was obviously an enormous amount of work to put together (there was even a custom chat application, Clowdr, written specifically for SIGPLAN(?) conferences), which makes me feel very unkind to report that for me, the online conference experience was a complete waste of time.

So, what went wrong?

To understand this, it is worth thinking about what the purpose of a conference is. The fundamental purpose of a conference is to meet people with shared interests and talk to them. The act of talking to people is fundamental, since it is how (a) you get to see new perspectives about subjects you are interested in, and (b) you build the social networks that make it possible for you to become and remain an expert. (Recall the 19th-century economist Alfred Marshall's description of this process: "The mysteries of the trade become no mysteries; but are as it were in the air, and children learn many of them unconsciously.")

Even talks -- the things we build conferences around -- exist to facilitate this process of conversation. Talks at conferences really have two important functions: first, the topic of a talk is a filtering device, which helps identify the subset of the audience which is interested in this topic. This means that in the break after the talk, it is now much easier to find people to talk to who share interests with you.

Second, talks supply Schelling-style focal points: you and your interlocutor have common knowledge that you are both interested in the topic of the session, and you both also know that you saw the talk, which gives you a subject of conversation. (Note: as a speaker, you do communicate with your audience, but indirectly, as your talk becomes the seed of a conversation between audience members, which they will use to develop their own understandings.)

The fundamental problem with PLDI 2020 was that it was frustratingly hard to actually talk to people. The proliferation of timezones and chat applications meant that I found it incredibly difficult to actually find someone to talk to -- at least one of the apps (gather.town) just never worked for me, sending me into an infinite sign-up loop, and the others were either totally empty of people who were actually there, or did not seem suited for conversation. (One key issue is that many people log in to an app, and then stop paying attention to it, which makes presence indications useless.)

So if social distancing and travel restrictions due to COVID-19 remain in force (as seems likely), I think it would be better to simply convert our PL conferences fully into journals, and look for other ways to make online networking happen. The sheer amount of labour going into PLDI 2020 supplies strong evidence that simply trying to replicate a physical conference online cannot be made to work with any amount of effort.

However, before I convince everyone that I am totally right about everything, there are still online conferences that will happen -- for example, ICFP in August. So to make sure that I can look forward to the experience rather than dreading it, I will need to make a plan to actually make sure I talk to people.

So, if you are reading this, and are (or know) a researcher in PL who would like to talk, then give me a ping and we can arrange something for ICFP.

  • I am explicitly happy to talk to graduate students and postdocs about their research.

  • I know the most about dependent types, linear and modal types, type inference, separation logic, and program verification.

  • I know less about (but still like talking about) compiler stuff, theorem proving, SMT, Datalog, macros, and parsing.

  • This list is not an exhaustive list of things I'm interested in. One of the nice things about conversations is that you can use shared touchstones to discover how to like new things.

Once you get in touch, I'll put you on a list, and then once the conference talks are listed, we can organize a time to have a chat after we have all seen a talk we are interested in. (Having seen a talk means we will all have shared common knowledge fresh in our minds.) Assuming enough people are interested, I will aim for meetings of 3-5 people, including myself.

Thursday, February 13, 2020

Thought Experiment: An Introductory Compilers Class

Recently, I read a blog post in which Ben Karel summarized the reaction to a request John Regehr made about how to teach compilers, and as one might predict, the Internet was in agreement that the answer was "absolutely everything". Basically, everyone has a different perspective on what the most important thing is, and so the union of everyone's most important thing is everything.

In fact, from a certain point of view, I don't disagree. From parsing to typechecking to intermediate representations to dataflow analysis to register allocation to instruction selection to data representations to stack layouts to garbage collectors to ABIs to debuggers, basically everything in compilers is packed to the gills with absolutely gorgeous algorithms and proofs. The sheer density of wonderful stuff in compilers is just out of this world.

However, John specified an introductory course. Alas, this makes "everything" the wrong answer -- he is asking for a pedagogy-first answer which fits coherently into a smallish number of lectures. This means that we have to start with the question of what we want the students to learn, and then build the curriculum around that.

The Goal of Teaching Compilers

So, what should students have learned after taking a compilers class?

The obvious (but wrong) answer is "how to write a compiler", because in all likelihood they will forget most of the techniques for implementing compilers soon after the course ends. This is because most of them – unless they catch the compiler bug – will not write very many more compilers. So if we want them to come away from a compilers class with some lasting knowledge, we have to think in terms of deeper programming techniques which (a) arise naturally when writing compilers, but (b) apply in more contexts than just compilers.

There are two obvious candidates for such techniques:

  1. Computing values by recursion over inductively-defined structures.
  2. Computing least fixed points of monotone functions by bottom-up iteration.
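To fix ideas, the second technique can be sketched generically. This is my own minimal OCaml sketch, not course material; the names (`lfp`, `reachable_from`) and the toy graph are invented for illustration:

```ocaml
(* Iterate a monotone function f upward from a bottom element until it
   stops changing.  Termination relies on the underlying lattice having
   no infinite ascending chains. *)
let rec lfp (equal : 'a -> 'a -> bool) (f : 'a -> 'a) (bottom : 'a) : 'a =
  let next = f bottom in
  if equal bottom next then bottom else lfp equal f next

(* Example: reachability in a tiny graph, as a least fixed point over
   sets of nodes. *)
module IntSet = Set.Make (Int)

let successors = function
  | 0 -> [1]
  | 1 -> [2]
  | _ -> []

let reachable_from start =
  let step s =
    (* Add the successors of every currently-reachable node. *)
    IntSet.fold
      (fun n acc ->
        List.fold_left (fun a m -> IntSet.add m a) acc (successors n))
      s s
  in
  lfp IntSet.equal step (IntSet.singleton start)

let () = assert (IntSet.elements (reachable_from 0) = [0; 1; 2])
```

Every fixed point computation in the course fits this shape; only the lattice and the monotone function change.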

Both of these are fundamental algorithmic techniques. However, I think that recursion over inductive structure is much more likely to be taught outside of a compilers course, especially considering that modern functional programming (i.e., datatypes and pattern matching) is now much more often taught elsewhere in the curriculum than it used to be.

This leaves fixed-point computations.

So let's choose the goal of an intro compilers class to be teaching fixed point computations many times over, in many different contexts, with the idea that the general technique of fixed point computation can be learned via generalizing from many specific examples. The fact that they are building a compiler is significant primarily because it means that the different examples can all be seen in a motivating context.

So, this drives some choices for the design of the course.

  1. Our goal is to show how fixed point computations arise in many different contexts, and the stages of the compilation pipeline naturally supply a non-contrived collection of very different-looking contexts. Consequently, we should design the course in a full front-to-back style, covering all the phases of the compilation pipeline.

    However, this style of course has the real weakness that it can result in shallow coverage of each topic. Mitigating this weakness drives the design of the rest of the course.

  2. In particular, the compiler structure should be very opinionated.

    The thing we want to do is to present a variety of options (LR versus LL parsing, graph colouring versus linear scan register allocation, CPS versus SSA for IRs, etc), and then present their strengths and weaknesses, and the engineering and design considerations which will lead you to favour one choice over the other.

    But for this course, we won't do any of that. As an instructor, we will simply choose the algorithm, and in particular choose the one that most benefits from fixed point computations.

    The existence of alternatives should be gestured at in lecture, but the student-facing explanation for why we are not exploring them is that for their first compiler we are aiming for good choices, biased much more towards implementation simplicity than gcc or LLVM would be. (This will be an easier pitch if there is an advanced compilers course, or if students have a final-year project where they can write a fancy compiler.)

  3. We will also need to bite the bullet and de-emphasize the aspects of compilation where fixed point computations are less relevant.

    Therefore, we will not cover runtime systems and data structure layouts in any great detail. This substantially affects the design of the language to compile -- in particular, we should not choose a language that has closures or objects. Furthermore, we will just tell students what stack layout to use, and memory management will be garbage collection via the Boehm gc.

Course/Compiler Structure

For the rest of this note, I'll call the language to compile Introl, because I always liked the Algol naming convention. Introl is a first-order functional language, basically what you get if you removed nested functions, lambda expressions, references, and exceptions from OCaml. I've attached a sketch of this language at the end of the post.

This language has two features which are "hard" -- polymorphism and datatypes. The reason for the inclusion of polymorphism is that it makes formulating type inference as a fixed point problem interesting, and the reason datatypes exist is because (a) match statements offer interesting control flow, and (b) they really show off what sparse conditional constant propagation can do.

So the course will then follow the top-down lexing to code generation approach that so many bemoan, but which (in the context of our goals) is actually totally appropriate.

Lexing and Parsing

For lexing, I would start with the usual regular expressions and NFAs thing, but then take a bit of a left turn. First, I would show them that state set sizes could explode, and then introduce them to Brüggemann-Klein and Derick Wood's deterministic regular expressions as a way of preventing this explosion.

The reason for this is that the conditions essentially check whether a regular expression is parseable by recursive descent without backtracking -- i.e., you calculate NULLABLE, FIRST and (a variant of) FOLLOW sets for the regular expression. This lets you explain what these sets mean in a context without recursion or fixed points, which makes it easy to transition to LL(1) grammars, which are fundamentally just deterministic regular languages plus recursion.

So then the calculation of these sets as a fixed point equation is very easy, and using the deterministic regular languages means that the explanation of what these sets mean can be decoupled from how to compute via a fixed point computation.

Naturally, this means that the grammar of Introl must be LL(1).
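As a sketch of what this fixed point computation looks like, here is NULLABLE computed by bottom-up iteration. The grammar representation and names are my own invention, not part of the course:

```ocaml
(* A nonterminal is NULLABLE if some production for it consists entirely
   of nullable symbols.  We iterate from the empty set until stable. *)
type symbol = T of char | N of string          (* terminal / nonterminal *)
type grammar = (string * symbol list) list     (* (lhs, rhs) productions *)

module S = Set.Make (String)

let nullable (g : grammar) : S.t =
  let step known =
    List.fold_left
      (fun acc (lhs, rhs) ->
        let all_nullable =
          List.for_all (function T _ -> false | N x -> S.mem x known) rhs
        in
        if all_nullable then S.add lhs acc else acc)
      known g
  in
  let rec iterate s =
    let s' = step s in
    if S.equal s s' then s else iterate s'
  in
  iterate S.empty

(* E -> A B;  A -> ε;  B -> 'b' | ε  : all of E, A, B are nullable,
   but E only becomes nullable on the second iteration. *)
let g = [ ("E", [N "A"; N "B"]); ("A", []); ("B", [T 'b']); ("B", []) ]
let () = assert (S.elements (nullable g) = ["A"; "B"; "E"])
```

FIRST and FOLLOW sets have exactly the same iterate-until-stable structure, just over a richer lattice (sets of terminals per nonterminal).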

Typechecking

Typechecking for this kind of language is pretty routine, but in this case it should be phrased as an abstract interpretation, in the style of Cousot's Types as Abstract Interpretation.

The interesting thing here is that polymorphism can be presented via what Cousot called "the Herbrand abstraction". The idea is that the abstract elements are monotypes with unification variables in them, with the partial order that t₂ ≥ t₁ if there is a substitution σ such that t₂ = σ(t₁), and the unification algorithm itself as an attempt to calculate the substitution witnessing the join of two terms. I say attempt, since unification can fail. So this is a kind of partial join operation -- in the case you cannot join the two terms, there must have been a type error!

In Introl as presented, top-level functions have a type annotation, and so it will work out that you end up not needing to do a serious fixed point computation to infer types. Indeed, even if you omitted annotations, the fact that unification calculates a most general unifier means that the fixed point iteration for recursive functions terminates in one iteration. (This should not be too surprising, since the Damas-Milner algorithm only needs one traversal of the syntax.)

This fact is worth working out, because a great deal of life in static analysis involves trying to find excuses to iterate less. Indeed, this is what motivates the move to SSA – which will be the very next phase of the compiler.
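To make the "partial join" reading concrete, here is a minimal sketch of first-order unification over monotypes. The `ty` type and helper names are mine, and a real inference engine would represent and compose substitutions more carefully:

```ocaml
(* Monotypes with unification variables.  Unification attempts to compute
   the substitution witnessing the join of two types; None means the join
   does not exist, i.e., a type error. *)
type ty =
  | Var of string
  | Int
  | Arrow of ty * ty

(* Apply a substitution (an association list) to a type. *)
let rec apply subst = function
  | Var x ->
      (match List.assoc_opt x subst with
       | Some t -> apply subst t
       | None -> Var x)
  | Int -> Int
  | Arrow (a, b) -> Arrow (apply subst a, apply subst b)

let rec occurs x = function
  | Var y -> x = y
  | Int -> false
  | Arrow (a, b) -> occurs x a || occurs x b

let rec unify t1 t2 =
  match t1, t2 with
  | Var x, t | t, Var x ->
      if t = Var x then Some []
      else if occurs x t then None               (* occurs check *)
      else Some [ (x, t) ]
  | Int, Int -> Some []
  | Arrow (a1, b1), Arrow (a2, b2) ->
      (match unify a1 a2 with
       | None -> None
       | Some s ->
           (match unify (apply s b1) (apply s b2) with
            | None -> None
            | Some s' -> Some (s' @ s)))
  | _ -> None                                    (* constructor clash *)

let () =
  (* (α → int) joins with (int → β) via [α ↦ int, β ↦ int]. *)
  match unify (Arrow (Var "a", Int)) (Arrow (Int, Var "b")) with
  | Some s -> assert (apply s (Var "a") = Int && apply s (Var "b") = Int)
  | None -> assert false
```

The failure case is the point: when `unify` returns `None`, the two abstract elements have no upper bound, and the analysis reports a type error.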

Conversion to SSA/CPS

The choice of IR is always a fraught one in a compiler class. In this course, I would use a static single assignment (SSA) representation. SSA is a good choice because (a) it simplifies implementing all kinds of dataflow optimizations, (b) generating it also needs dataflow analyses, and (c) it is the IR of choice for serious compilers. (a) and (b) mean we get to do lots of fixed point computations, and (c) means it won't feel artificial to today's yoof.

However, I wouldn't go directly to SSA, since I find φ-functions very difficult to explain directly. IMO, it is worth pulling a trick out of Richard Kelsey's hat, and exploiting the correspondence between continuation-passing style (CPS), and static single assignment form (SSA).

CPS often comes with all sorts of mysticism and higher-order functions, but in this case, it works out very simply. Basically, we can let-normalize the program, and then transform the surface language into basic blocks with arguments (or if you prefer, a bunch of functions tail-calling each other). This is the version of SSA that the MLton compiler used, and which (if memory serves) the Swift SIL intermediate representation uses as well.

Roughly speaking, each basic block in the program will end up having zero or more formal parameters, and jumps to that block carry arguments which fill them in. This ends up making it super easy to explain: the program just goes into let-normal form, all tail calls get compiled to jumps-with-arguments, and non-tail calls use a call IR instruction.
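A minimal sketch of what such an IR might look like as OCaml types; all the constructor and field names here are my own invention:

```ocaml
(* Blocks-with-arguments SSA: each block has formal parameters, and every
   jump supplies arguments for the target block's parameters. *)
type var = string
type label = string

type operand = Var of var | Const of int

type instr =
  | Add of var * operand * operand         (* x = a + b *)
  | Call of var * string * operand list    (* non-tail call: x = f(args) *)

type terminator =
  | Jump of label * operand list           (* tail call = jump with args *)
  | Branch of operand * label * operand list * label * operand list
  | Return of operand

type block = {
  name   : label;
  params : var list;                       (* the block's formal parameters *)
  body   : instr list;
  exit   : terminator;
}

type fn = { entry : label; blocks : block list }

(* A summing loop: the loop-carried values i and acc are just the
   block's parameters -- no φ-functions needed. *)
let loop_block = {
  name = "loop"; params = ["i"; "acc"];
  body = [ Add ("acc'", Var "acc", Var "i"); Add ("i'", Var "i", Const 1) ];
  exit = Branch (Var "i'", "loop", [Var "i'"; Var "acc'"],
                 "done", [Var "acc'"]);
}

let () = assert (List.length loop_block.params = 2)
```

The φ-function story falls out for free: a block parameter *is* a φ-node, with one argument per incoming jump.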

Now, if we do this maximally naively – which we absolutely want to do – we will get nearly pessimal SSA, since each basic block will take as arguments all the variables in its scope. The reason we do this is to make the need for SSA minimization obvious: we just want to shrink each block's parameter lists so it doesn't take parameters for variables which either don't vary or don't get used.

Doing this will take us to something close to "pruned SSA" form, which would normally be overkill for a first compiler, except that we want to use it to motivate the computation of the dominator tree. Because of the either-or nature of the shrinking, we can do this with two analyses, a liveness analysis and a constancy analysis.

Both of these are dataflow analyses, and we can show how to use the dominator tree to order the work so as to calculate the fixed point faster. This will justify doing the fixed point computation to calculate the dominance information we need.

There is a bit of redundancy here, which is deliberate – I think it's important to have done two analyses before calculating dominators, because doing one fixed point computation to speed up one fixed point computation may feel artificial. But doing one computation that speeds up two computations is easy to see the benefit of, especially if we phrase the fixed point computation as taking the transfer function plus the dominance information.
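As a sketch of the first of the two analyses, here is backwards liveness as a bottom-up fixed point over blocks. The per-block use/def/successor representation is my own; a real implementation would visit blocks in reverse postorder or use the dominator-based ordering discussed above:

```ocaml
(* live_in(b)  = use(b) ∪ (live_out(b) \ def(b))
   live_out(b) = ∪ live_in(s) over successors s of b
   Iterate until no live_in set changes. *)
module VS = Set.Make (String)

type blockinfo = {
  use  : VS.t;           (* variables read before any write in the block *)
  def  : VS.t;           (* variables written in the block *)
  succ : string list;    (* successor block names *)
}

let liveness (blocks : (string * blockinfo) list) =
  let live_in = Hashtbl.create 16 in
  List.iter (fun (n, _) -> Hashtbl.add live_in n VS.empty) blocks;
  let changed = ref true in
  while !changed do
    changed := false;
    List.iter
      (fun (n, b) ->
        let out =
          List.fold_left
            (fun acc s -> VS.union acc (Hashtbl.find live_in s))
            VS.empty b.succ
        in
        let inn = VS.union b.use (VS.diff out b.def) in
        if not (VS.equal inn (Hashtbl.find live_in n)) then begin
          Hashtbl.replace live_in n inn;
          changed := true
        end)
      blocks
  done;
  live_in

let () =
  (* entry defines x; exit uses it: x is live into exit, not into entry. *)
  let blocks = [
    ("entry", { use = VS.empty; def = VS.singleton "x"; succ = ["exit"] });
    ("exit",  { use = VS.singleton "x"; def = VS.empty; succ = [] });
  ] in
  let li = liveness blocks in
  assert (VS.elements (Hashtbl.find li "exit") = ["x"]);
  assert (VS.is_empty (Hashtbl.find li "entry"))
```

The constancy analysis has exactly the same loop shape, just with a different lattice and transfer function -- which is the parallel we want students to notice.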

I would avoid using one of the fast dominance algorithms, because pedagogically it's easier to explain the calculation when it is close to the definition.

The other nice thing here is that typechecking got accelerated by a good choice of abstract domain, and flow analysis gets accelerated by a good choice of iteration order.

Once the students have some experience manipulating this representation, I would probably switch to the traditional SSA form. The main benefit of this is just teaching the ability to read the existing literature more.

High-level Optimizations

One thing worth spelling out is that this language (like MLton's SSA) has a high-level SSA representation – switches, data constructors, and field selectors and things like that will all be part of the SSA IR. This makes it possible to do optimizations while thinking at the language level, rather than the machine level. And furthermore, since students have a decent SSA representation, we certainly should use it to do all the "easy" SSA optimizations, like copy propagation and loop-invariant code motion.

All of these are unconditional rewrites when in SSA form, so they are all easy to implement, and will get the students comfortable with manipulating the SSA representation. The next step is to implement dead code elimination, which is both a nice optimization, and also requires them to re-run liveness analysis. This will open the door to understanding that compilation passes may have to be done repeatedly.

Once the students have done that, they will be warmed up for the big "hero optimization" of the course, which will be sparse conditional constant propagation. Because we are working with a high-level IR, sparse conditional constant propagation ought to yield even more impressive results than usual, with lots of tuple creation/pattern matching and cases on constructors disappearing in a really satisfying way.

In particular, good examples ought to arise from erasing the None branches from code using the option monad for safe division safediv : int → int → option int, if there is a test that the divisors are nonzero dominating the divisions.
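As a sketch of that example (hypothetical helper names, and written in OCaml rather than Introl):

```ocaml
(* Division through the option monad, guarded by a dominating nonzero
   test.  Conditional constant propagation of the branch condition lets
   the None arm be erased, leaving a bare division. *)
let safediv (n : int) (d : int) : int option =
  if d = 0 then None else Some (n / d)

let average_rate total d =
  if d <> 0 then
    (match safediv total d with
     | Some q -> q
     | None -> 0)   (* dead: the d <> 0 test dominates the call *)
  else 0

let () = assert (average_rate 10 2 = 5)
let () = assert (average_rate 10 0 = 0)
```

After SCCP on the high-level IR, the `Some`/`None` case analysis inside the guarded branch should vanish entirely, along with the allocation of the option value.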

Lowering and Register Allocation

As I mentioned before, the big optimization was performed on a high-level SSA form: switch statements and data constructors and selectors are still present. Before we can generate machine code, they will have to be turned into lower-level operations. So we can define a "low-level" SSA, where the selections and so on are turned into memory operations, and then translate the high-level IR into the low-level IR.

This ought to be done pretty naively, with each high-level operation turning into its low-level instruction sequence in a pretty direct way. Since the course design was to do as many dataflow optimizations as we could, to bring out the generic structure, a good test of whether the development is smooth enough is whether the flow analyses are sufficiently parameterized that we can change the IR and lattices and still get decent code reuse. Then we can lean on these analyses to clean up the low-level instruction sequences.

If more abstract code is too hard to understand, I would skip most of the low-level optimizations. The only must-do analysis at the low-level representation is to re-run liveness analysis, so that we can do register allocation using the SSA-form register allocation algorithms of Hack. This (a) lets us avoid graph colouring, (b) while still getting good results, and (c) depends very obviously on the results of a dataflow analysis. It also makes register coalescing (normally an advanced topic) surprisingly easy.

Code generation is then pretty easy – once we've done register allocation, we can map the low-level IR operations to loads, stores, and jumps in the dirt-obvious way; calls and returns doing the obvious stack manipulations; and foreign calls to the Boehm gc to allocate objects. This is not optimal, but most of the techniques I know of in this space don't use fixed points much, and so are off-piste relative to the course aim.

Perspectives

The surprising (to me, anyway) thing is that the choice of focusing on fixed point computations plus the choice to go SSA-only, lays out a path to a surprisingly credible compiler. Before writing this note I would have thought this was definitely too ambitious a compiler for a first compiler! But now, I think it only may be too ambitious.

Obviously, we benefit a lot from working with a first-order functional language: in many ways, it is just an alternate notation for SSA. However, it looks different from SSA, and it looks like a high-level language, which means the students ought to feel like they have accomplished something.

But before I would be willing to try teaching it this way, I would want to program up a compiler or two in this style. This would either confirm that the compiler is tractable, or would show that the compiler is just too big to be teachable. If it is too big, I'd probably change the surface language to only support booleans and integers as types. Since these values all fit inside a word, we could drop the lowering pass altogether.

Another thing I like is that we are able to do a plain old dataflow analysis to parse, and then a dataflow analysis with a clever representation of facts when typechecking, and then dataflow analyses with a clever iteration scheme when doing optimization/codegen.

However, if this ends up being too cute and confusion-prone, another alternative would be to drop type inference by moving to a simply typed language, and then drop SSA in favour of the traditional Kildall/Allen dataflow approach. You could still teach the use of acceleration structures by computing def-use chains and using them for (e.g.) liveness. This would bring out the parallels between parsing and flow analysis.

Relative to an imaginary advanced compilers course, we're missing high-level optimizations like inlining, partial redundancy elimination, and fancy loop optimizations, as well as low-level things like instruction scheduling. We're also missing an angle for tackling control-flow analysis for higher-order language featuring objects or closures.

But since we were able to go SSA straight away, the students will be in a good position to learn this on their own.

Introl sketch

Type structure

The type language has

  • Integers int
  • Unit type unit
  • Tuples T_1 * ... * T_n
  • Sum types via datatype declarations:

    datatype list a = Nil | Cons of a * list a 

Program structure

Top level definitions

All function declarations are top level. Nested functions and anonymous lambda-expressions are not allowed. Function declarations have a type scheme, which may be polymorphic:

    val foo : forall a_1, ..., a_n. (T_1, ..., T_n) -> T 
    def foo(x_1, ..., x_n) = e 

Expressions

Expressions are basically the usual things:

  • variables x
  • constants 3
  • arithmetic e_1 + e_2
  • unit ()
  • tuples (e_1, ..., e_n)
  • data constructors Foo(e)
  • variable bindings let p = e_1; e_2
  • direct function calls f(e_1, ..., e_n), where f is a top-level function name
  • match statements match e { p_1 -> e_1 | ... p_n -> e_n }

Note the absence of control structures like while or for; we will use tail-recursion for that. Note also that there are no explicit type arguments in calls to polymorphic functions.

Patterns

Patterns exist, but nested patterns do not, in order to avoid having to deal with pattern compilation. They are:

  • variables x
  • constructor patterns Foo(x_1, ..., x_n)
  • tuple patterns (x_1, ..., x_n)

Example

Here's a function that reverses a list:

val reverse_acc : forall a. (list a, list a) -> list a 
def reverse_acc(xs, acc) = 
  match xs {
   | Nil -> acc
   | Cons(x, xs) -> reverse_acc(xs, Cons(x, acc))
  }

val reverse : forall a. list a -> list a 
def reverse(xs) = reverse_acc(xs, Nil)

Here's a function to zip two lists together:

val zip : forall a, b. (list a, list b) -> list (a * b)
def zip(as, bs) = 
  match as { 
   | Nil -> Nil
   | Cons(a, as') -> 
      match bs { 
       | Nil -> Nil
       | Cons(b, bs') -> Cons((a,b), zip(as',bs'))
      }
  }

Note the lack of nested patterns vis-a-vis ML.

Wednesday, October 30, 2019

Every Finite Automaton has a Corresponding Regular Expression

Today I'd like to blog about the nicest proof I know of Kleene's theorem that regular expressions and finite automata are equivalent. One direction of this proof -- that every regular expression has a corresponding finite automaton -- is very well-known to programmers, since Thompson's construction, which converts regular expressions to nondeterministic automata, and the subset construction, which turns nondeterministic automata into deterministic automata, are the basis of every lexer and regexp engine there is.

However, the proof of the other direction -- that every automaton has a corresponding regexp -- is often stated, but seldom proved. This is a shame, since it has an elegant formulation in terms of linear algebra(!).

The Construction

First, let's assume we have a Kleene algebra \((R, 1, \cdot, 0, +, \ast)\) -- for concreteness, the regular expressions. Now, define the category \(C\) as follows:

  • The objects of \(C\) are the natural numbers.
  • A morphism \(f : n \to m\) is an \(n \times m\) matrix of elements of \(R\). We will write \(f_{(i,j)}\) to get the element of \(R\) in the \(i\)-th row and \(j\)-th column.
  • The identity morphism \(id : n \to n\) is the \(n \times n\) matrix which is \(1\) on the diagonal and \(0\) off the diagonal
  • The composition \(f;g\) of \(f : n \to m\) and \(g : m \to o\) is given by matrix multiplication:

\[ (f;g)_{i, j} = \sum_{k \in 0\ldots m} f_{(i,k)} \cdot g_{(k,j)} \]

This may seem like a rather peculiar thing to do -- matrices filled with regular expressions? However, to understand why this is potentially a sensible idea, let's look at a generic automaton of dimension 2.

[Diagram: a two-state automaton with states 0 and 1; a self-loop labelled a on state 0, an edge labelled b from 0 to 1, an edge labelled c from 1 to 0, and a self-loop labelled d on state 1.]

Given such a picture, we can give the transition matrix for this as follows:

\[ A = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & a & b \\ 1 & c & d \\ \end{array} \]

So the \((i,j)\)-th entry tells us the label of the edge from node \(i\) to node \(j\). In our category, this is an endomap \(A : 2 \to 2\). What happens if we compose \(A\) with itself? Since composition is just matrix multiplication, we get:

\[ \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] \cdot \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] = \left[\begin{array}{cc} aa + bc & ab + bd \\ ca + dc & cb + dd \end{array}\right] \]

Notice that the \((i,j)\)-th entry is a regular expression describing the paths of length two! So composition of arrows corresponds to path composition. The identity matrix gives the paths of length 0, the transition matrix gives the paths of length 1, and matrix multiplication gives path concatenation. So it's easy to get a regexp for any bounded-length path. Indeed, it is easy to see that the endomaps of \(n\) form a semiring.

So the endomaps of \(n\) are \(n \times n\) matrices, and we can turn them into a semiring \((S, 1_S, \cdot_S, 0_S, +_S)\) as follows:

  • The carrier \(S\) is \(\mathrm{Hom}(n,n)\), the set of \(n \times n\) matrices over \(R\).
  • The unit \(1_S\) is the identity map \(\mathrm{id} : n \to n\)
  • The multiplication \(\cdot_S\) is composition of morphisms (i.e., matrix multiplication). Since these are endomorphisms (ie, square matrices), the result of composition (i.e., multiplication of square matrices) is another endomorphism (ie, square matrix of the same dimension).
  • The zero \(0_S\) is the matrix with all 0 entries.
  • The addition \(f + g\) is the pointwise addition of matrices -- i.e., \((f + g)_{i, j} = f_{i,j} + g_{i,j}\)

Now we can ask: can we equip endomaps with a Kleene star operation as well? Recall that the defining equation for the Kleene star is

\[ s\ast = 1 + s\cdot s\ast \]

which says that we are looking for a reflexive-transitive closure operation. That is, \(s\ast\) consists of the length-0 paths, plus anything in \(s\) followed by anything in \(s\ast\). So we want a definition of the Kleene star which satisfies this equation. Let's consider the case of the size 2 automaton again:

[Diagram: the same two-state automaton; a self-loop labelled a on state 0, an edge labelled b from 0 to 1, an edge labelled c from 1 to 0, and a self-loop labelled d on state 1.]

Now, let's think about the loop-free paths between 0 and 0. There are two classes of them:

  1. The single step \(a\).
  2. We can take a step from 0 to 1 via \(b\), take some number of steps from 1 to itself via \(d\), and then back to 0 via \(c\).

So we can describe the loop-free paths from 0 to 0 via the regular expression

\[ F \triangleq a + b(d\ast)c \]

Similarly, the loop-free paths from 1 to 1 are either:

  1. The single step \(d\).
  2. We can take a step from 1 to 0 via \(c\), take some number of steps from 1 to itself via \(a\), and then back to 1 via \(b\).

So these paths are described by \[ G \triangleq d + c(a\ast)b \]

So the set of all paths from 0 to 0 is \(F\ast\), and the set of all paths from \(1\) to \(1\) is \(G\ast\). This lets us write down the star of the whole transition matrix:

\[ A\ast = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & F\ast & {F\ast} \cdot b \cdot {G\ast} \\ 1 & {G\ast} \cdot c \cdot {F\ast} & G\ast \\ \end{array} \]

Proving that this is a correct definition requires a bit of work, so I will skip over it. That's because I want to get to the cool bit, which is that this construction is enough to define the Kleene star for any dimension \(n\)! We'll prove this by well-founded induction. First, suppose we have an \(n \times n\) transition matrix \(T\) for \(n > 2\). Then, we can divide the matrix into blocks, with \(A\) and \(D\) as square submatrices:

\[ T = \left[ \begin{array}{c|c} A & B \\\hline C & D \end{array} \right] \]

and then define \(T\ast\) as:

\[ T\ast = \left[ \begin{array}{c|c} F\ast & {F\ast} \cdot B \cdot {G\ast} \\ \hline {G\ast} \cdot C \cdot {F\ast} & G\ast \\ \end{array} \right] \]

where

\[ \begin{array}{lcl} F & \triangleq & A + B(D\ast)C \\ G & \triangleq & D + C(A\ast)B \\ \end{array} \]

By induction, we know that the square matrices of dimension smaller than \(n\) form a Kleene algebra, so we know that \(A\) and \(D\) have Kleene algebra structure. The reason for defining the categorical structure also becomes apparent here -- even though \(A\) and \(D\) are square, \(B\) and \(C\) may not be, and we will need to make use of the fact that we defined composition on non-square matrices for them.

References and Miscellany

I have, as is traditional for bloggers, failed to prove any of my assertions. Luckily other people have done the work! Two papers that I particularly like are:

One fact which didn't fit into the narrative above, but which I think is extremely neat, is about the linear-algebraic view of the Kleene star. To see what it is, recall the definition of the matrix exponential for a matrix \(A\):

\[ e^A \triangleq \sum_{k = 0}^\infty \frac{A^k}{k!} = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \ldots \]

Now, note that

  1. \(k!\) is an integer
  2. any integer can be expressed as a formal sum \(n = \overbrace{1 + \ldots + 1}^{n \,\mbox{times}}\),
  3. and in regular languages, we have \(r + r = r\).

Hence \(k! = 1\), and since \(A/1 = A\), we have

\[ e^A \triangleq \sum_{k = 0}^\infty A^k = I + A + A^2 + A^3 + \ldots \]

But this is precisely the definition of \(A\ast\)! In other words, for square matrices over Kleene algebras, Kleene iteration is precisely the same thing as the matrix exponential.

The Implementation

The nice thing about this construction is that you can turn it into running code more or less directly, which leads to an implementation of the DFA-to-regexp algorithm in less than 100 lines of code. I'll give an implementation in OCaml.

First, let's give a signature for Kleene algebras. All the constructions we talked about are parameterized in the Kleene algebra, so our implementation will be too.

module type KLEENE = sig
  type t 
  val zero : t 
  val ( ++ ) : t -> t -> t 
  val one : t
  val ( * ) : t -> t -> t 
  val star : t -> t 
end

You can see here that we have zero and one as the unit for the addition ++ and multiplication * respectively, and we also have a star operation representing the Kleene star.
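For concreteness, here is a minimal instance of this signature: regular expressions themselves, represented as a plain syntax tree with no simplification. The constructor names are mine, and the signature is repeated so the snippet stands on its own:

```ocaml
(* Repeating the KLEENE signature so this snippet compiles by itself. *)
module type KLEENE = sig
  type t
  val zero : t
  val ( ++ ) : t -> t -> t
  val one : t
  val ( * ) : t -> t -> t
  val star : t -> t
end

(* The free instance: regular expression syntax, no simplification. *)
module Regex = struct
  type t =
    | Zero
    | One
    | Char of char
    | Sum of t * t
    | Seq of t * t
    | Star of t

  let zero = Zero
  let one = One
  let ( ++ ) a b = Sum (a, b)
  let ( * ) a b = Seq (a, b)
  let star a = Star a
end

(* Check that Regex really satisfies the signature. *)
module _ : KLEENE = Regex

let () =
  let open Regex in
  assert (star (Char 'a' ++ Char 'b') = Star (Sum (Char 'a', Char 'b')))
```

Plugging this module into the matrix functor below is what turns the DFA-to-regexp construction into running code: the entries of the transition matrix are `Regex.t` values.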

Similarly, we can give a signature for matrices over an element type.

module type MATRIX = sig
  type elt
  type t 

  val dom : t -> int
  val cod : t -> int
  val get : t -> int -> int -> elt

  val id : int -> t 
  val compose : t -> t -> t 

  val zero : int -> int -> t 
  val (++) : t -> t -> t 

  val make : int -> int -> (int -> int -> elt) -> t 
end

So this signature requires an element type elt, a matrix type t, and a collection of operations. We can get the number of rows in the matrix with dom and the columns with cod, with the names coming from thinking of dimensions and matrices as forming a category. Similarly this gives us a notion of identity with id, which takes an integer and returns the identity matrix of that size. Matrix multiplication is just composition, which leads to the name compose.

We also want the ability to create zero matrices with the zero function, and the ability to do pointwise addition of matrices with the (++) operation. Neither of these operations need square matrices, but to add matrices pointwise requires compatible dimensions.

Finally we want a way to construct a matrix with make, which is given the dimensions and a function to produce the elements.

Now, since the matrix construction was parameterized by a Kleene algebra, we can write a functor Matrix in Ocaml which takes a module of type KLEENE to produce an implementation of the MATRIX signature with elements of the Kleene algebra as the element type:

module Matrix(R : KLEENE) : MATRIX with type elt = R.t = 
struct
  type elt = R.t 
  type t = {
      row : int; 
      col: int; 
      data : int -> int -> R.t
    }

We take the element type to be elements of the Kleene algebra, and take a matrix to be a record with the number of rows, the number of columns, and a function to produce the entries. This is not a high-performance choice, but it works fine for illustration purposes.

  let make row col data = {row; col; data}
  let dom m = m.row
  let cod m = m.col

  let get m i j = 
      assert (0 <= i && i < m.row);
      assert (0 <= j && j < m.col);
      m.data i j 

To make a matrix, we just use the data from the constructor to build the appropriate record, and to get the domain and codomain we just select the row or column fields. To get an element we just call the underlying function, with an assertion check to ensure the arguments are in bounds.

This is a point where a more dependently-typed language would be nice: if the dimensions of the matrix were tracked in the type then we could omit the assertion checks. (Basically, the matrix type would become a type constructor t : nat -> nat -> type, corresponding to the hom-set.)

  let id n = make n n (fun i j -> if i = j then R.one else R.zero)

  let compose f g = 
    assert (cod f = dom g);
    let data i j = 
      let rec loop k acc = 
        if k < cod f then
          loop (k+1) R.(acc ++ (get f i k * get g k j))
        else
          acc
      in
      loop 0 R.zero
    in
    make (dom f) (cod g) data

The id function takes an integer, and returns a square matrix that is R.one on the diagonal and R.zero off the diagonal. To compose two matrices, the routine checks that the dimensions match, and then implements a very naive row-vector times column-vector calculation for each index. What makes this naive is that the dot product gets recalculated on every lookup!
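If that recomputation mattered, one fix would be to materialize a matrix's entries. Here is a self-contained sketch of such a memo function (using a bare record of the same shape as the one inside the functor, rather than the functor itself):

```ocaml
(* A sketch of memoization for function-backed matrices: tabulate the
   entries once into an array, so repeated lookups do no recomputation.
   The record type mirrors the one inside the Matrix functor. *)
type 'a mat = { row : int; col : int; data : int -> int -> 'a }

let memo m =
  let table =
    Array.init m.row (fun i -> Array.init m.col (fun j -> m.data i j))
  in
  { m with data = (fun i j -> table.(i).(j)) }
```

After a memo, each entry of a composed matrix is computed exactly once, at the cost of eagerly filling the whole table.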

  let zero row col = make row col (fun i j -> R.zero)

  let (++) m1 m2 = 
    assert (dom m1 = dom m2);
    assert (cod m1 = cod m2);
    let data i j = R.(++) (get m1 i j) (get m2 i j) in
    make (dom m1) (cod m1) data 
end                           

Finally we can implement the zero matrix operation, and matrix addition, both of which work the expected way.

We will use all these operations to show that square matrices form a Kleene algebra. To do so, we define a functor Square, which is parameterized by:

  1. A Kleene algebra R.
  2. An implementation of matrices with element type R.t.
  3. A size (i.e., an integer).

module Square(R : KLEENE)
             (M : MATRIX with type elt = R.t)
             (Size : sig val size : int end) : KLEENE with type t = M.t = 
struct
  type t = M.t 

The implementation type is the matrix type M.t, and the semiring operations are basically inherited from the module M:

  let (++) = M.(++)
  let ( * ) = M.compose
  let one = M.id Size.size
  let zero = M.zero Size.size Size.size

The only changes are that one and zero are restricted to square matrices of size Size.size, and so we don't need to pass them dimension arguments.

All of the real work is handled in the implementation of the Kleene star. We can follow the construction very directly. First, we can define a split operation which splits a square matrix into four smaller pieces:

  let split i m = 
    let open M in 
    assert (i >= 0 && i < dom m);
    let k = dom m - i in 
    let shift m x y = fun i j -> get m (i + x) (j + y) in 
    let a = make i i (shift m 0 0) in
    let b = make i k (shift m 0 i) in
    let c = make k i (shift m i 0) in
    let d = make k k (shift m i i) in 
    (a, b, c, d)

This divides a matrix into four quadrants at the split point i, by making new matrices that index into the old matrix's data with a suitable offset. So b's x-th row and y-th column will be the x-th row and (y+i)-th column of the original, for example.

Symmetrically, we need a merge operation.

  let merge (a, b, c, d) = 
    assert (M.dom a = M.dom b && M.cod a = M.cod c);
    assert (M.dom d = M.dom c && M.cod d = M.cod b);
    let get i j = 
      match i < M.dom a, j < M.cod a with
      | true, true   -> M.get a i j 
      | true, false  -> M.get b i (j - M.cod a)
      | false, true  -> M.get c (i - M.dom a) j 
      | false, false -> M.get d (i - M.dom a) (j - M.cod a)
    in
    M.make (M.dom a + M.dom d) (M.cod a + M.cod d) get 

To merge four matrices, their dimensions need to match up properly, and then the lookup operation get needs to use the indices to decide which quadrant to look up the answer in.

This can be used to write the star function:

  let rec star m = 
    match M.dom m with
    | 0 -> m 
    | 1 -> M.make 1 1 (fun _ _ -> R.star (M.get m 0 0))
    | n -> let (a, b, c, d) = split (n / 2) m in
           let fstar = star (a ++ b * star d * c) in 
           let gstar = star (d ++ c * star a * b) in 
           merge (fstar, fstar * b * gstar,
                  gstar * c * fstar, gstar)
end

This function looks at the dimension of the matrix. A matrix of dimension 0 is empty, and is its own star. If the matrix is of dimension 1, we can just apply the underlying Kleene star to the single entry of the matrix. If the square matrix is bigger, we split it into four quadrants and then directly apply the construction we saw earlier. We split as nearly in half as we can, because a balanced split minimizes the depth of the recursion.
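In equational form, the recursive case implements the block identity (writing \(f = a + b \cdot d^\ast \cdot c\) and \(g = d + c \cdot a^\ast \cdot b\), matching fstar and gstar in the code):

\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix}^\ast
=
\begin{pmatrix}
f^\ast & f^\ast \cdot b \cdot g^\ast \\
g^\ast \cdot c \cdot f^\ast & g^\ast
\end{pmatrix}
\]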

As an example of using this, let's see how we can convert automata given by state-transition functions to regular expressions.

First, we define a trivial Kleene algebra using a datatype of regular expressions, and use this to set up a module of matrices:

module Re = struct
  type t = 
    | Chr of char
    | Seq of t * t 
    | Eps
    | Bot 
    | Alt of t * t 
    | Star of t 

  let zero = Bot
  let one = Eps
  let ( * ) r1 r2 = Seq(r1, r2)
  let ( ++ ) r1 r2 = Alt(r1, r2)
  let star r = Star r 
end
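As an aside, the output below would be easier to read if the constructors simplified as they went. A hypothetical variant of Re (not in the original code) could use smart constructors that apply the laws \(r + \bot = r\), \(\epsilon \cdot r = r\), \(\bot \cdot r = \bot\), and \(\bot^\ast = \epsilon^\ast = \epsilon\):

```ocaml
(* A hypothetical variant of Re whose constructors simplify on the fly
   using Kleene algebra laws; the original builds raw syntax trees. *)
type t =
  | Chr of char
  | Seq of t * t
  | Eps
  | Bot
  | Alt of t * t
  | Star of t

let zero = Bot
let one = Eps

let ( * ) r1 r2 =
  match r1, r2 with
  | Bot, _ | _, Bot -> Bot            (* bot annihilates sequencing *)
  | Eps, r | r, Eps -> r              (* eps is the unit of sequencing *)
  | _ -> Seq (r1, r2)

let ( ++ ) r1 r2 =
  match r1, r2 with
  | Bot, r | r, Bot -> r              (* bot is the unit of alternation *)
  | _ -> Alt (r1, r2)

let star = function
  | Bot | Eps -> Eps                  (* bot* = eps* = eps *)
  | r -> Star r
```

Since this module has the same names and types, it also matches the KLEENE signature and could be passed to Matrix unchanged.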

module M = Matrix(Re)

Then, we can construct a Kleene algebra of matrices of size 2, and then give a 2-state automaton by its transition function.

module T = Square(Re)(M)(struct let size = 2 end)
let t = 
  M.make 2 2 (fun i j -> 
      Re.Chr (match i, j with
          | 0, 0 -> 'a'
          | 0, 1 -> 'b'
          | 1, 0 -> 'c'
          | 1, 1 -> 'd'
          | _ -> assert false))

Now, we can use the Kleene star from T (binding the result as let tstar = T.star t) to compute the transitive closure of this transition function:

# M.get tstar 0 1;;
- : M.elt =
Seq (Star (Alt (Chr 'a', Seq (Chr 'b', Seq (Star (Chr 'd'), Chr 'c')))),
 Seq (Chr 'b',
  Star (Alt (Chr 'd', Seq (Chr 'c', Seq (Star (Chr 'a'), Chr 'b'))))))

Voila! A regexp corresponding to the possible paths from 0 to 1. Reading it off, it is \((a + b \cdot d^\ast \cdot c)^\ast \cdot b \cdot (d + c \cdot a^\ast \cdot b)^\ast\) -- exactly the upper-right entry of the block-star construction.

The code is available as a GitHub gist. Bugfixes welcome!