Wednesday, October 30, 2019

Every Finite Automaton has a Corresponding Regular Expression

Today I'd like to blog about the nicest proof I know of Kleene's theorem that regular expressions and finite automata are equivalent. One direction of this proof -- that every regular expression has a corresponding finite automaton -- is very well-known to programmers, since Thompson's construction, which converts regular expressions to nondeterministic automata, and the subset construction, which turns nondeterministic automata into deterministic automata, are the basis of every lexer and regexp engine there is.

However, the proof of the other direction -- that every automaton has a corresponding regexp -- is often stated, but seldom proved. This is a shame, since it has an elegant formulation in terms of linear algebra(!).

The Construction

First, let's assume we have a Kleene algebra \((R, 1, \cdot, 0, +, \ast)\), such as the regular expressions. Now, define the category \(C\) as follows:

  • The objects of \(C\) are the natural numbers.
  • A morphism \(f : n \to m\) is an \(n \times m\) matrix of elements of \(R\). We will write \(f_{(i,j)}\) to get the element of \(R\) in the \(i\)-th row and \(j\)-th column.
  • The identity morphism \(id : n \to n\) is the \(n \times n\) matrix which is \(1\) on the diagonal and \(0\) off the diagonal
  • The composition \(f;g\) of \(f : n \to m\) and \(g : m \to o\) is given by matrix multiplication:

\[ (f;g)_{(i, j)} = \sum_{k = 0}^{m-1} f_{(i,k)} \cdot g_{(k,j)} \]

This may seem like a rather peculiar thing to do -- matrices filled with regular expressions? However, to understand why this is potentially a sensible idea, let's look at a generic automaton of dimension 2.

(Figure: a two-state automaton with states 0 and 1, where state 0 has a self-loop labelled \(a\) and an edge to state 1 labelled \(b\), and state 1 has an edge back to state 0 labelled \(c\) and a self-loop labelled \(d\).)

Given such a picture, we can give the transition matrix for this as follows:

\[ A = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & a & b \\ 1 & c & d \\ \end{array} \]

So the \((i,j)\)-th entry tells us the label of the edge from node \(i\) to node \(j\). In our category, this is an endomap \(A : 2 \to 2\). What happens if we compose \(A\) with itself? Since composition is just matrix multiplication, we get:

\[ \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] \cdot \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] = \left[\begin{array}{cc} aa + bc & ab + bd \\ ca + dc & cb + dd \end{array}\right] \]

Notice that the \((i,j)\)-th entry is a regular expression describing the paths of length two! So composition of arrows corresponds to path composition. The identity matrix gives the paths of length 0, the transition matrix gives the paths of length 1, and matrix multiplication gives path concatenation. So it's easy to get a regexp for any bounded length path. Indeed, it is easy to see that the endomaps of \(n\) form a semiring.
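In particular, summing powers of the transition matrix collects all the bounded-length paths at once: the \((i,j)\)-th entry of

\[ \mathrm{id} + A + A^2 + \ldots + A^k \]

is a regular expression for all the paths from \(i\) to \(j\) of length at most \(k\).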

So the endomaps of \(n\) are \(n \times n\) matrices, and we can turn them into a semiring \((S, 1_S, \cdot_S, 0_S, +_S)\) as follows:

  • The carrier \(S\) is \(\mathrm{Hom}(n,n)\), the set of \(n \times n\) matrices over \(R\).
  • The unit \(1_S\) is the identity map \(\mathrm{id} : n \to n\)
  • The multiplication \(\cdot_S\) is composition of morphisms (i.e., matrix multiplication). Since these are endomorphisms (i.e., square matrices), the composite is another endomorphism of the same dimension.
  • The zero \(0_S\) is the matrix with all 0 entries.
  • The addition \(f + g\) is the pointwise addition of matrices -- i.e., \((f + g)_{i, j} = f_{i,j} + g_{i,j}\)

Now we can ask whether we can equip endomaps with a Kleene star operation as well. Recall that the defining equation for the Kleene star is

\[ s\ast = 1 + s\cdot s\ast \]

which says that we are looking for a reflexive-transitive closure operation. That is, \(s\ast\) consists of the 0-length paths, together with anything in \(s\) followed by anything in \(s\ast\). So we want a definition of the Kleene star which satisfies this equation. Let's consider the case of the size 2 automaton again:

(Figure: the same two-state automaton as before, with a self-loop \(a\) on state 0, an edge \(b\) from 0 to 1, an edge \(c\) from 1 to 0, and a self-loop \(d\) on state 1.)

Now, let's think about the loop-free paths from 0 to 0 -- that is, the paths that begin and end at 0 without passing through 0 in between. There are two classes of them:

  1. The single step \(a\).
  2. We can take a step from 0 to 1 via \(b\), take some number of steps from 1 to itself via \(d\), and then back to 0 via \(c\).

So we can describe the loop-free paths from 0 to 0 via the regular expression

\[ F \triangleq a + b(d\ast)c \]

Similarly, the loop-free paths from 1 to 1 are either:

  1. The single step \(d\).
  2. We can take a step from 1 to 0 via \(c\), take some number of steps from 0 to itself via \(a\), and then return to 1 via \(b\).

So these paths are described by \[ G \triangleq d + c(a\ast)b \]

So the set of all paths from 0 to 0 is \(F\ast\), and the set of all paths from 1 to 1 is \(G\ast\). This lets us write down the Kleene star of the whole transition matrix:

\[ A\ast = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & F\ast & {F\ast} \cdot b \cdot {G\ast} \\ 1 & {G\ast} \cdot c \cdot {F\ast} & G\ast \\ \end{array} \]

Proving that this is a correct definition requires a bit of work, so I will skip over it. That's because I want to get to the cool bit, which is that this construction is enough to define the Kleene star for any dimension \(n\)! We'll prove this by well-founded induction. First, suppose we have an \(n \times n\) transition matrix \(T\) for \(n > 1\). Then, we can divide the matrix into blocks, with \(A\) and \(D\) as square submatrices:

\[ T = \left[ \begin{array}{c|c} A & B \\\hline C & D \end{array} \right] \]

and then define \(T\ast\) as:

\[ T\ast = \left[ \begin{array}{c|c} F\ast & {F\ast} \cdot B \cdot {G\ast} \\ \hline {G\ast} \cdot C \cdot {F\ast} & G\ast \\ \end{array} \right] \]

where

\[ \begin{array}{lcl} F & \triangleq & A + B(D\ast)C \\ G & \triangleq & D + C(A\ast)B \\ \end{array} \]

By induction, we know that square matrices of dimension smaller than \(n\) form a Kleene algebra, so we know that \(A\) and \(D\) have Kleene algebra structure. The reason for defining the categorical structure also becomes apparent here -- even though \(A\) and \(D\) are square, \(B\) and \(C\) may not be, and we will need to make use of the fact that we defined composition on non-square matrices for them.

References and Miscellany

I have, as is traditional for bloggers, failed to prove any of my assertions. Luckily other people have done the work! Two papers that I particularly like are:

One fact which didn't fit into the narrative above, but which I think is extremely neat, is about the linear-algebraic view of the Kleene star. To see what it is, recall the definition of the matrix exponential for a matrix \(A\):

\[ e^A \triangleq \sum_{k = 0}^\infty \frac{A^k}{k!} = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \ldots \]

Now, note that

  1. \(k!\) is an integer
  2. any integer can be expressed as a formal sum \(n = \overbrace{1 + \ldots + 1}^{n \,\mbox{times}}\),
  3. and in regular languages, we have \(r + r = r\).

Hence \(k! = 1\), and since \(A/1 = A\), we have

\[ e^A \triangleq \sum_{k = 0}^\infty A^k = I + A + A^2 + A^3 + \ldots \]

But this is precisely the definition of \(A\ast\)! In other words, for square matrices over Kleene algebras, Kleene iteration is precisely the same thing as the matrix exponential.

The Implementation

The nice thing about this construction is that you can turn it into running code more or less directly, which leads to an implementation of the DFA-to-regexp algorithm in less than 100 lines of code. I'll give an implementation in OCaml.

First, let's give a signature for Kleene algebras. All the constructions we talked about are parameterized in the Kleene algebra, so our implementation will be too.

module type KLEENE = sig
  type t 
  val zero : t 
  val ( ++ ) : t -> t -> t 
  val one : t
  val ( * ) : t -> t -> t 
  val star : t -> t 
end

You can see here that we have zero and one as the unit for the addition ++ and multiplication * respectively, and we also have a star operation representing the Kleene star.
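As a tiny example instance of this signature (my own, not something the construction depends on), the booleans form a Kleene algebra: addition is disjunction, multiplication is conjunction, and star is constantly true, since \(1 + s \cdot s\ast\) is always true. Matrices over this algebra describe graph reachability.

```ocaml
(* A minimal example instance of KLEENE: the two-element Kleene
   algebra of booleans. ++ is "or", * is "and", and star is
   constantly true, because 1 + s * (star s) is true for any s. *)
module Bool_kleene = struct
  type t = bool
  let zero = false
  let ( ++ ) = ( || )
  let one = true
  let ( * ) = ( && )
  let star _ = true
end
```

Instantiating the matrix construction below with this algebra computes reachability: the \((i,j)\) entry of the starred adjacency matrix says whether node \(j\) is reachable from node \(i\).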

Similarly, we can give a signature for matrices over an element type.

module type MATRIX = sig
  type elt
  type t 

  val dom : t -> int
  val cod : t -> int
  val get : t -> int -> int -> elt

  val id : int -> t 
  val compose : t -> t -> t 

  val zero : int -> int -> t 
  val (++) : t -> t -> t 

  val make : int -> int -> (int -> int -> elt) -> t 
end

So this signature requires an element type elt, a matrix type t, and a collection of operations. We can get the number of rows in the matrix with dom and the columns with cod, with the names coming from viewing dimensions as the objects and matrices as the morphisms of a category. Similarly this gives us a notion of identity with id, which takes an integer and returns the identity matrix of that size. Matrix multiplication is just composition, which leads to the name compose.

We also want the ability to create zero matrices with the zero function, and the ability to do pointwise addition of matrices with the (++) operation. Neither of these operations needs square matrices, but adding matrices pointwise does require them to have the same dimensions.

Finally we want a way to construct a matrix with make, which is given the dimensions and a function to produce the elements.

Now, since the matrix construction was parameterized by a Kleene algebra, we can write a functor Matrix in OCaml which takes a module of type KLEENE to produce an implementation of the MATRIX signature with elements of the Kleene algebra as the element type:

module Matrix(R : KLEENE) : MATRIX with type elt = R.t = 
struct
  type elt = R.t 
  type t = {
      row : int; 
      col: int; 
      data : int -> int -> R.t
    }

We take the element type to be elements of the Kleene algebra, and take a matrix to be a record with the number of rows, the number of columns, and a function to produce the entries. This is not a high-performance choice, but it works fine for illustration purposes.

  let make row col data = {row; col; data}
  let dom m = m.row
  let cod m = m.col

  let get m i j = 
      assert (0 <= i && i < m.row);
      assert (0 <= j && j < m.col);
      m.data i j 

To make a matrix, we just use the data from the constructor to build the appropriate record, and to get the domain and codomain we just select the row or column fields. To get an element we just call the underlying function, with an assertion check to ensure the arguments are in bounds.

This is a point where a more dependently-typed language would be nice: if the dimensions of the matrix were tracked in the type, then we could omit the assertion checks. (Basically, the matrix type would become a type constructor t : nat -> nat -> type corresponding to the hom-sets.)

  let id n = make n n (fun i j -> if i = j then R.one else R.zero)

  let compose f g = 
    assert (cod f = dom g);
    let data i j = 
      let rec loop k acc = 
        if k < cod f then
          loop (k+1) R.(acc ++ (get f i k * get g k j))
        else
          acc
      in
      loop 0 R.zero
    in
    make (dom f) (cod g) data

The id function takes an integer, and returns a square matrix that is R.one on the diagonal and R.zero off-diagonal. To compose two matrices, the routine checks that the dimensions match, and then implements a very naive row-vector times column-vector calculation for each indexing. What makes this naive is that it will get recalculated each time!

  let zero row col = make row col (fun i j -> R.zero)

  let (++) m1 m2 = 
    assert (dom m1 = dom m2);
    assert (cod m1 = cod m2);
    let data i j = R.(++) (get m1 i j) (get m2 i j) in
    make (dom m1) (cod m1) data 
end                           

Finally we can implement the zero matrix operation, and matrix addition, both of which work the expected way.
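As an aside, if the recomputation in compose ever became a problem, we could memoize the entry function before wrapping it in a matrix. Here is a sketch, using a helper memo2 of my own devising (not part of the gist):

```ocaml
(* Cache the results of an (int -> int -> 'a) function in a hash
   table, so that each entry is computed at most once. Wrapping a
   matrix's data function with this avoids recomputing entries. *)
let memo2 (f : int -> int -> 'a) : int -> int -> 'a =
  let table = Hashtbl.create 16 in
  fun i j ->
    match Hashtbl.find_opt table (i, j) with
    | Some v -> v
    | None ->
      let v = f i j in
      Hashtbl.add table (i, j) v;
      v
```

Then make row col (memo2 data) builds a matrix whose entries are computed on demand, but never twice.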

We will use all these operations to show that square matrices form a Kleene algebra. To show this, we will define a functor Square which is parameterized by

  1. A Kleene algebra R.
  2. An implementation of matrices with element type R.t
  3. A size (i.e., an integer)

module Square(R : KLEENE)
             (M : MATRIX with type elt = R.t)
             (Size : sig val size : int end) : KLEENE with type t = M.t = 
struct
  type t = M.t 

The implementation type is the matrix type M.t, and the semiring operations are basically inherited from the module M:

  let (++) = M.(++)
  let ( * ) = M.compose
  let one = M.id Size.size
  let zero = M.zero Size.size Size.size

The only changes are that one and zero are restricted to square matrices of size Size.size, and so we don't need to pass them dimension arguments.

All of the real work is handled in the implementation of the Kleene star. We can follow the construction very directly. First, we can define a split operation which splits a square matrix into four smaller pieces:

  let split i m = 
    let open M in 
    assert (i >= 0 && i < dom m);
    let k = dom m - i in 
    let shift m x y = fun i j -> get m (i + x) (j + y) in 
    let a = make i i (shift m 0 0) in
    let b = make i k (shift m 0 i) in
    let c = make k i (shift m i 0) in
    let d = make k k (shift m i i) in 
    (a, b, c, d)

This divides a matrix into four, by finding the split point i, and then making a new matrix that uses the old matrix's data with a suitable offset. So b's x-th row and y-th column will be the x-th row and (y+i)-th column of the original, for example.

Symmetrically, we need a merge operation.

  let merge (a, b, c, d) = 
    assert (M.dom a = M.dom b && M.cod a = M.cod c);
    assert (M.dom d = M.dom c && M.cod d = M.cod b);
    let get i j = 
      match i < M.dom a, j < M.cod a with
      | true, true   -> M.get a i j 
      | true, false  -> M.get b i (j - M.cod a)
      | false, true  -> M.get c (i - M.dom a) j 
      | false, false -> M.get d (i - M.dom a) (j - M.cod a)
    in
    M.make (M.dom a + M.dom d) (M.cod a + M.cod d) get 

To merge four matrices, their dimensions need to match up properly, and then the lookup operation get needs to use the indices to decide which quadrant to look up the answer in.

This can be used to write the star function:

  let rec star m = 
    match M.dom m with
    | 0 -> m 
    | 1 -> M.make 1 1 (fun _ _ -> R.star (M.get m 0 0))
    | n -> let (a, b, c, d) = split (n / 2) m in
           let fstar = star (a ++ b * star d * c) in 
           let gstar = star (d ++ c * star a * b) in 
           merge (fstar, fstar * b * gstar,
                  gstar * c * fstar, gstar)
end

This function looks at the dimension of the matrix. The empty matrix is its own star, and if the matrix is dimension 1, we can just use the underlying Kleene star on the singleton entry in the matrix. If the square matrix is bigger, we can split the matrix in four and then directly apply the construction we saw earlier. We try to divide the four quadrants as nearly in half as we can, because this reduces the number of times we need to make a recursive call.

As an example of using this, let's see how we can convert automata given by state-transition functions to regular expressions.

First, we define a trivial Kleene algebra using a datatype of regular expressions, and use this to set up a module of matrices:

module Re = struct
  type t = 
    | Chr of char
    | Seq of t * t 
    | Eps
    | Bot 
    | Alt of t * t 
    | Star of t 

  let zero = Bot
  let one = Eps
  let ( * ) r1 r2 = Seq(r1, r2)
  let ( ++ ) r1 r2 = Alt(r1, r2)
  let star r = Star r 
end

module M = Matrix(Re)

Then, we can construct a Kleene algebra of matrices of size 2, and then give a 2-state automaton by its transition function.

module T = Square(Re)(M)(struct let size = 2 end)
let t = 
  M.make 2 2 (fun i j -> 
      Re.Chr (match i, j with
          | 0, 0 -> 'a'
          | 0, 1 -> 'b'
          | 1, 0 -> 'c'
          | 1, 1 -> 'd'
          | _ -> assert false))

Now, we can define let tstar = T.star t to take the Kleene star of this matrix, computing the transitive closure of the transition function:

# M.get tstar 0 1;;
- : M.elt =
Seq (Star (Alt (Chr 'a', Seq (Chr 'b', Seq (Star (Chr 'd'), Chr 'c')))),
 Seq (Chr 'b',
  Star (Alt (Chr 'd', Seq (Chr 'c', Seq (Star (Chr 'a'), Chr 'b'))))))

Voila! A regexp corresponding to the possible paths from 0 to 1.

The code is available as a GitHub gist. Bugfixes welcome!

Friday, August 23, 2019

New Draft Paper: Survey on Bidirectional Typechecking

Along with J. Dunfield, I have written a new survey paper on bidirectional typechecking.

Bidirectional typing combines two modes of typing: type checking, which checks that a program satisfies a known type, and type synthesis, which determines a type from the program. Using checking enables bidirectional typing to break the decidability barrier of Damas-Milner approaches; using synthesis enables bidirectional typing to avoid the large annotation burden of explicitly typed languages. In addition, bidirectional typing improves error locality. We highlight the design principles that underlie bidirectional type systems, survey the development of bidirectional typing from the prehistoric period before Pierce and Turner's local type inference to the present day, and provide guidance for future investigations.

This is the first survey paper I've tried to write, and an interesting thing about the process is that it forces you to organize your thinking, and that can lead to you changing your mind. For example, before we started this paper I was convinced that bidirectional typechecking was obviously a manifestation of focalization. When we attempted to make this argument clearly in the survey, that led me to realize I didn't believe it! Now I think that the essence of bidirectional typechecking arises from paying close attention to the mode discipline (in the logic programming sense) of the typechecking relations.

Also, if you think we have overlooked some papers on the subject, please let us know. Bidirectional typechecking is a piece of implementation folklore that is perhaps over three decades old, which makes it very hard for us to feel any certainty about having found all the relevant literature.

Wednesday, August 21, 2019

On the Relationship Between Static Analysis and Type Theory

Some months ago, Ravi Mangal sent me an email with a very interesting question, which I reproduce in part below:

The question that has been bothering me is, "How are static analyses and type systems related?". Besides Patrick Cousot's Types as abstract interpretations, I have failed to find other material directly addressing this. Perhaps, the answer is obvious, or maybe it has just become such a part of the community's folklore that it hasn't been really written down. However, to me, it's not clear how one might even phrase this question in a precise mathematical manner.

The reason this question is so interesting is because some very beautiful answers to it have been worked out, but have for some reason never made it into our community's folklore. Indeed, I didn't know the answers to this question until quite recently.

The short answer is that there are two views of what type systems are, and in one view type systems are just another static analysis, and in another they form the substrate upon which static analyses can be built -- and both views are true at the same time.

Static Analysis (including Types) as Properties

One valid view of a type system, is that it is just another species of static analysis. At a high level, the purpose of a static analysis is to answer the question, "Can I tell what properties this program satisfies without running it?"

Abstract interpretation, model checking, dataflow analysis and type inference are all different ways of answering this question, each with somewhat different emphases and priorities. IMO, to understand the relations between these styles of analysis, it's helpful to think about the two main styles of semantics: operational semantics and denotational semantics.

A denotational semantics turns a program into a mathematical object (sets, domains, metric spaces, etc). This has the benefit that it lets you turn the full power of mathematics towards reasoning about programs, but it has the cost that deducing properties of arbitrary mathematical gadgets can require arbitrary mathematical reasoning.

Operational semantics, on the other hand, turns a program into a state machine. This has the benefit that this is relatively easy to do, and close to how computers work. It has the cost that it only gives semantics to whole programs, and in reality people tend to write libraries and other partial programs. (Also, infinite-state operational semantics doesn't make things obviously easier to work with than a denotational semantics.)

Operational semantics gives rise to model checking: if you have a program, you can view the sequence of states it takes on as a concrete model of temporal logic. Model checking then checks whether a temporal property overapproximates or refines the concrete model.

So basically all static analyses work by finding some efficiently representable approximation of the meaning of a program, and then using that to define an algorithm to analyse the program, typically by some form of model checking. For type systems, the approximations are the types, and in abstract interpretation, this is what the abstract domains are.

David A. Schmidt's paper Abstract Interpretation from a Denotational Semantics Perspective shows how you can see abstract domains as arising from domain theory, by noting that domains arise as the infinite limit of a series of finite approximations, and so you can get an abstract domain by picking a finite approximation rather than going all the way to the limit.

Thomas Jensen's PhD thesis, Abstract Interpretation in Logical Form, shows how an abstract interpretation of a higher-order functional language corresponds to a certain intersection type system for it. This helps illustrate the idea that abstract domains and types are both abstractions of a program. (FWIW, finding equivalences between abstract domains and type systems used to be big topic back in the 90s, but seems to have disappeared from view. I don't know why!)

Now, model checking an infinite-state system will obviously fail to terminate in general. However, it can then be made effective and algorithmic by using your abstractions (either from types or from abstract interpretation) to turn the model into a finitary one that can be exhaustively searched.

For dataflow analysis, David A. Schmidt works out this idea in detail in his paper, Data Flow Analysis is Model Checking of Abstract Interpretations, where he shows how if you build an abstraction of program traces and model-check against mu-calculus formulas, you can get back most classical dataflow analyses. (If one lesson you draw is that you should read everything David Schmidt has written, well, you won't be wrong.)

It's less common to formulate type-based analyses in this form, but some people have looked at it.

  1. There is a line of work on "higher-order model checking" which works by taking higher-order programs, using type inference for special type systems to generate abstractions for them which lie in special subclasses of pushdown systems which are decidable for the properties of interest for program analysis (eg, emptiness testing). Luke Ong wrote an overview of the area for LICS 2015, in his paper Higher-Order Model Checking: an Overview.

  2. Dave Clarke and Ilya Sergey have a nice IPL 2012 paper, A correspondence between type checking via reduction and type checking via evaluation, in which they show how you can view types as abstract values, and use this perspective to derive an abstract interpreter for a program which operates on these abstract values. They don't state it explicitly, since it was too obvious to mention, but you can check that a program e has the type int by asserting that e will eventually evaluate to the abstract value int. This is easy to check since every program in their abstract semantics terminates -- and so you can just evaluate e and see if the result is int.
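To make idea (2) concrete, here is a toy sketch of my own (not code from their paper): an evaluator for a tiny expression language whose values are types, so that "running" a program in the abstract semantics amounts to typechecking it.

```ocaml
(* Toy abstract interpreter in the Clarke/Sergey spirit: the
   abstract values are types, and evaluation is typechecking.
   Ill-typed programs get "stuck", modelled here with failwith. *)
type ty = Int | Bool

type exp =
  | Num of int
  | True
  | False
  | Add of exp * exp
  | If of exp * exp * exp

let rec abstract_eval : exp -> ty = function
  | Num _ -> Int
  | True | False -> Bool
  | Add (e1, e2) ->
    (match abstract_eval e1, abstract_eval e2 with
     | Int, Int -> Int
     | _ -> failwith "stuck: + applied to a non-integer")
  | If (e1, e2, e3) ->
    (match abstract_eval e1, abstract_eval e2, abstract_eval e3 with
     | Bool, t2, t3 when t2 = t3 -> t2
     | _ -> failwith "stuck: badly typed conditional")
```

For example, abstract_eval (If (True, Num 1, Add (Num 2, Num 3))) evaluates to the abstract value Int, which is exactly the statement that the expression has type int -- and since every abstract run terminates, the check is decidable.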

Types as Grammar

However, while types can be seen as a way of stating program properties, they have a second life as a way of defining what the valid programs are. A type system can also be seen as a way of statically ruling out certain programs from consideration. That is, an expression like 3("hi"), which applies the number 3 to the argument "hi", is valid Javascript or Python code, but is not a valid program in OCaml or Java, as it would be rejected at compile time by the typechecker.

As a result, type systems can also be used to simplify the semantics: if a term is statically ruled out, you don't have to give semantics to it. The benefit of doing this is that this can dramatically simplify your static analyses.

For example, if you have a typed functional language with integers and function types, then building an abstract interpretation to do a sign analysis is simplified by the fact that you don't have to worry about function values flowing into the integer domain. You can just take the obvious abstract domains for each type and then rely on the type system to ensure that only programs where everything fits together the way you want are ever allowed, freeing you of the need for a jumbo abstract domain that abstracts all values.
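As an illustrative sketch (my own, with made-up names): the abstract domain for the integer type of such a language needs only sign information, with no case for "a function value got in here", because the type system already rules that out.

```ocaml
(* A toy sign domain for a typed language's int type. Since only
   integers can flow here, the lattice needs no case for other
   kinds of values. *)
type sign = Bot | Neg | Zero | Pos | Top

let abstract_int (n : int) : sign =
  if n < 0 then Neg else if n = 0 then Zero else Pos

(* Abstract addition: the best sign we can give to a sum, given
   the signs of its arguments. *)
let add_sign (s1 : sign) (s2 : sign) : sign =
  match s1, s2 with
  | Bot, _ | _, Bot -> Bot   (* unreachable code stays unreachable *)
  | Zero, s | s, Zero -> s   (* adding zero changes nothing *)
  | Neg, Neg -> Neg
  | Pos, Pos -> Pos
  | _ -> Top                 (* mixed signs: can't tell *)
```

At function types one would use, say, maps between the argument and result domains -- again with the type structure telling you which abstract domain to reach for.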

For cultural reasons, this methodology is often used in type systems research, but relatively rarely in static analysis work. Basically, if you tell a type systems researcher that you are changing the language to make it easier to analyze, they will say "oh, okay", but if you tell a static analysis researcher the same thing, they may express a degree of skepticism about your ability to update all Java code everywhere. Naturally, neither reaction is right or wrong: it's contextual. After all, there are billions of lines of Java code that won't change, and also the Java language does change from version to version.

However, as researchers we don't just pick up context from our problems, but also from our peers, and so the culture of type systems research tends to privilege new languages and the culture of static analysis research tends to privilege applicability to existing codebases, even though there is no fundamental technical reason for this linkage. After all, all static analyses exploit the structure of the language whenever they can in order to improve precision and scalability.

For example, in a typical intraprocedural dataflow analysis, we initialize all the local variables to the bottom element of the dataflow lattice. This is a sound thing to do, because the language is constrained so that local variables are freshly allocated on each procedure invocation. If, on the other hand, we were doing dataflow analysis of assembly code, different subroutines would all simultaneously affect the dataflow values assigned to each register.

As a consequence, one thing I'm looking forward to is finding out what static analysis researchers will do when they start taking a serious look at Rust. Its type system enforces really strong invariants about aliasing and data races, which ought to open the door to some really interesting analyses.

The Relationship Between the Two Views

As you can see, grammaticality constraints and program analysis are not opposed, and can overlap in effect. Furthermore, this means people writing papers on type systems are often blurry about which perspective they are taking -- in one section of a paper they may talk about types-as-properties, and in another they may talk about types-as-grammar. This ambiguity is technically harmless, but can be confusing to a reader coming to type systems from other areas within formal methods.

The reason this ambiguity is mostly harmless is because for well-designed languages there will be a nice relationship between these two views of what types are. John C Reynolds explained this, in his lovely paper The Meaning of Types: From Intrinsic to Extrinsic Semantics. (In his language, "intrinsic types" are types-as-grammar, and "extrinsic types" are types-as-property.)

Basically, he noted that when you give a denotational semantics under a types-as-grammar view, you interpret typing derivations, and when you give a denotational semantics in the types-as-property view, you interpret program terms. For types-as-grammar to make sense, its semantics must be coherent: two typing derivations of the same program must have the same semantics. (Otherwise the semantics of a program can be ambiguous.) But once coherence holds, it becomes feasible to prove the equivalence between the intrinsic and extrinsic semantics, because there is a tight connection between the syntax of the language and the typing derivations.

Noam Zeilberger and Paul-Andre Melliès wrote a POPL 2015 paper, Functors are Type Refinement Systems, where they work out the general semantic structure underpinning Reynolds' work, in terms of the relationship between intersection types (or program logic) and a base language.

Compositionality

Another cultural effect is the importance type systems research places on compositionality. It is both absolutely the case that there is no a priori reason a type system has to be compositional, and also that most type systems researchers -- including myself! -- will be very doubtful of any type system which is not compositional.

To make this concrete, here's an example of a type system which is pragmatic, reasonable, and yet which someone like me would still find distasteful.

Consider a Java-like language whose type system explicitly marks nullability. We will say that a method argument x : C means that x is a possibly-null value of type C, and an argument x : C! means that x is definitely of class C, with no null value. Now, suppose that in this language, we decide that null tests should refine the type of a variable:

                       // x : C outside the if-block
  if (x != null) { 
    doSomething(x);    // x : C! inside the if-block
  } 

Basically, we checked that x was not null, so we promoted its type from C to C!. This is a reasonable thing to do, which industrial languages like Kotlin support, and yet you will be hard-pressed to find many formal type theories supporting features like this.

The reason is that the culture of type theory says that type systems should satisfy a substitution principle: a variable stands in for a term, and so substituting a term of the same type as the variable should result in a typeable term. In this case, suppose that e is a very complicated expression of type C. Then if we do a substitution of e for x, we get:

  if (e != null) { 
    doSomething(e);    // e : C! -- but how?
  } 

Now we somehow have to ensure that e has the type C! inside the scope of the conditional. For variables this requires updating the context, but for arbitrary terms it will be much harder, if it is even possible at all!

The culture of static analysis says that since it's easy for variables, we should handle that case, whereas the culture of type theory says that anything you do for variables should work for general terms. Again, which view is right or wrong is contextual: it's genuinely useful for Kotlin programmers to have dataflow-based null-tracking, but at the same time it's much easier to reason about programs if substitution works.

The reason type theorists have this culture is that type theory has two parents. On one side are Scott and Strachey, who consciously and explicitly based denotational semantics upon the compositionality principle; and the other side comes from the structural proof theory originated by Gerhard Gentzen, which was invented precisely to prove cut-eliminability (which basically just is a substitution principle). As a result, the type-theoretic toolbox is full of tools which both assume and maintain compositionality. So giving that up (or moving past it, depending on how you want to load the wording emotionally) is something which requires very careful judgement.

For example, the Typed Racket language extends the dynamically-typed Racket language with types, and to support interoperation with idiomatic Racket code (which uses control-flow sensitive type tests), they do flow-sensitive typing. At ESOP 2011, Arjun Guha, Claudiu Saftoiu, and Shriram Krishnamurthi wrote a paper about a particularly elegant way of combining flow analysis and typechecking, Typing Local Control and State Using Flow Analysis.

Conclusion

Finally, one might wonder whether it is worth it to try developing a unified perspective on type theory and static analysis. It's a lot of work to learn any one of these things, and (for example) David Schmidt's papers, while deeply rewarding, are not for the faint of heart. So maybe we should all specialize and not worry about what members of cousin tribes are doing?

All I can say is that I think I benefited tremendously from the fact that (largely by accident) I ended up learning separation logic, denotational semantics and proof theory as part of my PhD. There actually is a wholeness or consilience to the variety of techniques that the semantics community has devised, and insights apply across domains in surprising ways.

My recent PLDI paper with Jeremy Yallop is a nice example of this. Very little in this paper is hard, but what makes it easy is that we felt free to switch perspectives as we needed.

Because I had learned to like automata theory from reading papers on model checking, I ended up reading about Brüggemann-Klein and Wood's characterization of the deterministic regular languages. Then, because I share the aesthetics of denotational semantics, I realized that they invented a superior, compositional characterization of the FOLLOW sets used in LL(1) parsing. Then, because I knew about the importance of substitution from type theory, this led me to switch from the BNF presentation of grammars to the mu-regular expressions. This let us offer the traditional parser combinator API, while under the hood having a type system which identified just the LL(1)-parseable grammars. Moreover, the type inference algorithm we use in the implementation is basically a dataflow analysis. And that set things up in a form to which multistage programming techniques were applicable, letting us derive parser combinators that run faster than yacc.

But beyond the direct applications like that, it is just good to be able to go to a conference and be interested by what other people are doing. Everyone attending has developed deep expertise in some very esoteric subjects, and learning enough to learn from them is valuable. Research communities are long-running conversations, and being able to hold up your end of the conversation makes the community work better.

Monday, August 12, 2019

Michael Arntzenius on the Future of Coding Podcast

My PhD student Michael Arntzenius is on the Future of Coding podcast, talking about the things he has learned while developing Datafun. In the course of a PhD, people tend to work on extremely refractory problems, which is a useful discipline for developing and sharpening your worldview. Since Michael is a particularly thoughtful person, he's developed a particularly thoughtful view of computing, and so the interview is well worth reading or listening to.

This episode explores the intersections between various flavors of math and programming, and the ways in which they can be mixed, matched, and combined. Michael Arntzenius, "rntz" for short, is a PhD student at the University of Birmingham building a programming language that combines some of the best features of logic, relational, and functional programming. The goal of the project is "to find a sweet spot of something that is more powerful than Datalog, but still constrained enough that we can apply existing optimizations to it and imitate what has been done in the database community and the Datalog community." The challenge is combining the key part of Datalog (simple relational computations without worrying too much about underlying representations) and of functional programming (being able to abstract out repeated patterns) in a way that is reasonably performant.

This is a wide-ranging conversation including: Lisp macros, FRP, Eve, miniKanren, decidability, computability, higher-order logics and their correspondence to higher-order types, lattices, partial orders, avoiding logical paradoxes by disallowing negation (or requiring monotonicity) in self reference (or recursion), modal logic, CRDTs (which are semi-lattices), and the place for formalism in programming. This was a great opportunity for me to brush up on (or learn for the first time) some useful mathematical and type-theoretic keywords. Hope you get a lot out of it as well -- enjoy!

I'd like to thank Steve Krouse for interviewing Michael. This interview series is a really nice idea -- a lot of the intuition or worldview underpinning research programs never makes it into a paper, but can be brought out nicely in a conversation.

Also, if you prefer reading to listening, the link also has a transcript of the podcast available.

Thursday, July 11, 2019

All Comonads, All The Time

(If you are on the POPL 2020 program committee, please stop reading until after your reviews are done.)

It's been a few months since I've last blogged, so I figure I should say what I have been up to lately. The answer seems to be comonads. I have two new draft papers, both of which involve comonadic type theories in a fairly critical way. (I suppose that this makes me a type-theoretic hipster, who doesn't do monads now that they're popular.)

Monday, May 13, 2019

Implementing Inverse Bidirectional Typechecking

In my last post, I remarked that the inverse bidirectional type system was obviously algorithmic. In this post, let's implement it! What follows is a bit of OCaml code implementing the type system of the previous post.

First, let's give a data type to represent the types of the linear type system. As usual, we will have a datatype tp with one constructor for each grammatical production. In the comment next to each constructor, I'll give the term that the constructor corresponds to.

type tp = 
  | One                (* represents 1     *)
  | Tensor of tp * tp  (* represents A ⊗ B *) 
  | Lolli of tp * tp   (* represents A ⊸ B *) 

Now, we can give a datatype to represent expressions. We'll represent variables with strings, and use the datatype exp to represent expressions. As before, there is a comment connecting the datatype to the expressions of the grammar.

type var = string

type exp = 
  | Unit                               (* represents ()                  *)
  | LetUnit of exp * exp               (* represents let () = e in e'    *)
  | Pair of exp * exp                  (* represents (e, e')             *)
  | LetPair of var * var * exp * exp   (* represents let (x,y) = e in e' *)
  | Lam of var * exp                   (* represents λx. e               *)
  | App of exp * exp                   (* represents e e'                *)
  | Var of var                         (* represents x                   *)

Now we have to do something annoying, and implement some functions on the option datatype which really should be in the standard library. Basically we just want the standard functional programming structure on option types -- folds, maps, and monadic structure -- so we just go ahead and implement it.

module Option = struct
  type 'a t = 'a option

  let map f = function
    | None -> None
    | Some x -> Some (f x)


  let return x = Some x 

  let fail = None

  let (>>=) m f = 
    match m with
    | None -> None
    | Some x -> f x

  let fold some none = function
    | None -> none
    | Some x -> some x
end
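As a quick standalone illustration of why this monadic structure is convenient (this snippet inlines return and >>= so it runs on its own, and safe_div is a made-up helper, not part of the typechecker): >>= threads Some values through a chain of computations and short-circuits on the first None.

```ocaml
(* return and >>= copied from the Option module above *)
let return x = Some x
let ( >>= ) m f = match m with None -> None | Some x -> f x

(* hypothetical helper: integer division that fails on zero *)
let safe_div x y = if y = 0 then None else return (x / y)

let ok = safe_div 12 3 >>= fun a -> safe_div a 2 >>= fun b -> return (b + 1)
(* ok = Some 3 *)

let bad = safe_div 12 0 >>= fun a -> safe_div a 2 >>= fun b -> return (b + 1)
(* bad = None: the first failure short-circuits everything after it *)
```

This is exactly the error-propagation pattern the typechecker relies on: any failing step makes the whole computation fail, with no explicit error-checking at each step.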

Now, we can actually implement the bidirectional typechecker. To understand the implementation, it's actually helpful to understand the interface, first.

module type TYPING =  sig
  type ctx = (var * tp option) list
  type 'a t = ctx -> ('a * ctx) option 

  val map : ('a -> 'b) -> 'a t -> 'b t 
  val return : 'a -> 'a t
  val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t

  val synth : exp -> tp t
  val check : exp -> tp -> unit t

The basic structure of our typechecker is to give a pair of operations check and synth, which respectively check that an expression e has a type tp, and infer a type for an expression. This function will be written in a monadic style, so we also have a type constructor 'a t for typechecking computations, and the usual assortment of functorial (map) and monadic (return and >>=) structure for this type.

The monadic type constructor 'a t is a pretty basic state-and-exception monad. It plumbs the context (of type ctx) through the computation, and can either return a value and an updated context, or it will fail.

An interesting feature of this context representation is that it does not map variables to types – it maps them to the option type tp option. This is because of the way the moding will work out; the type is an output of the typing relation, so when we put a variable into the context we do not give it a type, but instead use the computation to ascribe it one, which will be reflected in the output context. This is also why we use a full state monad rather than a reader monad for the context – we are basically implementing part of Prolog's substitution threading here.

We will also need a number of operations to implement the typechecker.

  val fail : 'a t
  val checkvar : var -> tp -> unit t
  val lookup : var -> tp t 
  val withvar : var -> 'a t -> 'a t
  val tp_eq : tp -> tp -> unit t
end

We will need to fail in order to judge programs ill-typed. The checkvar x tp operation gives the variable x the type tp. The lookup x operation will look in the context to find a type for x, failing if x has not yet been given a type. The operation withvar x m will run the monadic computation m in a context extended with the variable x. (No type is given for the variable, because it's the job of m to give the variable a type.) Finally, there's an equality test tp_eq tp1 tp2, that acts as a guard, failing if the two arguments are unequal.

Now, we can move on to the actual implementation.

module Typing : TYPING = struct
  type ctx = (var * tp option) list 

  type 'a t = ctx -> ('a * ctx) option 

  let map f m ctx = 
    let open Option in 
    m ctx >>= fun (x, ctx) -> 
    return (f x, ctx)

  let return x = fun ctx -> Some(x, ctx)

  let (>>=) m f = fun ctx -> 
    let open Option in 
    m ctx >>= fun (a, ctx') -> 
    f a ctx'

As promised, the computation type is a state-and-exception monad, and the implementation of map and the monadic unit and bind are pretty unsurprising. More interesting are the implementations of the actual operations in the monadic interface.

  let fail : 'a t = fun ctx -> None 

Failure is easy to implement – it just ignores the context, and then returns None.

  let rec checkvar (x : var) (tp : tp) : unit t = fun ctx -> 
    let open Option in 
    match ctx with
    | [] -> fail 
    | (y, None) :: rest when x = y -> return ((), (y, Some tp) :: rest)
    | (y, Some _) :: rest when x = y -> fail 
    | h :: rest -> checkvar x tp rest >>= fun ((), rest') -> 
                   return ((), h :: rest')

The way that checkvar x tp works is that it iterates through the variables in the context, looking for the hypothesis which matches the variable x. When it finds it, it returns an updated context with the type of x set to Some tp. If the variable is already set, then that means that this is the second use of the variable, and so checkvar fails – this enforces the property that variables are used at most one time. If the variable isn't in the context, then checkvar also fails, because this is an out-of-scope variable reference. All other hypotheses are left unchanged.
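To see this linearity enforcement in action, here is a standalone trace (with a pared-down copy of checkvar and the Option bind, so the snippet runs on its own): the first use of x records its type in the context, and a second use fails.

```ocaml
type tp = One
type ctx = (string * tp option) list

(* Option bind and return, as in the Option module above *)
let ( >>= ) m f = match m with None -> None | Some x -> f x
let return x = Some x

(* pared-down copy of checkvar *)
let rec checkvar x tp (ctx : ctx) =
  match ctx with
  | [] -> None
  | (y, None) :: rest when x = y -> return ((), (y, Some tp) :: rest)
  | (y, Some _) :: _ when x = y -> None
  | h :: rest ->
     checkvar x tp rest >>= fun ((), rest') -> return ((), h :: rest')

let first = checkvar "x" One [ ("x", None) ]
(* first = Some ((), [("x", Some One)]): x has now been used at type 1 *)

let second = first >>= fun ((), ctx') -> checkvar "x" One ctx'
(* second = None: a second use of x is rejected *)
```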

  let lookup x (ctx : ctx) = 
    match List.assoc_opt x ctx with
    | None -> Option.fail
    | Some None -> Option.fail
    | Some (Some tp) -> Option.return(tp, ctx)

The lookup x computation is even simpler – it returns tp if (x, Some tp) is in the context, and fails otherwise.

  let withvar (type a) (x : var) (m : a t) : a t = fun ctx -> 
    let open Option in 
    m ((x, None) :: ctx) >>= function
    | (r, (y, Some _) :: ctx') when x = y -> return (r, ctx') 
    | (r, (y, None) :: ctx') when x = y -> fail 
    | _ -> assert false

The withvar x m operation extends the context with the variable x, and then runs m in the extended context.

An invariant our context representation maintains is that the output context has exactly the same variables in exactly the same order as the input context, and so we just pop off the first variable of the output context before returning, checking to make sure that the type of the variable has been set (i.e., Some _) to ensure that the variable was used at least one time. In conjunction with checkvar ensuring that the variable is used at most one time, this will ensure that each variable is used exactly one time.

If the first variable of the output context isn't x, or if the output context is empty, then our invariant is broken, and so we signal an assertion failure.
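The use-at-least-once side of the invariant can be illustrated standalone as well (again with a pared-down copy of withvar and the Option bind, so it runs on its own): a computation that never ascribes x a type is rejected, while one that does is accepted and has x popped from its output context.

```ocaml
type tp = One
type ctx = (string * tp option) list

(* Option bind, as in the Option module above *)
let ( >>= ) m f = match m with None -> None | Some x -> f x

(* pared-down copy of withvar *)
let withvar x m = fun (ctx : ctx) ->
  m ((x, None) :: ctx) >>= function
  | (r, (y, Some _) :: ctx') when x = y -> Some (r, ctx')
  | (_, (y, None) :: _) when x = y -> None
  | _ -> assert false

(* a computation that ignores x entirely: rejected, since x is never used *)
let unused = withvar "x" (fun ctx -> Some ((), ctx)) []
(* unused = None *)

(* a computation that marks x as used at type 1: accepted, and x is popped *)
let used = withvar "x" (fun _ -> Some ((), [ ("x", Some One) ])) []
(* used = Some ((), []) *)
```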

  let tp_eq (tp1 : tp) (tp2 : tp) = if tp1 = tp2 then return () else fail 

The tp_eq tp1 tp2 function just turns a boolean test into a guard. Now, we can go through the synthesis and checking functions clause-by-clause:

  let rec synth = function
    | Unit -> return One

We synthesize the unit type for the unit value.

    | Pair(e1, e2) -> synth e1 >>= fun tp1 -> 
                      synth e2 >>= fun tp2 -> 
                      return (Tensor(tp1, tp2))

To synthesize a type for a pair, we synthesize types for each of the components, and then return their tensor product.

    | Lam(x, e) -> withvar x (synth e >>= fun ret_tp -> 
                              lookup x >>= fun arg_tp -> 
                              return (Lolli(arg_tp, ret_tp)))

Functions are interesting, because we need to deal with variables, and evaluation order plays out in a neat way here. We infer a type ret_tp for the body e, and then we look up the type arg_tp that the body e ascribed to the variable x. This lets us give the type Lolli(arg_tp, ret_tp) for the whole function.

    | LetUnit(e, e') -> check e One >>= fun () -> 
                        synth e'

To synthesize a type for unit elimination, we synthesize a type for the body, and check that the scrutinee has the unit type One.

    | LetPair(x, y, e, e') -> 
       withvar y (withvar x (synth e' >>= fun res_tp -> 
                             lookup x >>= fun tp1 -> 
                             lookup y >>= fun tp2 -> 
                             check e (Tensor(tp1, tp2)) >>= fun () -> 
                             return res_tp))

To eliminate a pair, we introduce (using withvar) scopes for the variables x and y, and then:

  1. We synthesize a type res_tp for the continuation e'.
  2. Since e' used x and y, we can look up the types they were used at (binding the type of x to tp1 and the type of y to tp2).
  3. Then, we check that the scrutinee e has the type Tensor(tp1, tp2).
  4. Finally, we return the type res_tp for the type of the whole expression.

    | App(_, _) -> fail 
    | Var _ -> fail 

Since applications and variable references are checking, not synthesizing, we fail if we see one of them in synthesizing position. If they are in checking position, we can use the check function to typecheck them:

  and check (e : exp) (tp : tp) : unit t = 
    match e with 
    | Var x -> checkvar x tp 

The variable case simply uses checkvar.

    | App(e1, e2) -> synth e2 >>= fun tp_arg -> 
                     check e1 (Lolli(tp_arg, tp))

To check an application e1 e2 at a type tp, we first synthesize the argument type by inferring a type tp_arg for e2, and then we check that e1 has the function type Lolli(tp_arg, tp).

    | e -> synth e >>= tp_eq tp
end

Finally, when we find a synthesizing term in checking position, we infer a type for it and then see if it is equal to what we expected.

This code is, at best, lightly tested, but I knocked together a small GitHub repository with the code. Enjoy!
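As a quick sanity check, here is the development above condensed into one self-contained snippet (the Option module is inlined as explicit matches, and withvar's invariant violation is collapsed into plain failure), run on the term λx. let () = x in (), which should synthesize the type 1 ⊸ 1:

```ocaml
type tp = One | Tensor of tp * tp | Lolli of tp * tp

type var = string

type exp =
  | Unit
  | LetUnit of exp * exp
  | Pair of exp * exp
  | LetPair of var * var * exp * exp
  | Lam of var * exp
  | App of exp * exp
  | Var of var

type ctx = (var * tp option) list
type 'a t = ctx -> ('a * ctx) option

let return (x : 'a) : 'a t = fun ctx -> Some (x, ctx)
let ( >>= ) (m : 'a t) (f : 'a -> 'b t) : 'b t = fun ctx ->
  match m ctx with None -> None | Some (a, ctx') -> f a ctx'
let fail : 'a t = fun _ -> None

(* the first use of a variable records its type; later uses fail *)
let rec checkvar (x : var) (tp : tp) : unit t = fun ctx ->
  match ctx with
  | [] -> None
  | (y, None) :: rest when x = y -> Some ((), (y, Some tp) :: rest)
  | (y, Some _) :: _ when x = y -> None
  | h :: rest ->
     (match checkvar x tp rest with
      | None -> None
      | Some ((), rest') -> Some ((), h :: rest'))

let lookup (x : var) : tp t = fun ctx ->
  match List.assoc_opt x ctx with
  | Some (Some tp) -> Some (tp, ctx)
  | _ -> None

(* scope a fresh variable; fail unless the body ascribed it a type *)
let withvar (x : var) (m : 'a t) : 'a t = fun ctx ->
  match m ((x, None) :: ctx) with
  | Some (r, (y, Some _) :: ctx') when x = y -> Some (r, ctx')
  | _ -> None

let tp_eq tp1 tp2 = if tp1 = tp2 then return () else fail

let rec synth : exp -> tp t = function
  | Unit -> return One
  | Pair (e1, e2) ->
     synth e1 >>= fun tp1 -> synth e2 >>= fun tp2 ->
     return (Tensor (tp1, tp2))
  | Lam (x, e) ->
     withvar x (synth e >>= fun ret_tp ->
                lookup x >>= fun arg_tp ->
                return (Lolli (arg_tp, ret_tp)))
  | LetUnit (e, e') -> check e One >>= fun () -> synth e'
  | LetPair (x, y, e, e') ->
     withvar y (withvar x (synth e' >>= fun res_tp ->
                           lookup x >>= fun tp1 ->
                           lookup y >>= fun tp2 ->
                           check e (Tensor (tp1, tp2)) >>= fun () ->
                           return res_tp))
  | App _ | Var _ -> fail

and check (e : exp) (tp : tp) : unit t =
  match e with
  | Var x -> checkvar x tp
  | App (e1, e2) -> synth e2 >>= fun arg_tp -> check e1 (Lolli (arg_tp, tp))
  | e -> synth e >>= tp_eq tp

(* λx. let () = x in ()  should synthesize  1 ⊸ 1 *)
let () =
  match synth (Lam ("x", LetUnit (Var "x", Unit))) [] with
  | Some (Lolli (One, One), []) -> print_endline "ok"
  | _ -> print_endline "unexpected"
```

Note that the identity function λx. x is not synthesizable here: Var fails in synthesizing position, so the body of a synthesizing lambda must itself be a synthesizing form that uses the variable in a checking position, as let () = x in () does.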

Friday, May 10, 2019

Inverting Bidirectional Typechecking

\[\newcommand{\bnfalt}{\;\;|\;\;} \newcommand{\To}{\Rightarrow} \newcommand{\From}{\Leftarrow} \newcommand{\rule}[2]{\frac{\displaystyle \begin{matrix} #1 \end{matrix}}{\displaystyle #2}} \newcommand{\check}[3]{{#1} \vdash {#2} \From {#3}} \newcommand{\synth}[3]{{#1} \vdash {#2} \To {#3}} \newcommand{\lam}[1]{\lambda {#1}.\,} \newcommand{\inj}[2]{\mathsf{in}_{#1}({#2})} \newcommand{\case}[5]{\mathsf{case}({#1}, \inj{1}{#2} \to {#3}, \inj{2}{#4} \to {#5})} \newcommand{\match}[2]{\mathsf{match}\;{#1}\;\mathsf{of}\;[ #2 ]} \newcommand{\arm}[2]{{#1} \to {#2}} \newcommand{\downshift}[1]{\downarrow{#1}} \newcommand{\upshift}[1]{\uparrow{#1}} \newcommand{\thunk}[1]{\left\{{#1}\right\}} \newcommand{\return}[1]{\mathsf{return}\;#1} \newcommand{\fun}[1]{\lambda\;#1} \newcommand{\checkn}[3]{{#1} \rhd {#2} \From {#3}} \newcommand{\checkp}[2]{{#1} \leadsto {#2}} \newcommand{\spine}[4]{{#1} \vdash {#2} : {#3} \gg {#4}} \newcommand{\tensor}{\otimes} \newcommand{\lolli}{\multimap} \newcommand{\letv}[2]{\mathsf{let}\;{#1}={#2}\;\mathsf{in}\;} \newcommand{\unit}{\langle\rangle} \newcommand{\letunit}[1]{\letv{\unit}{#1}} \newcommand{\pair}[2]{\left\langle{#1}, {#2}\right\rangle} \newcommand{\letpair}[3]{\letv{\pair{#1}{#2}}{#3}}\]

In the traditional recipe for bidirectional typechecking, introduction forms are checked, and the principal subterms of elimination forms are inferred. However, a while back Noam Zeilberger remarked to me that in multiplicative linear logic, bidirectional typechecking works just as well if you do it backwards. It is worth spelling out the details of this remark; hence this blog post.

First, let's give the types and grammar of multiplicative linear logic.

\[ \begin{array}{llcl} \mbox{Types} & A & ::= & 1 \bnfalt A \tensor B \bnfalt A \lolli B \\ \mbox{Terms} & e & ::= & x \bnfalt \lam{x}{e} \bnfalt e\,e' \\ & & | & \unit \bnfalt \letunit{e}{e'} \\ & & | & \pair{e}{e'} \bnfalt \letpair{x}{y}{e}{e'} \\ \mbox{Contexts} & \Gamma & ::= & \cdot \bnfalt \Gamma, x \From A \\ \end{array} \]

Our types are the unit type \(1\), the tensor product \(A \tensor B\), and the linear function space \(A \lolli B\). The unit and pair have the expected introduction forms \(\unit\) and \(\pair{e}{e'}\), and they have "pattern matching" style elimination forms. Functions are introduced with lambdas \(\lam{x}{e}\) and eliminated with applications \(e\,e'\) as usual, and we have variable references \(x\). Contexts are a bit unusual -- they pair together variables and their types as usual, but instead of treating a variable as a placeholder for a synthesizing term, we treat variables as placeholders for checking terms.

Now, let's go through the typing rules. First, we give the introduction and elimination rules for the unit type.

\[ \begin{array}{ll} \rule{ } { \synth{\cdot}{\unit}{1} } & \rule{ \synth{\Delta}{e'}{A} & \check{\Gamma}{e}{1} } { \synth{\Gamma, \Delta}{\letunit{e}{e'}}{A} } \\[1em] \end{array} \]

The introduction rule says that in an empty context, the unit value \(\unit\) synthesizes the type \(1\). The pattern-matching style elimination \(\letunit{e}{e'}\) typechecks as follows. First, we infer a type \(A\) for the body \(e'\), and then we check that the scrutinee \(e\) has the unit type \(1\).

This order is backwards from traditional bidirectional systems -- we synthesize a type for the continuation first, before checking the type of the data we are eliminating. In the case of units, this is a mere curiosity, but it gets more interesting with the tensor product type \(A \tensor B\).

\[ \begin{array}{ll} \rule{ \synth{\Gamma}{e}{A} & \synth{\Delta}{e'}{B} } { \synth{\Gamma, \Delta}{\pair{e}{e'}}{A \tensor B} } & \rule{ \synth{\Gamma, x \From A, y \From B}{e'}{C} & \check{\Delta}{e}{A \tensor B} } { \synth{\Gamma, \Delta}{\letpair{x}{y}{e}{e'}}{C} } \end{array} \]

Now, the synthesis for pairs remains intuitive. For a pair \(\pair{e}{e'}\), we first infer a type \(A\) for \(e\), and a type \(B\) for \(e'\), and then conclude that the pair has the type \(A \tensor B\). However, the typing of the pair elimination \(\letpair{x}{y}{e}{e'}\) is much wilder.

In this rule, we first check that the continuation \(e'\) has the type \(C\), and then we learn from typechecking \(e'\) that \(x\) and \(y\) were required to have had types \(A\) and \(B\) respectively. This gives us the data that we need to check that \(e\) has the type \(A \tensor B\).

The linear function type \(A \lolli B\) has a similar character:

\[ \begin{array}{ll} \rule{ \synth{\Gamma, x \From A}{e}{B} } { \synth{\Gamma}{\lam{x}{e}}{A \lolli B} } & \rule{ \synth{\Gamma}{e'}{A} & \check{\Delta}{e}{A \lolli B} } { \check{\Gamma, \Delta}{e\,e'}{B} } \end{array} \]

Here, to infer a type for the introduction form \(\lam{x}{e}\), we infer a type \(B\) for the body \(e\), and then look up what type \(A\) the parameter \(x\) was required to have for the body to typecheck. To check that an application \(e\,e'\) has the type \(B\), we infer a type \(A\) for the argument \(e'\), and then check that the function \(e\) has the function type \(A \lolli B\).

Again, the checking/synthesis modes of these rules are precisely reversed from usual bidirectional type systems. We can see how this reversal plays out for variables below:

\[ \begin{array}{ll} \rule{ } { \check{x \From A}{x}{A} } & \rule{ \synth{\Gamma}{e}{A} & A = B} { \check{\Gamma}{e}{B} } \end{array} \]

Here, when we check that the variable \(x\) has the type \(A\), the context must be such that it demands \(x\) to have the type \(A\). (However, the switch between checking and synthesis is the same as ever.)

If you are used to regular bidirectional systems, the information flow in the variable rule (as well as for pattern matching for pairs and lambda-abstraction for functions) is a bit unexpected. We are used to having a context tell us what types each variable has. However, in this case we are not doing that! Instead, we use it to record the types that the rest of the program requires variables to have.
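To make this information flow concrete, consider the term \(\lam{x}{\letunit{x}{\unit}}\). Using only the rules above, we first synthesize \(1\) for the body \(\unit\), and then checking \(x\) against \(1\) is what writes the demand \(x \From 1\) into the context; the lambda rule then reads that demand off to form the function type:

\[ \rule{ \rule{ \rule{ }{ \synth{\cdot}{\unit}{1} } & \rule{ }{ \check{x \From 1}{x}{1} } } { \synth{x \From 1}{\letunit{x}{\unit}}{1} } } { \synth{\cdot}{\lam{x}{\letunit{x}{\unit}}}{1 \lolli 1} } \]

So we conclude that \(\lam{x}{\letunit{x}{\unit}}\) synthesizes the type \(1 \lolli 1\), with the argument type coming out of the derivation rather than going into it.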

This is still a "well-moded" program in the sense of logic programming. However, the moding is a bit more exotic now -- within a context, the variables are inputs, but their types are outputs. This is a bit fancier than the mode systems that usual logic programming languages have, but people have studied mode systems which can support this (such as Uday Reddy's A Typed Foundation for Directional Logic Programming).

As far as the metatheory of this system goes, I don't know very much about it. Substitution works fine -- you can easily prove a theorem of the form:

Theorem (Substitution) If \(\check{\Delta}{e}{A}\), then

  1. If \(\check{\Gamma, x \From A, \Theta}{e'}{C}\) then \(\check{\Gamma, \Delta, \Theta}{[e/x]e'}{C}\).
  2. If \(\synth{\Gamma, x \From A, \Theta}{e'}{C}\) then \(\synth{\Gamma, \Delta, \Theta}{[e/x]e'}{C}\)

However, I don't presently know a good characterization of the kinds of terms that are typable under this discipline. E.g., in the standard bidirectional presentation, the annotation-free terms are precisely the \(\beta\)-normal terms. However, in the reverse bidirectional system, that is not the case.

Two papers that seem closely related to this system are:

  1. Adam Chlipala, Leaf Petersen, and Robert Harper's TLDI 2005 paper, Strict Bidirectional Type Checking, and

  2. Ningning Xie and Bruno C. d. S. Oliveira's ESOP 2018 paper, Let Arguments Go First.

Adam and company's paper includes the traditional synthesizing bidirectional hypotheses, as well as checking hypotheses very similar to the ones in this post, but inspired by relevance logic rather than linear logic. The basic idea is that if a hypothesis is relevant, then it is okay to let checking determine its type, since we are guaranteed that the variable will appear in some checking context (which will tell us what type it should have). The same idea applies here, since linearity necessarily implies relevance.

Ningning and Bruno's paper has an application rule that looks exactly like the one in this post -- argument types are synthesized, which permits inferring the type of a function head in a checking context. However, their system is focused on inferring polymorphic types, which makes the precise relationship a bit unclear to me.

The implementation of reverse bidirectionality is just as easy as traditional bidirectional systems, but I will leave that for my next post.