Wednesday, October 30, 2019

Every Finite Automaton has a Corresponding Regular Expression

Today I'd like to blog about the nicest proof of Kleene's theorem that regular expressions and finite automata are equivalent that I know. One direction of this proof -- that every regular expression has a corresponding finite automaton -- is very well-known to programmers, since Thompson's construction, which converts regular expressions to nondeterministic automata, and the subset construction, which turns nondeterministic automata into deterministic automata, are the basis of every lexer and regexp engine there is.

However, the proof of the other direction -- that every automaton has a corresponding regexp -- is often stated, but seldom proved. This is a shame, since it has an elegant formulation in terms of linear algebra(!).

The Construction

First, let's assume we have a Kleene algebra \((R, 1, \cdot, 0, +, \ast)\) -- for example, the algebra of regular expressions over some alphabet. Now, define the category \(C\) as follows:

  • The objects of \(C\) are the natural numbers.
  • A morphism \(f : n \to m\) is an \(n \times m\) matrix of elements of \(R\). We will write \(f_{(i,j)}\) to get the element of \(R\) in the \(i\)-th row and \(j\)-th column.
  • The identity morphism \(\mathrm{id} : n \to n\) is the \(n \times n\) matrix which is \(1\) on the diagonal and \(0\) off the diagonal.
  • The composition \(f;g\) of \(f : n \to m\) and \(g : m \to o\) is given by matrix multiplication:

\[ (f;g)_{(i, j)} = \sum_{k \in 0\ldots m-1} f_{(i,k)} \cdot g_{(k,j)} \]

This may seem like a rather peculiar thing to do -- matrices filled with regular expressions? However, to understand why this is potentially a sensible idea, let's look at a generic automaton of dimension 2.

[Figure: a two-state automaton with states 0 and 1 and edges 0 -a-> 0, 0 -b-> 1, 1 -c-> 0, and 1 -d-> 1.]

Given such a picture, we can give the transition matrix for this as follows:

\[ A = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & a & b \\ 1 & c & d \\ \end{array} \]

So the \((i,j)\)-th entry tells us the label of the edge from node \(i\) to node \(j\). In our category, this is an endomap \(A : 2 \to 2\). What happens if we compose \(A\) with itself? Since composition is just matrix multiplication, we get:

\[ \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] \cdot \left[\begin{array}{cc} a & b \\ c & d \end{array}\right] = \left[\begin{array}{cc} aa + bc & ab + bd \\ ca + dc & cb + dd \end{array}\right] \]

Notice that the \((i,j)\)-th entry is a regular expression describing the paths of length two! So composition of arrows corresponds to path composition: the identity matrix gives the paths of length 0, the transition matrix gives the paths of length 1, and matrix multiplication gives path concatenation. So it's easy to get a regexp for any bounded-length path. Indeed, it is easy to see that the endomaps of \(n\) form a semiring.
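
For example, the matrix of regexps for the paths of length at most \(k\) is

\[ \sum_{i = 0}^{k} A^i \;=\; (\mathrm{id} + A)^k \]

where the two sides agree because addition is idempotent, so all the binomial coefficients collapse to \(1\).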

So the endomaps of \(n\) are \(n \times n\) matrices, and we can turn them into a semiring \((S, 1_S, \cdot_S, 0_S, +_S)\) as follows:

  • The carrier \(S\) is \(\mathrm{Hom}(n,n)\), the set of \(n \times n\) matrices over \(R\).
  • The unit \(1_S\) is the identity map \(\mathrm{id} : n \to n\).
  • The multiplication \(\cdot_S\) is composition of morphisms (i.e., matrix multiplication). Since these are endomorphisms -- square matrices of a fixed dimension -- composing two of them yields another endomorphism of the same dimension.
  • The zero \(0_S\) is the matrix with all 0 entries.
  • The addition \(f + g\) is the pointwise addition of matrices -- i.e., \((f + g)_{i, j} = f_{i,j} + g_{i,j}\)

Now we can ask: can we equip endomaps with a Kleene star operation as well? Recall that the defining equation for the Kleene star is

\[ s\ast = 1 + s\cdot s\ast \]

which says that we are looking for a reflexive, transitive closure operation. That is, \(s\ast\) consists of the 0-length paths, together with anything in \(s\) followed by anything in \(s\ast\). So we want a definition of the Kleene star which satisfies this equation. Let's consider the case of the size 2 automaton again:

[Figure: the same two-state automaton, with edges 0 -a-> 0, 0 -b-> 1, 1 -c-> 0, and 1 -d-> 1.]

Now, let's think about the paths from 0 back to 0 which don't pass through 0 along the way. There are two classes of them:

  1. The single step \(a\).
  2. We can take a step from 0 to 1 via \(b\), take some number of steps from 1 to itself via \(d\), and then back to 0 via \(c\).

So we can describe these paths from 0 to 0 via the regular expression

\[ F \triangleq a + b(d\ast)c \]

Similarly, the paths from 1 back to 1 which don't pass through 1 along the way are either:

  1. The single step \(d\).
  2. We can take a step from 1 to 0 via \(c\), take some number of steps from 0 to itself via \(a\), and then back to 1 via \(b\).

So these paths are described by \[ G \triangleq d + c(a\ast)b \]

So the set of all paths from 0 to 0 is \(F\ast\) (an arbitrary path from 0 to 0 is just a sequence of these basic segments), and the set of all paths from \(1\) to \(1\) is \(G\ast\). This lets us write down the Kleene star of the whole transition matrix:

\[ A\ast = \begin{array}{l|cc} & 0 & 1 \\\hline 0 & F\ast & {F\ast} \cdot b \cdot {G\ast} \\ 1 & {G\ast} \cdot c \cdot {F\ast} & G\ast \\ \end{array} \]

Proving that this is a correct definition requires a bit of work, so I will skip over it. That's because I want to get to the cool bit, which is that this construction is enough to define the Kleene star for any dimension \(n\)! We'll prove this by well-founded induction on the dimension. The \(0 \times 0\) and \(1 \times 1\) cases are immediate -- for a \(1 \times 1\) matrix we can just use the Kleene star of \(R\) itself. So suppose we have an \(n \times n\) transition matrix \(T\) for \(n > 1\). Then, we can divide the matrix into blocks, with \(A\) and \(D\) as square submatrices:

\[ T = \left[ \begin{array}{c|c} A & B \\\hline C & D \end{array} \right] \]

and then define \(T\ast\) as:

\[ T\ast = \left[ \begin{array}{c|c} F\ast & {F\ast} \cdot B \cdot {G\ast} \\ \hline {G\ast} \cdot C \cdot {F\ast} & G\ast \\ \end{array} \right] \]

where

\[ \begin{array}{lcl} F & \triangleq & A + B(D\ast)C \\ G & \triangleq & D + C(A\ast)B \\ \end{array} \]

By induction, we know that the square matrices of each dimension smaller than \(n\) form a Kleene algebra, so we know that \(A\) and \(D\) have Kleene algebra structure. The reason for defining the categorical structure also becomes apparent here -- even though \(A\) and \(D\) are square, \(B\) and \(C\) may not be, and we will need to make use of the fact that we defined composition on non-square matrices for them.

References and Miscellany

I have, as is traditional for bloggers, failed to prove any of my assertions. Luckily other people have done the work! Two papers that I particularly like are:

One fact which didn't fit into the narrative above, but which I think is extremely neat, is about the linear-algebraic view of the Kleene star. To see what it is, recall the definition of the matrix exponential for a matrix \(A\):

\[ e^A \triangleq \sum_{k = 0}^\infty \frac{A^k}{k!} = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \ldots \]

Now, note that

  1. \(k!\) is an integer
  2. any positive integer can be expressed as a formal sum \(n = \overbrace{1 + \ldots + 1}^{n \,\mbox{times}}\),
  3. and in regular languages, we have \(r + r = r\).

Hence \(k! = 1\), and since \(A/1 = A\), we have

\[ e^A \triangleq \sum_{k = 0}^\infty A^k = I + A + A^2 + A^3 + \ldots \]

But this is precisely the definition of \(A\ast\)! In other words, for square matrices over Kleene algebras, Kleene iteration is precisely the same thing as the matrix exponential.

The Implementation

The nice thing about this construction is that you can turn it into running code more or less directly, which leads to an implementation of the DFA-to-regexp algorithm in less than 100 lines of code. I'll give an implementation in OCaml.

First, let's give a signature for Kleene algebras. All the constructions we talked about are parameterized in the Kleene algebra, so our implementation will be too.

module type KLEENE = sig
  type t 
  val zero : t 
  val ( ++ ) : t -> t -> t 
  val one : t
  val ( * ) : t -> t -> t 
  val star : t -> t 
end

You can see here that we have zero and one as the units for the addition ++ and the multiplication * respectively, and we also have a star operation representing the Kleene star.
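
As a quick sanity check on the signature, here is a minimal example instance, purely for illustration: the booleans, with disjunction as addition, conjunction as multiplication, and a star which is constantly true (this satisfies the defining equation, since true ++ anything is true).

module Booleans : KLEENE with type t = bool = struct
  type t = bool
  let zero = false
  let ( ++ ) = ( || )
  let one = true
  let ( * ) = ( && )
  (* s* = 1 + s·s* holds trivially, because 1 is true. *)
  let star _ = true
end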

Similarly, we can give a signature for matrices over an element type.

module type MATRIX = sig
  type elt
  type t 

  val dom : t -> int
  val cod : t -> int
  val get : t -> int -> int -> elt

  val id : int -> t 
  val compose : t -> t -> t 

  val zero : int -> int -> t 
  val (++) : t -> t -> t 

  val make : int -> int -> (int -> int -> elt) -> t 
end

So this signature requires an element type elt, a matrix type t, and a collection of operations. We can get the number of rows in the matrix with dom and the number of columns with cod, the names coming from thinking of the dimensions as objects and the matrices as the morphisms of a category. Similarly this gives us a notion of identity with id, which takes an integer and returns the identity matrix of that size. Matrix multiplication is just composition, which leads to the name compose.

We also want the ability to create zero matrices with the zero function, and the ability to do pointwise addition of matrices with the (++) operation. Neither of these operations needs square matrices, but adding matrices pointwise does require compatible dimensions.

Finally we want a way to construct a matrix with make, which is given the dimensions and a function to produce the elements.
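
To give a feel for how the signature is meant to be used, here is a tiny helper one could write against it, purely for illustration: transposing a matrix just swaps dom and cod and flips the indices handed to get.

module MatrixUtil (M : MATRIX) = struct
  (* Transpose an m-by-n matrix into an n-by-m one by flipping indices. *)
  let transpose m = M.make (M.cod m) (M.dom m) (fun i j -> M.get m j i)
end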

Now, since the matrix construction was parameterized by a Kleene algebra, we can write a functor Matrix in OCaml which takes a module of type KLEENE and produces an implementation of the MATRIX signature with elements of the Kleene algebra as the element type:

module Matrix(R : KLEENE) : MATRIX with type elt = R.t = 
struct
  type elt = R.t 
  type t = {
      row : int; 
      col: int; 
      data : int -> int -> R.t
    }

We take the element type to be elements of the Kleene algebra, and take a matrix to be a record with the number of rows, the number of columns, and a function to produce the entries. This is not a high-performance choice, but it works fine for illustration purposes.

  let make row col data = {row; col; data}
  let dom m = m.row
  let cod m = m.col

  let get m i j = 
      assert (0 <= i && i < m.row);
      assert (0 <= j && j < m.col);
      m.data i j 

To make a matrix, we just use the data from the constructor to build the appropriate record, and to get the domain and codomain we just select the row or column fields. To get an element we just call the underlying function, with an assertion check to ensure the arguments are in bounds.

This is a point where a more dependently-typed language would be nice: if the dimensions of the matrix were tracked in the type then we could omit the assertion checks. (Basically, the matrix type would become a type constructor t : nat -> nat -> type, corresponding to the hom-sets.)
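
One can get partway there in plain OCaml with phantom type parameters. The signature below is only a sketch of the idea -- it still needs some mechanism, such as one module per size, to tie the runtime integer to the type parameters:

module type MATRIX_INDEXED = sig
  type elt
  (* ('n, 'm) t is a matrix whose dimensions are tracked as phantom
     type parameters, playing the role of the hom-set Hom(n, m). *)
  type ('n, 'm) t
  val id : int -> ('n, 'n) t
  val compose : ('n, 'm) t -> ('m, 'k) t -> ('n, 'k) t
  val ( ++ ) : ('n, 'm) t -> ('n, 'm) t -> ('n, 'm) t
end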

  let id n = make n n (fun i j -> if i = j then R.one else R.zero)

  let compose f g = 
    assert (cod f = dom g);
    let data i j = 
      let rec loop k acc = 
        if k < cod f then
          loop (k+1) R.(acc ++ (get f i k * get g k j))
        else
          acc
      in
      loop 0 R.zero
    in
    make (dom f) (cod g) data

The id function takes an integer, and returns a square matrix that is R.one on the diagonal and R.zero off the diagonal. To compose two matrices, the routine checks that the dimensions match, and then implements a very naive row-times-column dot product for each entry. What makes this naive is that the dot product gets recalculated every time an entry is looked up!
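
If that ever became a problem, one easy patch would be to memoize the entry function inside the functor, so each (i, j) entry is computed at most once. A quick sketch:

  (* Wrap a matrix so that entries are cached in a hash table keyed by
     their (row, column) position. *)
  let memoize m =
    let table = Hashtbl.create 16 in
    let data i j =
      match Hashtbl.find_opt table (i, j) with
      | Some x -> x
      | None -> let x = m.data i j in Hashtbl.add table (i, j) x; x
    in
    { m with data }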

  let zero row col = make row col (fun i j -> R.zero)

  let (++) m1 m2 = 
    assert (dom m1 = dom m2);
    assert (cod m1 = cod m2);
    let data i j = R.(++) (get m1 i j) (get m2 i j) in
    make (dom m1) (cod m1) data 
end                           

Finally we can implement the zero matrix operation, and matrix addition, both of which work the expected way.

We will use all these operations to show that square matrices form a Kleene algebra. To show this, we will define a functor Square which is parameterized by

  1. A Kleene algebra R.
  2. An implementation of matrices with element type R.t
  3. A size (i.e., an integer)

module Square(R : KLEENE)
             (M : MATRIX with type elt = R.t)
             (Size : sig val size : int end) : KLEENE with type t = M.t = 
struct
  type t = M.t 

The implementation type is the matrix type M.t, and the semiring operations are basically inherited from the module M:

  let (++) = M.(++)
  let ( * ) = M.compose
  let one = M.id Size.size
  let zero = M.zero Size.size Size.size

The only changes are that one and zero are restricted to square matrices of size Size.size, and so we don't need to pass them dimension arguments.

All of the real work is handled in the implementation of the Kleene star. We can follow the construction very directly. First, we can define a split operation which splits a square matrix into four smaller pieces:

  let split i m = 
    let open M in 
    assert (i >= 0 && i < dom m);
    let k = dom m - i in 
    let shift m x y = fun i j -> get m (i + x) (j + y) in 
    let a = make i i (shift m 0 0) in
    let b = make i k (shift m 0 i) in
    let c = make k i (shift m i 0) in
    let d = make k k (shift m i i) in 
    (a, b, c, d)

This divides a matrix into four, by finding the split point i, and then making new matrices that use the old matrix's data with a suitable offset. So the entry in b's x-th row and y-th column will be the original's entry in row x and column (y + i), for example.

Symmetrically, we need a merge operation.

  let merge (a, b, c, d) = 
    assert (M.dom a = M.dom b && M.cod a = M.cod c);
    assert (M.dom d = M.dom c && M.cod d = M.cod b);
    let get i j = 
      match i < M.dom a, j < M.cod a with
      | true, true   -> M.get a i j 
      | true, false  -> M.get b i (j - M.cod a)
      | false, true  -> M.get c (i - M.dom a) j 
      | false, false -> M.get d (i - M.dom a) (j - M.cod a)
    in
    M.make (M.dom a + M.dom d) (M.cod a + M.cod d) get 

To merge four matrices, their dimensions need to match up properly, and then the lookup operation get needs to use the indices to decide which quadrant to look up the answer in.
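
A cheap sanity check one could add next to these two functions while developing -- it isn't needed for the construction -- is that merging the pieces of a split gives back a matrix with exactly the same entries:

  (* For a square matrix m, check that merge undoes split, entry by entry.
     This uses structural equality on the elements, which is fine for the
     regexp datatype used below. *)
  let check_split_merge m i =
    let m' = merge (split i m) in
    let ok = ref true in
    for r = 0 to M.dom m - 1 do
      for c = 0 to M.cod m - 1 do
        if M.get m r c <> M.get m' r c then ok := false
      done
    done;
    !ok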

The split and merge operations can then be used to write the star function:

  let rec star m = 
    match M.dom m with
    | 0 -> m 
    | 1 -> M.make 1 1 (fun _ _ -> R.star (M.get m 0 0))
    | n -> let (a, b, c, d) = split (n / 2) m in
           let fstar = star (a ++ b * star d * c) in 
           let gstar = star (d ++ c * star a * b) in 
           merge (fstar, fstar * b * gstar,
                  gstar * c * fstar, gstar)
end

This function looks at the dimension of the matrix. A 0-dimensional matrix is its own star, and for dimension 1 we can just use the underlying Kleene star on the single entry of the matrix. If the square matrix is bigger, we split it into four quadrants and then directly apply the construction we saw earlier. We try to split as nearly in half as we can, because this reduces the number of recursive calls we need to make.
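
As an aside, instantiating the same functors with the boolean Kleene algebra sketched earlier computes ordinary graph reachability: the star of an adjacency matrix is its reflexive-transitive closure. For example (purely for illustration):

module BM = Matrix(Booleans)
module BSq = Square(Booleans)(BM)(struct let size = 3 end)

(* Adjacency matrix of the three-node path graph 0 -> 1 -> 2. *)
let adj = BM.make 3 3 (fun i j -> j = i + 1)
let reach = BSq.star adj
(* BM.get reach 0 2 is true (2 is reachable from 0),
   while BM.get reach 2 0 is false. *)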

As an example of using this, let's see how we can convert automata given by state-transition functions to regular expressions.

First, we define a trivial Kleene algebra using a datatype of regular expressions, and use this to set up a module of matrices:

module Re = struct
  type t = 
    | Chr of char
    | Seq of t * t 
    | Eps
    | Bot 
    | Alt of t * t 
    | Star of t 

  let zero = Bot
  let one = Eps
  let ( * ) r1 r2 = Seq(r1, r2)
  let ( ++ ) r1 r2 = Alt(r1, r2)
  let star r = Star r 
end

module M = Matrix(Re)

Then, we can construct a Kleene algebra of matrices of size 2, and then give a 2-state automaton by its transition function.

module T = Square(Re)(M)(struct let size = 2 end)
let t = 
  M.make 2 2 (fun i j -> 
      Re.Chr (match i, j with
          | 0, 0 -> 'a'
          | 0, 1 -> 'b'
          | 1, 0 -> 'c'
          | 1, 1 -> 'd'
          | _ -> assert false))

Now, we can use the Kleene star on T to compute the transitive closure of this transition function:
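
let tstar = T.star t

Reading off the entry for paths from 0 to 1 at the toplevel then gives: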

# M.get tstar 0 1;;
- : M.elt =
Seq (Star (Alt (Chr 'a', Seq (Chr 'b', Seq (Star (Chr 'd'), Chr 'c')))),
 Seq (Chr 'b',
  Star (Alt (Chr 'd', Seq (Chr 'c', Seq (Star (Chr 'a'), Chr 'b'))))))

Voila! A regexp corresponding to the possible paths from 0 to 1.
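
As a final aside, the extracted regexps can pick up redundant subterms -- units and zeros introduced during the matrix algebra. A possible refinement, sketched here purely as a suggestion, is a variant of Re whose "smart constructors" apply the unit and zero laws as terms are built:

module ReSimpl : KLEENE with type t = Re.t = struct
  open Re
  type t = Re.t

  let zero = Bot
  let one = Eps

  (* Sequencing, simplified with 0·r = r·0 = 0 and 1·r = r·1 = r. *)
  let ( * ) r1 r2 =
    match r1, r2 with
    | Bot, _ | _, Bot -> Bot
    | Eps, r | r, Eps -> r
    | _ -> Seq (r1, r2)

  (* Alternation, simplified with 0 + r = r + 0 = r. *)
  let ( ++ ) r1 r2 =
    match r1, r2 with
    | Bot, r | r, Bot -> r
    | _ -> Alt (r1, r2)

  (* Star, simplified with 0* = 1* = 1. *)
  let star = function
    | Bot | Eps -> Eps
    | r -> Star r
end

Passing ReSimpl to Matrix and Square in place of Re should then produce noticeably smaller terms.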

The code is available as a GitHub gist. Bugfixes welcome!

4 comments:

  1. Nice post! A visual version of this proof can be had by deleting nodes in the automaton and inserting the corresponding shortcut edges in order to keep the language between the other nodes the same.

    If we have a node j with incoming edges a_ij and self loop a_jj and outgoing edges a_jk, then we insert shortcut edges (i -> k) with label a_ij a_jj* a_jk.

    After deleting all nodes, except for the chosen start and end node, we will be left with a single edge labeled with the regexp of the language between the start and end node.

  2. I believe your equation (and code) for $T^\ast$ is incorrect. The correct version is Eq (21) in the Kozen paper that you linked to.

    Replies
    1. This comment has been removed by the author.

    2. If you want a quick thing to test against, you can compare your $T^\ast$ to $(I - T)^{-1}$ where $T$ is a real-valued square matrix.
