Brzozowski derivatives are one of the shibboleths of functional programming: if you ask someone about implementing regular expressions, and you get back an answer involving derivatives of regular expressions, then you have almost surely identified a functional programmer.
The reason that functional programmers like derivatives so much is that they offer an elegantly algebraic and inductive approach to string matching. The derivative function $\delta_c$ takes a character $c$ and a regular expression $r$, and returns a new regular expression $r'$, which accepts a string $s$ if and only if $r$ accepts $c\cdot s$. This function can be defined as follows: $$ \begin{array}{lcl} \delta_c(c) & = & \epsilon \\ \delta_c(c') & = & \bot \\ \delta_c(\epsilon) & = & \bot \\ \delta_c(r_1 \cdot r_2) & = & \left\{ \begin{array}{ll} \delta_c(r_1) \cdot r_2 \vee \delta_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\ \delta_c(r_1) \cdot r_2 & \mbox{otherwise} \end{array} \right. \\ \delta_c(\bot) & = & \bot \\ \delta_c(r_1 \vee r_2) & = & \delta_c(r_1) \vee \delta_c(r_2) \\ \delta_c(r\ast) & = & \delta_c(r)\cdot r\ast \end{array} $$
Now we can turn the 1-character derivative into a word derivative as follows: $$ \begin{array}{lcl} \delta_\epsilon(r) & = & r \\ \delta_{c\cdot w}(r) & = & \delta_w(\delta_c(r)) \end{array} $$
Again, this is a simple inductive definition. So to match a string we just call the 1-character derivative for each character, and then check to see if the final regular expression is nullable. However, there's a catch! As you can see from the definition of $\delta_c$, there is no guarantee that the size of the derivative is bounded. So matching a long string can potentially lead to the construction of very large derivatives. This is especially unfortunate, since it is possible (using DFAs) to match strings using regular expressions in constant space.
In his 1962 paper, Brzozowski showed that the number of derivatives of a regular expression is finite, if you quotient by the associativity, commutativity, and idempotence axioms for for union $r_1 \vee r_2$. That is, suppose we consider regular expressions modulo an equivalence relation $\equiv$ such that: $$ \begin{array}{lcl} r \vee r & \equiv & r \\ r \vee \bot & \equiv & r \\ r \vee r' & \equiv & r' \vee r \\ r_1 \vee (r_2 \vee r_3) & \equiv & (r_1 \vee r_2) \vee r_3 \\ \end{array} $$
Then his theorem guarantees that the set of derivatives is finite. However, computing with derivatives up to equivalence is rather painful. Even computing equality and ordering is tricky, since union can occur anywhere in a regular expression, and without that it's difficult to implement higher-level data structures such as sets of regular expressions.
So for the most part, derivatives have remained a minor piece of functional programming folklore: they are cute, but a good implementation of derivatives is not really simple enough to displace the usual Dragon Book automata constructions.
However, in 1995 Valentin Antimirov introduced the notion of a partial derivative of a regular expression. The idea is that if you have a regular expression $r$ and a character $c$, then a partial derivative is a regular expression $r'$ such that if $r'$ accepts a word $s$, then $r$ accepts $c \cdot s$. Unlike a derivative, this is only a if-then relationship, and not an if-and-only-if relationship.
Then, rather than taking regular expressions modulo the ACUI equations, we can construct sets of partial derivatives, which collectively accept the same strings as the Brzozowski derivative. The idea is that we can use the algebraic properties of sets to model the effect of the ACUI equations. Below, I give the partial derivative function: $$ \begin{array}{lcl} \alpha_c(c) & = & \setof{\epsilon} \\ \alpha_c(c') & = & \emptyset \\ \alpha_c(\epsilon) & = & \emptyset \\ \alpha_c(r_1 \cdot r_2) & = & \left\{ \begin{array}{ll} \comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} \cup \alpha_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\ \comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} & \mbox{otherwise} \end{array} \right. \\ \alpha_c(\bot) & = & \emptyset \\ \alpha_c(r_1 \vee r_2) & = & \alpha_c(r_1) \cup \alpha_c(r_2) \\ \alpha_c(r\ast) & = & \comprehend{ r' \cdot r\ast }{ r' \in \alpha_c(r) } \end{array} $$
Partial derivatives can be lifted to words just as ordinary derivatives can be, and it is relatively easy to prove that the set of partial word derivatives of a regular expression is finite. We can show this even without taking a quotient, and so partial derivatives lend themselves even more neatly to an efficient implementation than Brzozowski derivatives do.
I'll illustrate this point by using Antimirov derivatives to construct a DFA-based regular expression matcher in ~50 lines of code. You can find the Ocaml source code here.
First, let's define a datatype for regular expressions.
type re = C of char | Nil | Seq of re * re | Bot | Alt of re * re | Star of re
So C
is the constructor for single-character strings,
Nil
and Seq
correspond to $\epsilon$ and
$r\cdot r'$, and Bot
and Alt
correspond to
$\bot$ and $r \vee r'$.
Next, we'll define the nullability function, which returns true
if the regular expression accepts the empty string, and false
if
it doesn't. It's the obvious recursive definition.
let rec null = function | C _ | Bot -> false | Nil | Star _ -> true | Alt(r1, r2) -> null r1 || null r2 | Seq(r1, r2) -> null r1 && null r2Now come some scaffolding code. The
R
module is the type
of finite sets of regular expressions, the M
module are
finite maps keyed by sets of regexps, and I
is the type
of finite sets of int
s. The rmap
function
maps a function over a set of regexps, building a new regexp. (I don't
know why this is not in the standard Set
signature.)
module R = Set.Make(struct type t = re let compare = compare end) let rmap f rs = R.fold (fun r -> R.add (f r)) rs R.empty module M = Map.Make(R) module I = Set.Make(struct type t = int let compare = compare end)The
aderiv
function implements the Antimirov derivative
function $\alpha_c$ described above. It's basically just a direct
transcription of the mathematical definition into Ocaml. The deriv
function applies the derivative to a whole set of regular expressions and
takes the union.
let rec aderiv c = function | C c' when c = c' -> R.singleton Nil | C _ | Nil | Bot -> R.empty | Alt(r, r') -> R.union (aderiv c r) (aderiv c r') | Seq(r1, r2) -> R.union (rmap (fun r1' -> Seq(r1', r2)) (aderiv c r1)) (if null r1 then aderiv c r2 else R.empty) | Star r -> rmap (fun r' -> Seq(r', Star r)) (aderiv c r) let deriv c rs = R.fold (fun r acc -> R.union (aderiv c r) acc) rs R.empty
Since the set of partial derivatives is finite, this means that the powerset of this set is also finite, and so by treating sets of partial derivatives as states, we can construct a deterministic finite-state automaton for matching regular expressions. Let's define an Ocaml type for DFAs:
type dfa = {size : int; fail : int; trans : (int * char * int) list; final : int list}
Here, size
is the number of states, and we'll use
integers in the range [0,size)
to label the states. We'll
use fail
to label the sink state for non-matching strings,
and take trans
to be the list of transitions. The final
field is a list of accepting states for the DFA.
Now, we'll need a little more scaffolding. The enum
function
is a "functional for-loop" looping from i
to max
.
We use this to write charfold
, which lets us fold over all of the
ASCII characters.
let rec enum f v i max = if i < max then enum f (f i v) (i+1) max else v let charfold f init = enum (fun i -> f (Char.chr i)) init 0 256
We use this to define the dfa
function, which constructs
a DFA from a regular expression:
let find rs (n,m) = try M.find rs m, (n,m) with _ -> n, (n+1, M.add rs n m) let dfa r = let rec loop s v t f rs = let (x, s) = find rs s in if I.mem x v then (s, v, t, f) else charfold (fun c (s, v, t, f) -> let rs' = deriv c rs in let (y, s) = find rs' s in loop s v ((x,c,y) :: t) f rs') (s, I.add x v, t, if R.exists null rs then x :: f else f) in let (s, v, t, f) = loop (0, M.empty) I.empty [] [] (R.singleton r) in let (fail, (n, m)) = find R.empty s in { size = n; fail = fail; trans = t; final = f }
The find
function takes a set of regular expression and returns
a numeric index for it. To do this, it uses a state (n, m)
, where
n
is a counter for a gensym, and m
is the map we use
to map sets of regular expressions to their indices.
The main work happens in the loop
function. The s
parameter is the state parameter for find
, and v
is
the visited set storing the set of states we have previously visited. The t
parameter is the list of transitions built to date, and f
are the
final states generated so far.
The rs
parameter is the current set of regular
expressions. We first look up its index x
, and if we have
visited it, we return. Otherwise, we add the current state to the
visited set (and to the final set if any of its elements are
nullable), and iterate over the ASCII characters. For each character
c
, we can take the derivative of rs
, and
find its index y
. Then, we can add the transition
(x, c, y)
to t
, and loop on the derivative.
Essentially, this does a depth-first search of the spanning tree.
We then kick things off with empty states and the singleton set of
the argument regexp r
, and build a DFA from the return
value. Note that the failure state is simply the index corresponding to the
empty set of partial derivatives. The initial state will always be 0, since
the first state we find
will be the singleton set r
.
Since we have labelled states by integers from 0 to size
,
we can easily build a table-based matcher from our dfa
type.
type table = { m : int array array; accept : bool array; error : int } let table d = { error = d.fail; accept = Array.init d.size (fun i -> List.mem i d.final); m = (let a = Array.init d.size (fun _ -> Array.make 256 0) in List.iter (fun (x, c, y) -> a.(x).(Char.code c) <- y) d.trans; a) }
Here, the table
type has a field m
for
the transition table, an array of booleans indicating whether the
state accepts or not, and the error state. We build the table in the
obvious way in the table
function, by building an array
of array of integers and initializing it using the list of transitions.
Then, we can give the matching function. The matches'
function takes a table t
, a string s
, an
index i
, and a current state x
. It will just
trundle along updating its state, as long as we have not yet reached
the end of the string (or hit the error state), and then it will
report whether it ends in an accepting state or not. The
re_match
function just calls matches'
at
index 0, in state 0.
let rec matches' t s i x = if i < String.length s && x != t.error then matches' t s (i+1) t.m.(x).(Char.code s.[i]) else t.accept.(x) let re_match t s = matches' t s 0 0
And that's it! Obviously more work is needed to make this a real regexp implementation, but the heart of the story is here. It's also possible to improve this implementation a bit further by adding a DFA minimization step, but that's a tale for another day. (DFA minimization is a very pretty little coinductive algorithm, and so makes a nice companion to the inductive story here.)