Regex Calculus

  • Directly convert a regular expression to a deterministic finite automaton (DFA). No NFAs needed.

  • Support rich regular expressions. In particular, we have logical AND and NOT, e.g. [a-z]*&!(()|do|for|if|while).

  • Instantly obtain small and often minimal DFAs in typical applications.

This elegant and powerful method was almost forgotten. It was rescued from obscurity by Owens, Reppy, and Turon, in Regular-expression derivatives reexamined. May regex derivatives one day earn their rightful place in computer science.

Character Classes

In traditional definitions of regular expressions, a literal character is a regular expression constant that represents a set, or language, containing just that single character.

The alphabet is typically assumed to be small enough so that, for example, we can cheaply iterate over each symbol of the alphabet. This is unrealistic for Unicode, and arguably for ASCII too. Thus instead of literal characters, we’ll take our regular expression constants to be character classes or character sets.

To avoid distractions, we make do with a simple implementation of character classes. We represents a character class either as Pos [Char] where the list holds the characters of the class, or as Neg [Char] where the list holds the characters outside the class. The function elemCC tests if a given character is in a given character class.

data CharClass = Pos String | Neg String deriving (Eq, Ord)

elemCC :: Char -> CharClass -> Bool
elemCC c (Pos cs) = c `elem` cs
elemCC c (Neg cs) = c `notElem` cs

For example:

elemCC 'a' $ Pos "abc"
elemCC 'z' $ Pos "abc"

elemCC 'a' $ Neg "abc"
elemCC 'z' $ Neg "abc"

We write [c0c1…​] for the set of characters c0, c1, …​, and [^c0c1…​] for the complement of this set. We may omit the square brackets for singleton sets, and we use hyphens to denote ranges of characters. For example, x means [x] and [a-d5-9] means [abcd56789].

The empty character class \([]\) rejects all strings; even the empty string.

nubSort [] = []
nubSort (x:xt) = nubSort (filter (< x) xt) ++ [x] ++ nubSort (filter (> x) xt)

meta :: String
meta = "\\|&!*.[]()"

instance Show CharClass where
  show cc = case cc of
    Pos []  -> "[]"
    Neg []  -> "."
    Pos [c] -> showSingle c
    Neg [c] -> concat ["[^", showSingle c, "]"]
    Pos s   -> concat ["[" , f $ nubSort s, "]"]
    Neg s   -> concat ["[^", f $ nubSort s, "]"]
    where
    showSingle c | c `elem` meta = '\\':[c]
                 | otherwise     = [c]
    f ""    = ""
    f [c]   = [c]
    f (c:t) = rangeFinder c c t
    rangeFinder lo hi (h:t) | h == succ hi = rangeFinder lo h t
    rangeFinder lo hi t
      |            lo  == hi = lo:f t
      |       succ lo  == hi = lo:hi:f t
      | succ (succ lo) == hi = [lo..hi] ++ f t
      | otherwise            = lo:'-':hi:f t

For example:

Pos "packmyboxwithfivedozenliquorjugs"
Neg "helloworldmn"

A more sophisticated data structure might maintain sets of ranges.

Regexes, irregularly

Apart from constants, regexes are built from other regexes \(r\) and \(s\):

  • \(rs\): concatenation.

  • \(r\mid s\): logical or (alternation); the union of the two languages.

  • \(r\mbox{*}\): Kleene closure; zero or more strings of \(r\) concatenated together.

  • \(r\& s\): logical and; the intersection of the two languages.

  • \(!r\): logical not (complement); accepts a string if and only if \(r\) rejects it.

Let \(() = []\mbox{*}\), that is, the Kleene closure of the empty language. Observe \(()\) accepts only the empty string,

All the binary operators are associative, so we define a regex data type that represents chains of the same operator with list of the arguments. This simplifies our code later, when we sort arguments for commutative operators, and remove duplicates or identities.

data Re = OneOf CharClass | Kleene Re | ReCat [Re]
  | ReOr [Re] | ReAnd [Re] | ReNot Re
  deriving (Eq, Ord)

instance Show Re where
  showsPrec p = \case
    OneOf s -> shows s
    Kleene (OneOf (Pos [])) -> ("()"++)
    Kleene r -> showsPrec 4 r . ("*"++)
    ReNot r  -> showParen (p > 3) $ ('!':) . showsPrec 3 r
    ReCat rs -> showParen (p > 2) $ foldr (.) id $ showsPrec 2 <$> rs
    ReAnd rs -> showParen (p > 1) $ foldr (.) id $
      intersperse ('&':) $ showsPrec 1 <$> rs
    ReOr rs  -> showParen (p > 0) $ foldr (.) id $
      intersperse ('|':) $ showsPrec 0 <$> rs

If my compiler supported it, the following would be patterns. But they’re still useful as definitions.

-- The regex `()`. The language containing only the empty string.
eps :: Re
eps = Kleene noGood

-- The regex `[]`. The empty language.
noGood :: Re
noGood = OneOf $ Pos []

-- The regex `.*`. The language containing everything.
allGood :: Re
allGood = Kleene $ OneOf $ Neg []

Regex exercises

Let \(f\) be a regex.

1. Does \(f\) accept the empty string?

If so, we say \(f\) is nullable.

Character sets never accept the empty string. The Kleene closure always accepts the empty string. Concatenation does if and only if \(r\) and \(s\) do. The logical operations commute with nullability.

nullable :: Re -> Bool
nullable = \case
  OneOf _  -> False
  Kleene _ -> True
  ReCat rs -> all nullable rs
  ReOr  rs -> any nullable rs
  ReAnd rs -> all nullable rs
  ReNot r  -> not $ nullable r

For example:

nullable eps
nullable noGood
nullable allGood

2. What regex do we get after feeding the character \(c\) to \(f\)? In other words, what regex accepts a string \(s\) precisely when \(f\) accepts \(c\) followed by \(s\)?

We call this regex the derivative of the regex \(f\) with respect to the character \(c\), and write \(\partial_c f\). For example, \(\partial_a\) ab*c|d*e*f|g*ah = b*c|h.

For a character set, the answer is \(()\) if \(c\) is a member. Otherwise it’s \([]\). For example, \(\partial_a\) [abc] = () and \(\partial_a\) [xyz] = [].

For the other cases, let us follow Lagrange and use a prime mark to denote a derivative (with respect to \(c\)). We find \(r'\) and \(s'\), that is, we recursively answer the question for \(r\) and \(s\). Then the logical operations commute with taking derivatives:

  • \((r\mid s)' = r' \mid s'\)

  • \((r\& s)' = r' \& s'\)

  • \((!r)' = !r'\)

For the Kleene closure, we find:

  • \((r\mbox{*})' = r'r\mbox{*}\)

The trickiest is concatenation: \(rs\). If \(r\) is nullable, that is, if \(r\) accepts the empty string, then the derivative is \(r’s\mid s'\), otherwise it is just \(r’s\).

deriveNaive :: Char -> Re -> Re
deriveNaive c = go where
  go = \case
    OneOf cc | elemCC c cc -> eps
             | otherwise   -> noGood
    Kleene r               -> ReCat [go r, Kleene r]
    ReCat (r:rt)
      | nullable r         -> ReOr [ReCat $ go r : rt, go $ ReCat rt]
      | otherwise          -> ReCat $ go r : rt
    ReAnd rs               -> ReAnd $ go <$> rs
    ReOr  rs               -> ReOr  $ go <$> rs
    ReNot r                -> ReNot $ go r

For example:

abcsxyz = ReOr [Kleene $ s "abc", ReCat [s "x", s "y", s "z"]]
  where s = OneOf . Pos

abcsxyz
deriveNaive 'a' abcsxyz
deriveNaive 'd' abcsxyz
deriveNaive 'x' abcsxyz
deriveNaive 'y' $ deriveNaive 'x' abcsxyz

It works, but the derivatives grow quickly, and are full of dead code. We address this by folding constants as we derive:

mkCat:: [Re] -> Re
mkCat xs
  | noGood `elem` zs = noGood
  | null zs          = eps
  | [z] <- zs        = z
  | otherwise        = ReCat zs
  where
  zs = filter (/= eps) flat
  flat             = concatMap deCat xs
  deCat (ReCat rs) = rs
  deCat r          = [r]

mkOr :: [Re] -> Re
mkOr xs
  | allGood `elem` zs = allGood
  | null zs           = noGood
  | [z] <- zs         = z
  | otherwise         = ReOr zs
  where
  zs = nubSort $ filter (/= noGood) flat
  flat           = concatMap deOr xs
  deOr (ReOr rs) = rs
  deOr r         = [r]

mkAnd :: [Re] -> Re
mkAnd xs
  | noGood `elem` zs = noGood
  | null zs          = allGood
  | [z] <- zs        = z
  | otherwise        = ReAnd zs
  where
  zs = nubSort $ filter (/= allGood) flat
  flat             = concatMap deAnd xs
  deAnd (ReAnd rs) = rs
  deAnd r          = [r]

mkKleene :: Re -> Re
mkKleene (Kleene s) = mkKleene s
mkKleene r          = Kleene r

mkNot :: Re -> Re
mkNot (OneOf (Pos [])) = allGood
mkNot (ReNot s)        = s
mkNot r                = ReNot r

derive :: Char -> Re -> Re
derive c re = case re of
  OneOf cc | elemCC c cc -> eps
           | otherwise   -> noGood
  Kleene r               -> mkCat [derive c r, mkKleene r]
  ReCat (r : rt)
    | nullable r         -> mkOr [mkCat $ derive c r : rt, derive c $ ReCat rt]
    | otherwise          -> mkCat $ derive c r : rt
  ReAnd rs               -> mkAnd $ derive c <$> rs
  ReOr  rs               -> mkOr  $ derive c <$> rs
  ReNot r                -> mkNot $ derive c r

Much better:

derive 'a' abcsxyz
derive 'd' abcsxyz
derive 'x' abcsxyz
derive 'y' $ derive 'x' abcsxyz

Reading Regexes

We write parser combinators to parse regexes. We use the above constructors to get smaller regexes, and also so it’s more likely that regexes for the same language look the same.

We deviate from conventional syntax slightly. We add the metacharacters & and ! for logical and and logical not. We lack + and ?, but:

  • r+ is equivalent to !()&r*

  • r? is equivalent to ()|r

jsEval "curl_module('../compiler/Charser.ob')"
import Charser
regex :: Charser Re
regex = top <* eof where
  top = mkOr <$> ands `sepBy` char '|'
  ands = mkAnd <$> cats `sepBy` char '&'
  cats = mkCat <$> many nots
  nots = (char '!' >> mkNot <$> nots) <|> (atm >>= kle)
  atm = chCl
    <|> const (OneOf $ Neg []) <$> char '.'
    <|> char '(' *> top <* char ')'
  kle :: Re -> Charser Re
  kle r = char '*' *> kle (mkKleene r) <|> pure r
  chCl = fmap OneOf $ (Pos . (:[]) <$> single)
     <|> char '[' *> parity <* char ']'
  parity = (char '^' *> pure Neg <|> pure Pos) <*>
    (nubSort . concat <$> many rng)
  rng = alphaNumChar >>= \lo -> hiEnd lo <|> pure [lo]
  single = char '\\' *> oneOf meta <|> noneOf meta
  hiEnd lo = do
    hi <- char '-' *> alphaNumChar
    when (hi < lo) $ Charser $ const $ Left "invalid range end"
    pure [lo..hi]

For example:

parse regex "" "[he-ll-oworld]*&![]*"

Regexes restated

A regex operation is completely defined by its nullability and its derivative. The following function determines if a given regex accepts a given string:

accepts :: Re -> String -> Bool
accepts re ""    = nullable re
accepts re (c:s) = accepts (derive c re) s

For example:

accepts abcsxyz <$>
  [ "cccbbacacbca"
  , "abcd"
  , "xyz"
  , "abcxyz"
  ]

Thus we can view matching a string against a regex as a running a deterministic automaton where each state is a regex, and each transition is differentiation with respect to a character, and nullable regexes are precisely the accepting states:

circ r [x, y] = "<circle r='" ++ show r ++ "' cx='" ++ shows x "' cy='" ++ shows y "' fill='none' stroke='black'></circle>"

textAt [x, y] s = concat
    [ "<text x='" ++ show x ++ "' y='" ++ show y ++ "'"
    , " text-anchor='middle' alignment-baseline='central' style='font-size:8px;font-family:monospace'>" ++ s ++ "</text>"
    ]

arrow :: [Double] -> [Double] -> String -> String
arrow [x0, y0] [x1, y1] lbl = concat
  [ "<line x1='" ++ shows x0 "' y1='" ++ shows y0 "'"
  , "x2='" ++ shows x1 "' y2='" ++ shows y1 "'"
  , "fill='none' stroke='lightgrey' stroke-width='1' marker-end='url(#Triangle)'></line>"
  , textAt [(x0 + x1)/2, (y0 + y1)/2] lbl
  ]

let
  Right re = parse regex "" "ab*|c*ad"
  in jsEval $ concat
  [ "oneEdgeExample.innerHTML = `"
  , [r|<svg xmlns="http://www.w3.org/2000/svg" width="100%" viewBox="-60 -60 420 120"><defs><marker id="Triangle" viewBox="0 0 10 10" refX="1" refY="5" orient="auto" markerWidth="6" markerHeight="6"><path d="M 0 0 L 10 5 L 0 10 z"></path></marker></defs>|]
  , circ 20 [50, 0]
  , textAt [50, 0] $ show re
  , circ 24 [250, 0]
  , circ 20 [250, 0]
  , textAt [250, 0] $ show $ derive 'a' re
  , arrow [70, 0] [221, 0] "a"
  , "</svg>`;"
  ]

But are there a finite number of states? What happens if we repeatedly differentiate with respect to all possible symbols?

It turns out that eventually we’ll see the same old set of regexes, provided we exploit knowing that logical disjunction is:

  • idempotent: \(r\mid r = r\)

  • commutative: \(r\mid s = s\mid r\)

  • associative: \((r\mid s)\mid t = r\mid (s\mid t)\)

This makes sense intuitively, because deriving usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.

Our smart constant-folding constructors incorporate the above rules; indeed they apply even more rules, so we are more likely to identify equivalent regexes and produce fewer states. In principle we could use them to convert a regex into a DFA by repeatedly differentiating with respect to each symbol of the alphabet.

However, this is impractical. We ought to take into account that many symbols lead to the same derivative. For a given regex, ideally we’d like to partition the alphabet into equivalence classes, where characters belong to the same class exactly when they lead to the same derivatives. Then we need only compute one derivative per class.

Classy Regexes

Finding equivalence classes is infeasible, so we get by with an imperfect algorithm.

If the regex is just a character class, then we can achieve perfection. Divide the alphabet into the "haves" and the "have-nots": those characters within the class, and those without.

Otherwise, we recursively construct our partition of the alphabet. For the Kleene star and logical not, we use the partition of the underlying regex. The other operations have two regex arguments, and we can stay out of trouble by taking all pairs of intersections of character classes of their partitions.

For concatenation, when the first regex is not nullable, we can ignore the second regex.

classy :: Re -> [CharClass]
classy re = case re of
  OneOf (Pos cs) -> [Pos cs, Neg cs]
  OneOf (Neg cs) -> [Pos cs, Neg cs]
  Kleene r       -> classy r
  ReNot  r       -> classy r
  ReOr   rs      -> foldr1 allPairs $ classy <$> rs
  ReAnd  rs      -> foldr1 allPairs $ classy <$> rs
  ReCat (r:rt)
    | nullable r -> classy r `allPairs` classy (ReCat rt)
    | otherwise  -> classy r
  where allPairs r s = nub $ intersectCC <$> r <*> s

intersectCC :: CharClass -> CharClass -> CharClass
intersectCC (Pos xs) (Pos ys) = Pos $ intersect xs ys
intersectCC (Pos xs) (Neg ys) = Pos $ xs \\ ys
intersectCC (Neg xs) (Pos ys) = Pos $ ys \\ xs
intersectCC (Neg xs) (Neg ys) = Neg $ union xs ys

unionCC :: CharClass -> CharClass -> CharClass
unionCC (Pos xs) (Pos ys) = Pos $ union xs ys
unionCC (Pos xs) (Neg ys) = Neg $ ys \\ xs
unionCC (Neg xs) (Pos ys) = Neg $ xs \\ ys
unionCC (Neg xs) (Neg ys) = Neg $ intersect xs ys

For example:

classy abcsxyz

The function repCC returns a member of a character class. For Pos, we pick the first character in the list. This always succeeds because our code never calls repCC with Pos []. For Neg, we search for the smallest character not in the list. We assume Neg never applies to the whole alphabet.

repCC :: CharClass -> Char
repCC (Pos (h:_)) = h
repCC (Pos [])    = error "BUG! Pos [] should be filtered out."
repCC (Neg cs) | Just c <- find (`notElem` cs) [chr 0..] = c
               | otherwise = error "Neg with entire alphabet."

Deriving a DFA

Thanks to this partitioning algorithm, we can easily construct a DFA for a given regex \(f\) as follows.

We begin with the single start state \(f\). For each character class \(C\) in classy f, we pick a representative \(c\) and find \(\partial_c f\). Draw an arrow labeled \(C\) from \(f\) to \(\partial_c f\), creating the latter state if this is the first time we’ve seen it. Then repeat for every freshly created state. Mark all nullable regexes as accepting states.

jsEval "curl_module('../compiler/Map.ob')"
import Map as M

swap (a,b) = (b,a)

mkDfa :: Re -> ([(Int, Re)], Int, [Int], [((Int, Int), CharClass)])
mkDfa r = (swap <$> M.assocs states, states!r, as, collated) where
  collated = M.assocs $ M.fromListWith unionCC edges
  (states, edges) = explore (singleton r 0, []) r
  as = snd <$> filter (nullable . fst) (M.assocs states)
  explore gr q = foldl (goto q) gr $ filter (/= Pos []) $ classy q
  goto q (qs, ds) cc
    | Just w <- mlookup qc qs = (qs, mkEdge w)
    | otherwise = explore (M.insert qc sz qs, mkEdge sz) qc
    where qc = derive (repCC cc) q
          sz = M.size qs
          mkEdge dst = ((qs!q, dst), cc):ds

We number the regexes to simplify our interface; users of our engine need only deal with integers. We retain a map of integers to regexes in case the caller seeks a deeper understanding of our DFA.

Rendering Regexes

We place DFA nodes with a force-directed layout algorithm, which in turn uses a PCG pseudo-random number generator.

data PCG = PCG Word64 Word64
pcg a b = PCG (a + b') b' where b' = 2*b + 1

next :: PCG -> (Word, PCG)
next (PCG x inc) = (r, PCG x' inc) where
  x' = 6364136223846793005*x + inc
  r = (fromIntegral (x `xor` (x `shiftR` 18) `shiftR` 27) :: Word) `rotateR` (fromIntegral $ x `shiftR` 59)

fromPCG p = map (fromIntegral . fst) $ tail $ iterate (next . snd) (undefined, p)

split :: PCG -> (PCG, PCG)
split p = (PCG (lohi a b) (lohi c d), PCG (lohi e f) (lohi g h)) where
  [a,b,c,d,e,f,g,h] = take 8 $ fromPCG p
  lohi = Word64

rng = pcg 42 54

sim isEdge vs = nudge <$> vs where
  nudge (v, p0) = (v, foldr ($) p0 $ [zipWith (+) $ delta (isEdge v w) $ zipWith (-) p0 p1 | (w, p1) <- vs, v /= w])

delta sprung dp = (f*) <$> dp where
  d2 = sum $ map (^2) dp
  d = sqrt d2
  -- f = (0.1*) $ (900/(d*d) + bool 0 (10 - d) sprung)/d
  f = 100/(d2*d) + bool 0 (1/d - 0.1) sprung

We generate SVG to display a DFA:

normalize :: [(Int, [Double])] -> [(Int, [Double])]
normalize [(v, _)] = [(v, [150, 150])]
normalize vs = map (second $ zipWith rescale $ zip los his) vs where
  ps  = snd <$> vs
  los = foldr1 (zipWith min) ps
  his = foldr1 (zipWith max) ps
  rescale (x0, x1) x = (x - x0) / (x1 - x0) * 300

render k re = concat
  -- Nodes.
  [ circ 8 =<< snd <$> ps
  -- Accepting states.
  , circ 11 =<< snd <$> filter ((`elem` acceptable) . fst) ps
  -- Start state.
  , unwords
    [ "<polyline points='" ++ show (startx - 32) ++ "," ++ show (starty - 10)
    , show (startx - 20) ++ "," ++ show (starty - 10)
    , show (startx - 25) ++ "," ++ show starty
    , show (startx - 15) ++ "," ++ show starty ++ "'"
    , "fill='none' stroke='black' stroke-width='1.5' marker-end='url(#Triangle)'></polyline>"
    ]
  -- Edges.
  , drawEdge ps =<< es
  ] where
  (vs0, start, acceptable, es0) = mkDfa re
  -- Suppress reject state.
  (bads, vs) = first (map fst) $ partition ((noGood ==) . snd) vs0
  es = filter (\((_, w), _) -> w `notElem` bads) es0
  scatter = fst $ foldr randpoint ([], rng) $ fst <$> vs
  randpoint v (ps, r) = ((v, coord <$> [x, y]):ps, r'') where
    (x, r') = next r
    (y, r'') = next r'
    coord w = fromIntegral (w `mod` 10000) / 100.0
  ps = normalize $ iterate (sim isEdge) scatter !! k
  isEdge v w = maybe False (const True) $ lookup (v, w) es <|> lookup (w, v) es
  Just [startx, starty] = lookup start ps

drawEdge ps ((v, w), cc)
  | v == w = concat  -- Self-loop.
    [ "<path d='M ", show (x0 - 7.78), " ", show (y0 - 7.78)
    , " C ", show (x0 - 30), " ", show (y0 - 30)
    , " ", show (x0 + 30), " ", show (y0 - 30)
    , " ", show (x0 + 7.78), " ", show (y0 - 7.78)
    , "' fill='none' stroke='grey' marker-end='url(#Triangle)'></path>"
    , textAt [x0, y0 - 30] $ show cc
    ]
  | otherwise =
    arrow [x0 + 11*nx, y0 + 11*ny] [x1 - 15*nx, y1 - 15*ny] $ show cc
  where
  Just a@[x0, y0] = lookup v ps
  Just b@[x1, y1] = lookup w ps
  [dx, dy] = zipWith (-) b a
  [nx, ny] = (/ sqrt (dx*dx + dy*dy)) <$> [dx, dy]

svgStart = [r|
<svg xmlns="http://www.w3.org/2000/svg" id="svg" width="100%" height="30em"
viewBox="-60 -60 360 400"><defs>
<marker id="Triangle" viewBox="0 0 10 10" refX="1" refY="5" orient="auto" markerWidth="6" markerHeight="6"><path d="M 0 0 L 10 5 L 0 10 z"></path></marker>
</defs>
|]

draw s = jsEval $ "runme_out.insertAdjacentHTML('beforeend',`" ++ svgStart ++ s ++ "</svg>`);"

void = (*> pure ())

demo s = case parse regex "" s of
  Left _ -> putStrLn "parse error"
  Right re -> void $ draw $ render 16 re

demo "[a-e]([b-d]|[c-f]*)[0-3]"

Lastly, we hook up our code to the UI elements at the top of this page.

dfaClick = do
  s <- jsEval "re.value;"
  case parse regex "" s of
    Left _ -> jsEval "dfaDiv.innerHTML = `parse error`;"
    Right re -> jsEval $ "dfaDiv.innerHTML = `" ++
      svgStart ++ render 16 re ++ "</svg>" ++ "`;"

dfaClick

jsEval [r|dfaButton.addEventListener("click", (ev) => { repl.run("chat", ["Main"], "dfaClick"); });|]

Exponential worst-case

Converting regexes to DFAs can involve an exponential blow-up. For example, every (a|b) on the end of the following regex approximately doubles the number of states:

(a|b)*a(a|b)(a|b)(a|b)(a|b)

One might try forgoing the DFA, and only computing derivatives for the strings that we wish to match, but regex derivatives also blow up badly. This is a pity, as it suggests in general we may need NFAs.


Ben Lynn blynn@cs.stanford.edu 💡