data CharClass = Pos String | Neg String deriving (Eq, Ord) elemCC :: Char -> CharClass -> Bool elemCC c (Pos cs) = c `elem` cs elemCC c (Neg cs) = c `notElem` cs
Regex Calculus
In 1964, Brzozowski published Derivatives of Regular Expressions, which describes how to:
-
Directly convert a regular expression to a deterministic finite automaton (DFA). No NFAs needed.
-
Support rich regular expressions. In particular, we have logical AND and NOT, e.g.
[a-z]*&!(()|do|for|if|while)
. -
Instantly obtain small and often minimal DFAs in typical applications.
This elegant and powerful method was almost forgotten. It was rescued from obscurity by Owens, Reppy, and Turon, in Regular-expression derivatives reexamined. May regex derivatives one day earn their rightful place in computer science.
Character Classes
In traditional definitions of regular expressions, a literal character is a regular expression constant that represents a set, or language, containing just that single character.
The alphabet is typically assumed to be small enough so that, for example, we can cheaply iterate over each symbol of the alphabet. This is unrealistic for Unicode, and arguably for ASCII too. Thus instead of literal characters, we’ll take our regular expression constants to be character classes or character sets.
To avoid distractions, we make do with a simple implementation of character
classes. We represents a character class either as Pos [Char]
where the list
holds the characters of the class, or as Neg [Char]
where the list holds the
characters outside the class. The function elemCC
tests if a given character
is in a given character class.
For example:
elemCC 'a' $ Pos "abc" elemCC 'z' $ Pos "abc" elemCC 'a' $ Neg "abc" elemCC 'z' $ Neg "abc"
We write [c0c1…]
for the set of characters c0, c1, …
, and
[^c0c1…]
for the complement of this set. We may omit the square
brackets for singleton sets, and we use hyphens to denote ranges of characters.
For example, x
means [x]
and [a-d5-9]
means [abcd56789]
.
The empty character class \([]\) rejects all strings; even the empty string.
nubSort [] = [] nubSort (x:xt) = nubSort (filter (< x) xt) ++ [x] ++ nubSort (filter (> x) xt) meta :: String meta = "\\|&!*.[]()" instance Show CharClass where show cc = case cc of Pos [] -> "[]" Neg [] -> "." Pos [c] -> showSingle c Neg [c] -> concat ["[^", showSingle c, "]"] Pos s -> concat ["[" , f $ nubSort s, "]"] Neg s -> concat ["[^", f $ nubSort s, "]"] where showSingle c | c `elem` meta = '\\':[c] | otherwise = [c] f "" = "" f [c] = [c] f (c:t) = rangeFinder c c t rangeFinder lo hi (h:t) | h == succ hi = rangeFinder lo h t rangeFinder lo hi t | lo == hi = lo:f t | succ lo == hi = lo:hi:f t | succ (succ lo) == hi = [lo..hi] ++ f t | otherwise = lo:'-':hi:f t
For example:
Pos "packmyboxwithfivedozenliquorjugs" Neg "helloworldmn"
A more sophisticated data structure might maintain sets of ranges.
Regexes, irregularly
Apart from constants, regexes are built from other regexes \(r\) and \(s\):
-
\(rs\): concatenation.
-
\(r\mid s\): logical or (alternation); the union of the two languages.
-
\(r\mbox{*}\): Kleene closure; zero or more strings of \(r\) concatenated together.
-
\(r\& s\): logical and; the intersection of the two languages.
-
\(!r\): logical not (complement); accepts a string if and only if \(r\) rejects it.
Let \(() = []\mbox{*}\), that is, the Kleene closure of the empty language. Observe \(()\) accepts only the empty string,
All the binary operators are associative, so we define a regex data type that represents chains of the same operator with list of the arguments. This simplifies our code later, when we sort arguments for commutative operators, and remove duplicates or identities.
data Re = OneOf CharClass | Kleene Re | ReCat [Re] | ReOr [Re] | ReAnd [Re] | ReNot Re deriving (Eq, Ord) instance Show Re where showsPrec p = \case OneOf s -> shows s Kleene (OneOf (Pos [])) -> ("()"++) Kleene r -> showsPrec 4 r . ("*"++) ReNot r -> showParen (p > 3) $ ('!':) . showsPrec 3 r ReCat rs -> showParen (p > 2) $ foldr (.) id $ showsPrec 2 <$> rs ReAnd rs -> showParen (p > 1) $ foldr (.) id $ intersperse ('&':) $ showsPrec 1 <$> rs ReOr rs -> showParen (p > 0) $ foldr (.) id $ intersperse ('|':) $ showsPrec 0 <$> rs
If my compiler supported it, the following would be patterns. But they’re still useful as definitions.
-- The regex `()`. The language containing only the empty string. eps :: Re eps = Kleene noGood -- The regex `[]`. The empty language. noGood :: Re noGood = OneOf $ Pos [] -- The regex `.*`. The language containing everything. allGood :: Re allGood = Kleene $ OneOf $ Neg []
Regex exercises
Let \(f\) be a regex.
1. Does \(f\) accept the empty string?
If so, we say \(f\) is nullable.
Character sets never accept the empty string. The Kleene closure always accepts the empty string. Concatenation does if and only if \(r\) and \(s\) do. The logical operations commute with nullability.
nullable :: Re -> Bool nullable = \case OneOf _ -> False Kleene _ -> True ReCat rs -> all nullable rs ReOr rs -> any nullable rs ReAnd rs -> all nullable rs ReNot r -> not $ nullable r
For example:
nullable eps nullable noGood nullable allGood
2. What regex do we get after feeding the character \(c\) to \(f\)? In other words, what regex accepts a string \(s\) precisely when \(f\) accepts \(c\) followed by \(s\)?
We call this regex the derivative of the regex \(f\)
with respect to the character \(c\), and write \(\partial_c f\).
For example, \(\partial_a\) ab*c|d*e*f|g*ah
= b*c|h
.
For a character set, the answer is \(()\) if \(c\) is a member. Otherwise
it’s \([]\). For example, \(\partial_a\) [abc]
= ()
and \(\partial_a\) [xyz]
= []
.
For the other cases, let us follow Lagrange and use a prime mark to denote a derivative (with respect to \(c\)). We find \(r'\) and \(s'\), that is, we recursively answer the question for \(r\) and \(s\). Then the logical operations commute with taking derivatives:
-
\((r\mid s)' = r' \mid s'\)
-
\((r\& s)' = r' \& s'\)
-
\((!r)' = !r'\)
For the Kleene closure, we find:
-
\((r\mbox{*})' = r'r\mbox{*}\)
The trickiest is concatenation: \(rs\). If \(r\) is nullable, that is, if \(r\) accepts the empty string, then the derivative is \(r’s\mid s'\), otherwise it is just \(r’s\).
deriveNaive :: Char -> Re -> Re deriveNaive c = go where go = \case OneOf cc | elemCC c cc -> eps | otherwise -> noGood Kleene r -> ReCat [go r, Kleene r] ReCat (r:rt) | nullable r -> ReOr [ReCat $ go r : rt, go $ ReCat rt] | otherwise -> ReCat $ go r : rt ReAnd rs -> ReAnd $ go <$> rs ReOr rs -> ReOr $ go <$> rs ReNot r -> ReNot $ go r
For example:
abcsxyz = ReOr [Kleene $ s "abc", ReCat [s "x", s "y", s "z"]] where s = OneOf . Pos abcsxyz deriveNaive 'a' abcsxyz deriveNaive 'd' abcsxyz deriveNaive 'x' abcsxyz deriveNaive 'y' $ deriveNaive 'x' abcsxyz
It works, but the derivatives grow quickly, and are full of dead code. We address this by folding constants as we derive:
mkCat:: [Re] -> Re mkCat xs | noGood `elem` zs = noGood | null zs = eps | [z] <- zs = z | otherwise = ReCat zs where zs = filter (/= eps) flat flat = concatMap deCat xs deCat (ReCat rs) = rs deCat r = [r] mkOr :: [Re] -> Re mkOr xs | allGood `elem` zs = allGood | null zs = noGood | [z] <- zs = z | otherwise = ReOr zs where zs = nubSort $ filter (/= noGood) flat flat = concatMap deOr xs deOr (ReOr rs) = rs deOr r = [r] mkAnd :: [Re] -> Re mkAnd xs | noGood `elem` zs = noGood | null zs = allGood | [z] <- zs = z | otherwise = ReAnd zs where zs = nubSort $ filter (/= allGood) flat flat = concatMap deAnd xs deAnd (ReAnd rs) = rs deAnd r = [r] mkKleene :: Re -> Re mkKleene (Kleene s) = mkKleene s mkKleene r = Kleene r mkNot :: Re -> Re mkNot (OneOf (Pos [])) = allGood mkNot (ReNot s) = s mkNot r = ReNot r derive :: Char -> Re -> Re derive c re = case re of OneOf cc | elemCC c cc -> eps | otherwise -> noGood Kleene r -> mkCat [derive c r, mkKleene r] ReCat (r : rt) | nullable r -> mkOr [mkCat $ derive c r : rt, derive c $ ReCat rt] | otherwise -> mkCat $ derive c r : rt ReAnd rs -> mkAnd $ derive c <$> rs ReOr rs -> mkOr $ derive c <$> rs ReNot r -> mkNot $ derive c r
Much better:
derive 'a' abcsxyz derive 'd' abcsxyz derive 'x' abcsxyz derive 'y' $ derive 'x' abcsxyz
Reading Regexes
We write parser combinators to parse regexes. We use the above constructors to get smaller regexes, and also so it’s more likely that regexes for the same language look the same.
We deviate from conventional syntax slightly. We add the metacharacters &
and
!
for logical and and logical not. We lack +
and ?
, but:
-
r+
is equivalent to!()&r*
-
r?
is equivalent to()|r
jsEval "curl_module('../compiler/Charser.ob')"
import Charser regex :: Charser Re regex = top <* eof where top = mkOr <$> ands `sepBy` char '|' ands = mkAnd <$> cats `sepBy` char '&' cats = mkCat <$> many nots nots = (char '!' >> mkNot <$> nots) <|> (atm >>= kle) atm = chCl <|> const (OneOf $ Neg []) <$> char '.' <|> char '(' *> top <* char ')' kle :: Re -> Charser Re kle r = char '*' *> kle (mkKleene r) <|> pure r chCl = fmap OneOf $ (Pos . (:[]) <$> single) <|> char '[' *> parity <* char ']' parity = (char '^' *> pure Neg <|> pure Pos) <*> (nubSort . concat <$> many rng) rng = alphaNumChar >>= \lo -> hiEnd lo <|> pure [lo] single = char '\\' *> oneOf meta <|> noneOf meta hiEnd lo = do hi <- char '-' *> alphaNumChar when (hi < lo) $ Charser $ const $ Left "invalid range end" pure [lo..hi]
For example:
parse regex "" "[he-ll-oworld]*&![]*"
Regexes restated
A regex operation is completely defined by its nullability and its derivative. The following function determines if a given regex accepts a given string:
accepts :: Re -> String -> Bool accepts re "" = nullable re accepts re (c:s) = accepts (derive c re) s
For example:
accepts abcsxyz <$> [ "cccbbacacbca" , "abcd" , "xyz" , "abcxyz" ]
Thus we can view matching a string against a regex as a running a deterministic automaton where each state is a regex, and each transition is differentiation with respect to a character, and nullable regexes are precisely the accepting states:
circ r [x, y] = "<circle r='" ++ show r ++ "' cx='" ++ shows x "' cy='" ++ shows y "' fill='none' stroke='black'></circle>" textAt [x, y] s = concat [ "<text x='" ++ show x ++ "' y='" ++ show y ++ "'" , " text-anchor='middle' alignment-baseline='central' style='font-size:8px;font-family:monospace'>" ++ s ++ "</text>" ] arrow :: [Double] -> [Double] -> String -> String arrow [x0, y0] [x1, y1] lbl = concat [ "<line x1='" ++ shows x0 "' y1='" ++ shows y0 "'" , "x2='" ++ shows x1 "' y2='" ++ shows y1 "'" , "fill='none' stroke='lightgrey' stroke-width='1' marker-end='url(#Triangle)'></line>" , textAt [(x0 + x1)/2, (y0 + y1)/2] lbl ] let Right re = parse regex "" "ab*|c*ad" in jsEval $ concat [ "oneEdgeExample.innerHTML = `" , [r|<svg xmlns="http://www.w3.org/2000/svg" width="100%" viewBox="-60 -60 420 120"><defs><marker id="Triangle" viewBox="0 0 10 10" refX="1" refY="5" orient="auto" markerWidth="6" markerHeight="6"><path d="M 0 0 L 10 5 L 0 10 z"></path></marker></defs>|] , circ 20 [50, 0] , textAt [50, 0] $ show re , circ 24 [250, 0] , circ 20 [250, 0] , textAt [250, 0] $ show $ derive 'a' re , arrow [70, 0] [221, 0] "a" , "</svg>`;" ]
But are there a finite number of states? What happens if we repeatedly differentiate with respect to all possible symbols?
It turns out that eventually we’ll see the same old set of regexes, provided we exploit knowing that logical disjunction is:
-
idempotent: \(r\mid r = r\)
-
commutative: \(r\mid s = s\mid r\)
-
associative: \((r\mid s)\mid t = r\mid (s\mid t)\)
This makes sense intuitively, because deriving usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.
Our smart constant-folding constructors incorporate the above rules; indeed they apply even more rules, so we are more likely to identify equivalent regexes and produce fewer states. In principle we could use them to convert a regex into a DFA by repeatedly differentiating with respect to each symbol of the alphabet.
However, this is impractical. We ought to take into account that many symbols lead to the same derivative. For a given regex, ideally we’d like to partition the alphabet into equivalence classes, where characters belong to the same class exactly when they lead to the same derivatives. Then we need only compute one derivative per class.
Classy Regexes
Finding equivalence classes is infeasible, so we get by with an imperfect algorithm.
If the regex is just a character class, then we can achieve perfection. Divide the alphabet into the "haves" and the "have-nots": those characters within the class, and those without.
Otherwise, we recursively construct our partition of the alphabet. For the Kleene star and logical not, we use the partition of the underlying regex. The other operations have two regex arguments, and we can stay out of trouble by taking all pairs of intersections of character classes of their partitions.
For concatenation, when the first regex is not nullable, we can ignore the second regex.
classy :: Re -> [CharClass] classy re = case re of OneOf (Pos cs) -> [Pos cs, Neg cs] OneOf (Neg cs) -> [Pos cs, Neg cs] Kleene r -> classy r ReNot r -> classy r ReOr rs -> foldr1 allPairs $ classy <$> rs ReAnd rs -> foldr1 allPairs $ classy <$> rs ReCat (r:rt) | nullable r -> classy r `allPairs` classy (ReCat rt) | otherwise -> classy r where allPairs r s = nub $ intersectCC <$> r <*> s intersectCC :: CharClass -> CharClass -> CharClass intersectCC (Pos xs) (Pos ys) = Pos $ intersect xs ys intersectCC (Pos xs) (Neg ys) = Pos $ xs \\ ys intersectCC (Neg xs) (Pos ys) = Pos $ ys \\ xs intersectCC (Neg xs) (Neg ys) = Neg $ union xs ys unionCC :: CharClass -> CharClass -> CharClass unionCC (Pos xs) (Pos ys) = Pos $ union xs ys unionCC (Pos xs) (Neg ys) = Neg $ ys \\ xs unionCC (Neg xs) (Pos ys) = Neg $ xs \\ ys unionCC (Neg xs) (Neg ys) = Neg $ intersect xs ys
For example:
classy abcsxyz
The function repCC
returns a member of a character class.
For Pos
, we pick the first character in the list.
This always succeeds because our code never calls repCC
with Pos []
.
For Neg
, we search for the smallest character not in the list.
We assume Neg
never applies to the whole alphabet.
repCC :: CharClass -> Char repCC (Pos (h:_)) = h repCC (Pos []) = error "BUG! Pos [] should be filtered out." repCC (Neg cs) | Just c <- find (`notElem` cs) [chr 0..] = c | otherwise = error "Neg with entire alphabet."
Deriving a DFA
Thanks to this partitioning algorithm, we can easily construct a DFA for a given regex \(f\) as follows.
We begin with the single start state \(f\). For each character class \(C\) in
classy f
, we pick a representative \(c\) and find \(\partial_c f\). Draw an
arrow labeled \(C\) from \(f\) to \(\partial_c f\), creating the latter state
if this is the first time we’ve seen it. Then repeat for every freshly created
state. Mark all nullable regexes as accepting states.
jsEval "curl_module('../compiler/Map.ob')"
import Map as M swap (a,b) = (b,a) mkDfa :: Re -> ([(Int, Re)], Int, [Int], [((Int, Int), CharClass)]) mkDfa r = (swap <$> M.assocs states, states!r, as, collated) where collated = M.assocs $ M.fromListWith unionCC edges (states, edges) = explore (singleton r 0, []) r as = snd <$> filter (nullable . fst) (M.assocs states) explore gr q = foldl (goto q) gr $ filter (/= Pos []) $ classy q goto q (qs, ds) cc | Just w <- mlookup qc qs = (qs, mkEdge w) | otherwise = explore (M.insert qc sz qs, mkEdge sz) qc where qc = derive (repCC cc) q sz = M.size qs mkEdge dst = ((qs!q, dst), cc):ds
We number the regexes to simplify our interface; users of our engine need only deal with integers. We retain a map of integers to regexes in case the caller seeks a deeper understanding of our DFA.
Rendering Regexes
We place DFA nodes with a force-directed layout algorithm, which in turn uses a PCG pseudo-random number generator.
data PCG = PCG Word64 Word64 pcg a b = PCG (a + b') b' where b' = 2*b + 1 next :: PCG -> (Word, PCG) next (PCG x inc) = (r, PCG x' inc) where x' = 6364136223846793005*x + inc r = (fromIntegral (x `xor` (x `shiftR` 18) `shiftR` 27) :: Word) `rotateR` (fromIntegral $ x `shiftR` 59) fromPCG p = map (fromIntegral . fst) $ tail $ iterate (next . snd) (undefined, p) split :: PCG -> (PCG, PCG) split p = (PCG (lohi a b) (lohi c d), PCG (lohi e f) (lohi g h)) where [a,b,c,d,e,f,g,h] = take 8 $ fromPCG p lohi = Word64 rng = pcg 42 54 sim isEdge vs = nudge <$> vs where nudge (v, p0) = (v, foldr ($) p0 $ [zipWith (+) $ delta (isEdge v w) $ zipWith (-) p0 p1 | (w, p1) <- vs, v /= w]) delta sprung dp = (f*) <$> dp where d2 = sum $ map (^2) dp d = sqrt d2 -- f = (0.1*) $ (900/(d*d) + bool 0 (10 - d) sprung)/d f = 100/(d2*d) + bool 0 (1/d - 0.1) sprung
We generate SVG to display a DFA:
normalize :: [(Int, [Double])] -> [(Int, [Double])] normalize [(v, _)] = [(v, [150, 150])] normalize vs = map (second $ zipWith rescale $ zip los his) vs where ps = snd <$> vs los = foldr1 (zipWith min) ps his = foldr1 (zipWith max) ps rescale (x0, x1) x = (x - x0) / (x1 - x0) * 300 render k re = concat -- Nodes. [ circ 8 =<< snd <$> ps -- Accepting states. , circ 11 =<< snd <$> filter ((`elem` acceptable) . fst) ps -- Start state. , unwords [ "<polyline points='" ++ show (startx - 32) ++ "," ++ show (starty - 10) , show (startx - 20) ++ "," ++ show (starty - 10) , show (startx - 25) ++ "," ++ show starty , show (startx - 15) ++ "," ++ show starty ++ "'" , "fill='none' stroke='black' stroke-width='1.5' marker-end='url(#Triangle)'></polyline>" ] -- Edges. , drawEdge ps =<< es ] where (vs0, start, acceptable, es0) = mkDfa re -- Suppress reject state. (bads, vs) = first (map fst) $ partition ((noGood ==) . snd) vs0 es = filter (\((_, w), _) -> w `notElem` bads) es0 scatter = fst $ foldr randpoint ([], rng) $ fst <$> vs randpoint v (ps, r) = ((v, coord <$> [x, y]):ps, r'') where (x, r') = next r (y, r'') = next r' coord w = fromIntegral (w `mod` 10000) / 100.0 ps = normalize $ iterate (sim isEdge) scatter !! k isEdge v w = maybe False (const True) $ lookup (v, w) es <|> lookup (w, v) es Just [startx, starty] = lookup start ps drawEdge ps ((v, w), cc) | v == w = concat -- Self-loop. [ "<path d='M ", show (x0 - 7.78), " ", show (y0 - 7.78) , " C ", show (x0 - 30), " ", show (y0 - 30) , " ", show (x0 + 30), " ", show (y0 - 30) , " ", show (x0 + 7.78), " ", show (y0 - 7.78) , "' fill='none' stroke='grey' marker-end='url(#Triangle)'></path>" , textAt [x0, y0 - 30] $ show cc ] | otherwise = arrow [x0 + 11*nx, y0 + 11*ny] [x1 - 15*nx, y1 - 15*ny] $ show cc where Just a@[x0, y0] = lookup v ps Just b@[x1, y1] = lookup w ps [dx, dy] = zipWith (-) b a [nx, ny] = (/ sqrt (dx*dx + dy*dy)) <$> [dx, dy] svgStart = [r| <svg xmlns="http://www.w3.org/2000/svg" id="svg" width="100%" height="30em" viewBox="-60 -60 360 400"><defs> <marker id="Triangle" viewBox="0 0 10 10" refX="1" refY="5" orient="auto" markerWidth="6" markerHeight="6"><path d="M 0 0 L 10 5 L 0 10 z"></path></marker> </defs> |] draw s = jsEval $ "runme_out.insertAdjacentHTML('beforeend',`" ++ svgStart ++ s ++ "</svg>`);" void = (*> pure ()) demo s = case parse regex "" s of Left _ -> putStrLn "parse error" Right re -> void $ draw $ render 16 re demo "[a-e]([b-d]|[c-f]*)[0-3]"
Lastly, we hook up our code to the UI elements at the top of this page.
dfaClick = do s <- jsEval "re.value;" case parse regex "" s of Left _ -> jsEval "dfaDiv.innerHTML = `parse error`;" Right re -> jsEval $ "dfaDiv.innerHTML = `" ++ svgStart ++ render 16 re ++ "</svg>" ++ "`;" dfaClick jsEval [r|dfaButton.addEventListener("click", (ev) => { repl.run("chat", ["Main"], "dfaClick"); });|]
Exponential worst-case
Converting
regexes to DFAs can involve an exponential blow-up. For example,
every (a|b)
on the end of the following regex approximately doubles the
number of states:
(a|b)*a(a|b)(a|b)(a|b)(a|b)
One might try forgoing the DFA, and only computing derivatives for the strings that we wish to match, but regex derivatives also blow up badly. This is a pity, as it suggests in general we may need NFAs.