parseInteger :: String -> Integer parseInteger = foldl (\n d -> 10*n + fromIntegral (fromDigit d)) 0 fromDigit c = ord c - ord '0' parseInteger "123" + parseInteger "456"
Parser Combinators
Suppose we write a function to parse an Integer
. What type should it have?
We might try String -> Integer
:
The read
function generalizes this, and it suffices for some
applications. Nevertheless, often we’d prefer a parser that also:
-
handles invalid input
-
allows trailing characters
Our design should take into account that we may want to parse any trailing characters with another function.
The hard way
One solution is a function that returns all valid ways to parse an Integer
at
the start of a string along with the remainder of the string. In particular, it
returns the empty list on failure.
String -> [(Integer, String)]
The reads
function generalizes this, and
Doaitse
Swierstra, Combinator Parsing: A Short Tutorial builds upon this approach.
However, we’ll insist on at most one correct way to parse a string. We also provide a mechanism for reporting an error message:
String -> Either String (Integer, String)
This type accommodates parser functions that take a string and return either an error message or the successfully parsed integer at the start of the string along with the remainder of the input.
With this design, we might write a function to parse a single digit as follows:
parseDigit = \case c:s' | '0' <= c, c <= '9' -> Right (fromDigit c, s') _ -> Left "want digit" parseDigit "2b" parseDigit "b2"
From this we can build a parser for a two-digit number:
parseTwoDigits s = do (a, s') <- parseDigit s (b, s'') <- parseDigit s' pure (10*a + b, s'') parseTwoDigits "221b" parseTwoDigits "2b21"
It works, but our design is infested with boilerplate. We must carefully manage the input string, ensuring that the unparsed remainder of one step is fed to the next. The pain worsens as we write more complex parsers.
We would prefer to write plumbing code like this once, and reuse it many times. This is easy in Haskell.
The easy way
We start by defining a data type to hold a parsing function, which we call a
Charser
, a sort of portmanteau meant to evoke a parser that handles one Char
at a time. We also generalize so that our parsers can return any desired type.
data Charser a = Charser { unCharser :: String -> Either String (a, String) }
Next, we define some fundamental parsers. The sat
parser returns the first
character if it satisfies a given predicate, and fails otherwise. THe eof
parser returns ()
if the given string is empty, and fails otherwise.
sat f = Charser \case h:t | f h -> Right (h, t) _ -> Left "unsat" eof = Charser \case [] -> Right ((), "") _ -> Left "want EOF"
We can easily get at the function via unCharser
:
unCharser (sat (== 'x')) "xyz" unCharser (sat (== 'x')) "abc" unCharser eof "" unCharser eof "abc"
Lastly, we define Functor
, Applicative
, Monad
, and Alternative
typeclass instances for the Charser
type. The Functor
instance is standard
stuff (that GHC can derive automatically) that lets us easily operate on a
parsed value, while the others describe how to compose two parsers. The Monad
and Applicative
instances meticulously thread the input string from the first
parser to the second parser, while the Alternative
instance builds a parser
that tries the first parser, and on failure tries the second parser.
These correpond to sequencing and the choice operator denoted by a vertical bar "|" in a BNF grammar, though in our framework we iterate through the choices in the order of appearance and take the first match.
instance Functor Charser where fmap f (Charser x) = Charser $ fmap (first f) . x instance Applicative Charser where pure a = Charser $ Right . (a,) f <*> x = Charser \s -> do (fun, t) <- unCharser f s (arg, u) <- unCharser x t pure (fun arg, u) instance Monad Charser where Charser f >>= g = Charser $ (uncurry (unCharser . g) =<<) . f return = pure instance Alternative Charser where empty = Charser $ const $ Left "" x <|> y = Charser \s -> either (const $ unCharser y s) Right $ unCharser x s
And we’re done! We have just defined parser combinators, which we compose to build complex parsers without needing to juggle bits and pieces of input strings. Our two-digit parser example becomes:
digitChar = sat \c -> '0' <= c && c <= '9' twoDigits = do a <- digitChar b <- digitChar pure $ 10*fromDigit a + fromDigit b unCharser twoDigits "221b" unCharser twoDigits "2b21"
Arbitrary-precision calculator
We demonstrate a more sophisticated parser that parses an integer arithmetic expression. Rather than store an expression in a tree data structure, we evaluate as we parse.
char :: Char -> Charser Char char = sat . (==) oneOf :: String -> Charser Char oneOf s = sat (`elem` s) chainl1 :: Charser a -> Charser (a -> a -> a) -> Charser a chainl1 p infixOp = go =<< p where go x = (go =<< ($x) <$> infixOp <*> p) <|> pure x chainr1 :: Charser a -> Charser (a -> a -> a) -> Charser a chainr1 p infixOp = (&) <$> p <*> (flip <$> infixOp <*> chainr1 p infixOp <|> pure id) sp :: Charser a -> Charser a sp p = p <* many (oneOf " \n") -- Eat trailing spaces. spch = sp . char op c f = spch c *> pure f line = sp (pure ()) *> expr <* eof where num = parseInteger <$> sp (some digitChar) atm = num <|> spch '(' *> expr <* spch ')' pow = chainr1 atm $ op '^' (^) una = foldr (.) id <$> many (op '+' id <|> op '-' negate) fac = una <*> pow term = chainl1 fac $ op '*' (*) <|> op '/' div <|> op '%' mod expr = chainl1 term $ op '+' (+) <|> op '-' (-) unCharser line "3 + 4 * 5"
Our line
parser closely resembles the grammar it parses:
num ::= ('0'|..|'9'|'.')+ atm ::= num | '(' expr ')' pow ::= atm ('^' atm )* una ::= ('+'|'-')* fac ::= una pow term ::= fac ( ('*'|'/'|'%') fac )* expr ::= term ( ('+'|'-') term )*
In fact, one might say the definition of line
is an
attribute grammar.
We omit the rules involving spaces, which are traditionally considered part of a separate phase called lexical analysis. With parser combinators, such divisions are blurrier, and it can be practical to combine the two phases into a single catch-all parsing phase.
We embed the parser in a calculator web widget. Type an expression into the textbox and press enter.
calc = do s <- jsEval "calcIn.value;" putStr $ case unCharser line s of Left err -> err Right (v, _) -> s ++ " = " ++ show v jsEval [r|runme_out.insertAdjacentHTML('beforeend', ` <textarea style="border: solid 4px #999999" id="calcIn" rows="1" cols="40">8888+88 / 8+8 / 8+8 * 8</textarea> <div id="calcOut"></div> `); calcIn.addEventListener("keydown", ev => { if (ev.keyCode == 13) { const r = repl.run("chat", ["Main"], "calc"); calcOut.innerText = r.out; calcIn.value = ""; ev.preventDefault(); } }); |]
The Fast Way
Suppose we have a complex grammar and a long input that fails to parse near the
end. Due to our implementation of (<|>)
, after the first failure, the parser
will backtrack and explore every alternative as it unwinds, before ultimately
giving up anyway.
Not only is this slow, but backtracking also discards the error thrown by the original failure, leading to error messages that are difficult to understand.
To address these issues, popular parser combinator libraries change the meaning
of (<|>)
. If the first alternative fails and no input has been consumed,
only then is the second alternative attempted. That is, if a parser processes
even a single character before failing, then the parser halts immediately with
failure.
Short-circuiting failure is indeed faster, but introduces a problem. At times
we really do want backtracking, that is, we want to undo consuming some of the
input before trying an alternative. Thus parser combinator libraries also
provide a try
function: it tries a given parser on the input, and on failure,
acts as if no input has been consumed so that alternatitves can be tried.
Naturally,
we should
use try
sparingly.
One way to implement this feature is to have parsers return a boolean indicating whether input has been consumed.
data Parser a = Parser { unParser :: String -> (Bool, Either String (a, String)) } instance Functor Parser where fmap f (Parser x) = Parser $ second (fmap $ first f) . x instance Applicative Parser where pure a = Parser $ (False,) . Right . (a,) f <*> x = Parser \s -> let (c0, mf) = unParser f s in case mf of Left err -> (c0, Left err) Right (fun, t) -> let (c1, mg) = unParser x t in case mg of Left err -> (c0 || c1, Left err) Right (arg, u) -> (c0 || c1, Right (fun arg, u)) instance Monad Parser where Parser f >>= g = Parser \s -> let (c0, mf) = f s in case mf of Left err -> (c0, Left err) Right (x, t) -> let (c1, mg) = unParser (g x) t in case mg of Left err -> (c0 || c1, Left err) r -> (c0 || c1, r) return = pure instance Alternative Parser where empty = Parser $ const (False, Left "") x <|> y = Parser \s -> let cm@(consumed, m) = unParser x s in case m of Left err -> if consumed then cm else unParser y s Right _ -> cm try p = Parser \s -> let cm@(_, m) = unParser p s in case m of Left err -> (False, Left err) _ -> cm sat f = Parser \case h:t | f h -> (True, Right (h, t)) _ -> (False, Left "unsat")
(We call this version Parser
instead of Charser
because my interprefer has
issues with redefining typeclasses.)
Here’s some examples showing how the new (<|>)
behaves:
char = sat . (==) string = mapM char unParser (string "foo" <|> string "bar") "foo" unParser (string "foo" <|> string "bar") "bar" unParser (string "foo" <|> string "bar") "baz" unParser (string "foo" <|> string "bar") "qux" unParser (string "qux" <|> string "quux") "quux" unParser (try (string "qux") <|> string "quux") "quux"
The evolution of parsing
As a child, I was baffled by parsers because I only understood loops. How does the computer handle arbitrarily nested parentheses?!
Parsers might have been fundamentally beyond my reach at the time, as some research suggests that 10-year-olds struggle with the concept of recursion, while 11-year-olds struggle with understanding recursive programs. As it happened, I was 12 or so when I read about recursive descent parsers. (For memory, the book was Herbert Schildt, Advanced Turbo C.) I was awestruck by their simplicity, elegance, and power. I felt I had leveled-up in real life. I now knew everything about parsers.
I was awestuck again on encountering the "red dragon book" at university. Forget those childish recursive descent parsers! Instead, the computer should do the hard work; the human should just write a handful of regular expressions and a BNF grammar. Not only is it simple and elegant, but there are strong guarantees: the parser generator produces efficient code and detects ambiguities. I felt I knew everything about parsers. This time for real.
I was awestruck yet again years later when I stumbled upon parser combinators. Forget those cumbersome parser generators! I had thought combinators were a mathematical curiosity; a fun alternative to Turing machines for studying computability. But parsing functions that return a parsing function in addition to performing other duties lead to amazingly succinct recursive descent parsers.
But the shocks didn’t stop. I stumbled upon parsing expression grammars, or PEGs, which can be parsed in linear time and are unambiguous. They can parse some non-context-free languages, and it is unknown if they can parse all context-free langauges.
And then there’s parsing with derivatives, which parse any context-free grammar in cubic time (on par with Earley’s algorithm or CYK). In brief, "even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points."
I now feel I know little about parsers. In practice, I let expedience be my guide. If I must have features guaranteed by the theory, or if parser combinators are unavailable, then I’ll resort to a parser generator. Otherwise, I’ll choose a parser combinator library such as Megaparsec.