Parser Combinators

Suppose we write a function to parse an Integer. What type should it have?

We might try String -> Integer:

parseInteger :: String -> Integer
parseInteger = foldl (\n d -> 10*n + fromIntegral (fromDigit d)) 0

fromDigit c = ord c - ord '0'

parseInteger "123" + parseInteger "456"

The read function generalizes this, and it suffices for some applications. Nevertheless, often we’d prefer a parser that also:

handles invalid input
allows trailing characters

Our design should take into account that we may want to parse any trailing characters with another function.

The hard way

One solution is a function that returns all valid ways to parse an Integer at the start of a string along with the remainder of the string. In particular, it returns the empty list on failure.

String -> [(Integer, String)]

The reads function generalizes this, and Doaitse Swierstra, Combinator Parsing: A Short Tutorial builds upon this approach.

However, we’ll insist on at most one correct way to parse a string. We also provide a mechanism for reporting an error message:

String -> Either String (Integer, String)

This type accommodates parser functions that take a string and return either an error message or the successfully parsed integer at the start of the string along with the remainder of the input.

With this design, we might write a function to parse a single digit as follows:

parseDigit = \case
  c:s' | '0' <= c, c <= '9' -> Right (fromDigit c, s')
  _ -> Left "want digit"

parseDigit "2b"
parseDigit "b2"

From this we can build a parser for a two-digit number:

parseTwoDigits s = do
  (a, s') <- parseDigit s
  (b, s'') <- parseDigit s'
  pure (10*a + b, s'')

parseTwoDigits "221b"
parseTwoDigits "2b21"

It works, but our design is infested with boilerplate. We must carefully manage the input string, ensuring that the unparsed remainder of one step is fed to the next. The pain worsens as we write more complex parsers.

We would prefer to write plumbing code like this once, and reuse it many times. This is easy in Haskell.

The easy way

We start by defining a data type to hold a parsing function, which we call a Charser, a sort of portmanteau meant to evoke a parser that handles one Char at a time. We also generalize so that our parsers can return any desired type.

data Charser a = Charser { unCharser :: String -> Either String (a, String) }

Next, we define some fundamental parsers. The sat parser returns the first character if it satisfies a given predicate, and fails otherwise. THe eof parser returns () if the given string is empty, and fails otherwise.

sat f = Charser \case
  h:t | f h -> Right (h, t)
  _ -> Left "unsat"

eof = Charser \case
  [] -> Right ((), "")
  _ -> Left "want EOF"

We can easily get at the function via unCharser:

unCharser (sat (== 'x')) "xyz"
unCharser (sat (== 'x')) "abc"
unCharser eof ""
unCharser eof "abc"

Lastly, we define Functor, Applicative, Monad, and Alternative typeclass instances for the Charser type. The Functor instance is standard stuff (that GHC can derive automatically) that lets us easily operate on a parsed value, while the others describe how to compose two parsers. The Monad and Applicative instances meticulously thread the input string from the first parser to the second parser, while the Alternative instance builds a parser that tries the first parser, and on failure tries the second parser.

These correpond to sequencing and the choice operator denoted by a vertical bar "|" in a BNF grammar, though in our framework we iterate through the choices in the order of appearance and take the first match.

instance Functor Charser where fmap f (Charser x) = Charser $ fmap (first f) . x
instance Applicative Charser where
  pure a = Charser $ Right . (a,)
  f <*> x = Charser \s -> do
    (fun, t) <- unCharser f s
    (arg, u) <- unCharser x t
    pure (fun arg, u)
instance Monad Charser where
  Charser f >>= g = Charser $ (uncurry (unCharser . g) =<<) . f
  return = pure
instance Alternative Charser where
  empty = Charser $ const $ Left ""
  x <|> y = Charser \s -> either (const $ unCharser y s) Right $ unCharser x s

And we’re done! We have just defined parser combinators, which we compose to build complex parsers without needing to juggle bits and pieces of input strings. Our two-digit parser example becomes:

digitChar = sat \c -> '0' <= c && c <= '9'

twoDigits = do
  a <- digitChar
  b <- digitChar
  pure $ 10*fromDigit a + fromDigit b

unCharser twoDigits "221b"
unCharser twoDigits "2b21"

Arbitrary-precision calculator

We demonstrate a more sophisticated parser that parses an integer arithmetic expression. Rather than store an expression in a tree data structure, we evaluate as we parse.

char :: Char -> Charser Char
char = sat . (==)

oneOf :: String -> Charser Char
oneOf s = sat (`elem` s)

chainl1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainl1 p infixOp = go =<< p where
  go x = (go =<< ($x) <$> infixOp <*> p) <|> pure x

chainr1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainr1 p infixOp = (&) <$> p <*>
  (flip <$> infixOp <*> chainr1 p infixOp <|> pure id)

sp :: Charser a -> Charser a
sp p = p <* many (oneOf " \n")  -- Eat trailing spaces.

spch = sp . char

op c f = spch c *> pure f

line = sp (pure ()) *> expr <* eof where
  num  = parseInteger <$> sp (some digitChar)
  atm  = num <|> spch '(' *> expr <* spch ')'
  pow  = chainr1 atm $ op '^' (^)
  una  = foldr (.) id <$> many (op '+' id <|> op '-' negate)
  fac  = una <*> pow
  term = chainl1 fac $ op '*' (*) <|> op '/' div <|> op '%' mod
  expr = chainl1 term $ op '+' (+) <|> op '-' (-)

unCharser line "3 + 4 * 5"

Our line parser closely resembles the grammar it parses:

num    ::= ('0'|..|'9'|'.')+
atm    ::= num | '(' expr ')'
pow    ::= atm ('^' atm )*
una    ::= ('+'|'-')*
fac    ::= una pow
term   ::= fac ( ('*'|'/'|'%') fac )*
expr   ::= term ( ('+'|'-') term )*

In fact, one might say the definition of line is an attribute grammar.

We omit the rules involving spaces, which are traditionally considered part of a separate phase called lexical analysis. With parser combinators, such divisions are blurrier, and it can be practical to combine the two phases into a single catch-all parsing phase.

We embed the parser in a calculator web widget. Type an expression into the textbox and press enter.

calc = do
  s <- jsEval "calcIn.value;"
  putStr $ case unCharser line s of
    Left err -> err
    Right (v, _) -> s ++ " = " ++ show v

jsEval [r|runme_out.insertAdjacentHTML('beforeend', `
<textarea style="border: solid 4px #999999" id="calcIn" rows="1" cols="40">8888+88 / 8+8 / 8+8 * 8</textarea>
<div id="calcOut"></div>
`);

calcIn.addEventListener("keydown", ev => {
  if (ev.keyCode == 13) {
    const r = repl.run("chat", ["Main"], "calc");
    calcOut.innerText = r.out;
    calcIn.value = "";
    ev.preventDefault();
  }
});
|]

The Fast Way

Suppose we have a complex grammar and a long input that fails to parse near the end. Due to our implementation of (<|>), after the first failure, the parser will backtrack and explore every alternative as it unwinds, before ultimately giving up anyway.

Not only is this slow, but backtracking also discards the error thrown by the original failure, leading to error messages that are difficult to understand.

To address these issues, popular parser combinator libraries change the meaning of (<|>). If the first alternative fails and no input has been consumed, only then is the second alternative attempted. That is, if a parser processes even a single character before failing, then the parser halts immediately with failure.

Short-circuiting failure is indeed faster, but introduces a problem. At times we really do want backtracking, that is, we want to undo consuming some of the input before trying an alternative. Thus parser combinator libraries also provide a try function: it tries a given parser on the input, and on failure, acts as if no input has been consumed so that alternatitves can be tried. Naturally, we should use try sparingly.

One way to implement this feature is to have parsers return a boolean indicating whether input has been consumed.

data Parser a = Parser
  { unParser :: String -> (Bool, Either String (a, String)) }

instance Functor Parser where
  fmap f (Parser x) = Parser $ second (fmap $ first f) . x
instance Applicative Parser where
  pure a = Parser $ (False,) . Right . (a,)
  f <*> x = Parser \s -> let (c0, mf) = unParser f s in case mf of
    Left err -> (c0, Left err)
    Right (fun, t) -> let (c1, mg) = unParser x t in case mg of
      Left err -> (c0 || c1, Left err)
      Right (arg, u) -> (c0 || c1, Right (fun arg, u))
instance Monad Parser where
  Parser f >>= g = Parser \s -> let (c0, mf) = f s in case mf of
    Left err -> (c0, Left err)
    Right (x, t) -> let (c1, mg) = unParser (g x) t in case mg of
      Left err -> (c0 || c1, Left err)
      r -> (c0 || c1, r)
  return = pure
instance Alternative Parser where
  empty = Parser $ const (False, Left "")
  x <|> y = Parser \s -> let cm@(consumed, m) = unParser x s in case m of
    Left err -> if consumed then cm else unParser y s
    Right _ -> cm

try p = Parser \s -> let cm@(_, m) = unParser p s in case m of
  Left err -> (False, Left err)
  _ -> cm

sat f = Parser \case
  h:t | f h -> (True, Right (h, t))
  _ -> (False, Left "unsat")

(We call this version Parser instead of Charser because my interprefer has issues with redefining typeclasses.)

Here’s some examples showing how the new (<|>) behaves:

char = sat . (==)
string = mapM char

unParser (string "foo" <|> string "bar") "foo"
unParser (string "foo" <|> string "bar") "bar"
unParser (string "foo" <|> string "bar") "baz"
unParser (string "foo" <|> string "bar") "qux"
unParser (string "qux" <|> string "quux") "quux"
unParser (try (string "qux") <|> string "quux") "quux"

The evolution of parsing

As a child, I was baffled by parsers because I only understood loops. How does the computer handle arbitrarily nested parentheses?!

Parsers might have been fundamentally beyond my reach at the time, as some research suggests that 10-year-olds struggle with the concept of recursion, while 11-year-olds struggle with understanding recursive programs. As it happened, I was 12 or so when I read about recursive descent parsers. (For memory, the book was Herbert Schildt, Advanced Turbo C.) I was awestruck by their simplicity, elegance, and power. I felt I had leveled-up in real life. I now knew everything about parsers.

I was awestuck again on encountering the "red dragon book" at university. Forget those childish recursive descent parsers! Instead, the computer should do the hard work; the human should just write a handful of regular expressions and a BNF grammar. Not only is it simple and elegant, but there are strong guarantees: the parser generator produces efficient code and detects ambiguities. I felt I knew everything about parsers. This time for real.

I was awestruck yet again years later when I stumbled upon parser combinators. Forget those cumbersome parser generators! I had thought combinators were a mathematical curiosity; a fun alternative to Turing machines for studying computability. But parsing functions that return a parsing function in addition to performing other duties lead to amazingly succinct recursive descent parsers.

But the shocks didn’t stop. I stumbled upon parsing expression grammars, or PEGs, which can be parsed in linear time and are unambiguous. They can parse some non-context-free languages, and it is unknown if they can parse all context-free langauges.

And then there’s parsing with derivatives, which parse any context-free grammar in cubic time (on par with Earley’s algorithm or CYK). In brief, "even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points."

I now feel I know little about parsers. In practice, I let expedience be my guide. If I must have features guaranteed by the theory, or if parser combinators are unavailable, then I’ll resort to a parser generator. Otherwise, I’ll choose a parser combinator library such as Megaparsec.

Ben Lynn blynn@cs.stanford.edu 💡