Each Yapps grammar has a name, a list of tokens, and a set of production rules. A grammar named X will be used to produce a parser named X and a scanner anmed XScanner. As in Python, names are case sensitive, start with a letter, and contain letters, numbers, and underscores (_).
There are three kinds of tokens in Yapps: named, inline, and ignored. As their name implies, named tokens are given a name, using the token construct: token name : regexp. In a rule, the token can be matched by using the name. Inline tokens are regular expressions that are used in rules without being declared. Ignored tokens are declared using the ignore construct: ignore: regexp. These tokens are ignored by the scanner, and are not seen by the parser. Often whitespace is an ignored token. The regular expressions used to define tokens should use the syntax defined in the re module, so some symbols may have to be backslashed.
Production rules in Yapps have a name and a pattern to match. If the
rule is parameterized, the name should be followed by a list of
parameter names in <<...>>
. A pattern can be a simple pattern
or a compound pattern. Simple patterns are the name of a named token,
a regular expression in quotes (inline token), the name of a
production rule (followed by arguments in <<...>>
, if the rule
has parameters), and single line Python statements ({{...}}
).
Compound patterns are sequences (A B C ...
), choices (
A | B | C | ...
), options ([...]
), zero-or-more repetitions
(...*
), and one-or-more repetitions (...+
). Like
regular expressions, repetition operators have a higher precedence
than sequences, and sequences have a higher precedence than choices.
Whenever {{...}}
is used, a legal one-line Python statement
should be put inside the braces. The token }}
should not
appear within the {{...}}
section, even within a string, since
Yapps does not attempt to parse the Python statement. A workaround
for strings is to put two strings together ("}" "}"
), or to use
backslashes ("}\}"
). At the end of a rule you should use a
{{ return X }}
statement to return a value. However, you
should not use any control statements (return,
continue, break) in the middle of a rule. Yapps
needs to make assumptions about the control flow to generate a parser,
and any changes to the control flow will confuse Yapps.
The <<...>>
form can occur in two places: to define parameters
to a rule and to give arguments when matching a rule. Parameters use
the syntax used for Python functions, so they can include default
arguments and the special forms (*args
and **kwargs
).
Arguments use the syntax for Python function call arguments, so they
can include normal arguments and keyword arguments. The token
>>
should not appear within the <<...>>
section.
In both the statements and rule arguments, you can use names defined by the parser to refer to matched patterns. You can refer to the text matched by a named token by using the token name. You can use the value returned by a production rule by using the name of that rule. If a name X is matched more than once (such as in loops), you will have to save the earlier value(s) in a temporary variable, and then use that temporary variable in the return value. The next section has an example of a name that occurs more than once.
Yapps produces ELL(1) parsers, which determine which clause to match based on the first token available. Sometimes the leftmost tokens of several clauses may be the same. The classic example is the if/then/else construct in Pascal:
rule stmt: "if" expr "then" stmt {{ then_part = stmt }} "else" stmt {{ return ('If',expr,then_part,stmt) }} | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }}
(Note that we have to save the first stmt into a variable because there is another stmt that will be matched.) The left portions of the two clauses are the same, which presents a problem for the parser. The solution is left-factoring: the common parts are put together, and then a choice is made about the remaining part:
rule stmt: "if" expr "then" stmt {{ then_part = stmt }} {{ else_part = [] }} [ "else" stmt {{ else_part = stmt }} ] {{ return ('If', expr, then_part, else_part) }}
Unfortunately, the classic if/then/else situation is still ambiguous when you left-factor. Yapps can deal with this situation, but will report a warning; see section for details.
In general, replace rules of the form:
rule A: a b1 {{ return E1 }} | a b2 {{ return E2 }} | c3 {{ return E3 }} | c4 {{ return E4 }}
with rules of the form:
rule A: a ( b1 {{ return E1 }} | b2 {{ return E2 }} ) | c3 {{ return E3 }} | c4 {{ return E4 }}
A common construct in grammars is for matching a list of patterns, sometimes separated with delimiters such as commas or semicolons. In LR-based parser systems, we can parse a list with something like this:
rule sum: NUM {{ return NUM }} | sum "+" NUM {{ return (sum, NUM) }}
Parsing 1+2+3+4 would produce the output (((1,2),3),4), which is what we want from a left-associative addition operator. Unfortunately, this grammar is left recursive, because the sum rule contains a clause that begins with sum. (The recursion occurs at the left side of the clause.)
We must restructure this grammar to be right recursive instead:
rule sum: NUM {{ return NUM }} | NUM "+" sum {{ return (NUM, sum) }}
Unfortunately, using this grammar, 1+2+3+4 would be parsed as (1,(2,(3,4))), which no longer follows left associativity. The rule also needs to be left-factored. Instead, we write the pattern as a loop:
rule sum: NUM {{ v = NUM }} ( "[+]" NUM {{ v = (v,NUM) }} )* {{ return v }}
In general, replace rules of the form:
rule A: A a1 -> << E1 >> | A a2 -> << E2 >> | b3 -> << E3 >> | b4 -> << E4 >>
with rules of the form:
rule A: ( b3 {{ A = E3 }} | b4 {{ A = E4 }} ) ( a1 {{ A = E1 }} | a2 {{ A = E2 }} )* {{ return A }}
We have taken a rule that proved problematic for with recursion and turned it into a rule that works well with looping constructs.
In section we saw the classic if/then/else ambiguity, which occurs because the “else …” portion of an “if …then …else …” construct is optional. Programs with nested if/then/else constructs can be ambiguous when one of the else clauses is missing:
if 1 then if 1 then if 5 then if 5 then x := 1; x := 1; else else y := 9; y := 9;
The indentation shows that the program can be parsed in two different ways. (Of course, if we all would adopt Python’s indentation-based structuring, this would never happen!) Usually we want the parsing on the left: the “else” should be associated with the closest “if” statement. In section we “solved” the problem by using the following grammar:
rule stmt: "if" expr "then" stmt {{ then_part = stmt }} {{ else_part = [] }} [ "else" stmt {{ else_part = stmt }} ] {{ return ('If', expr, then_part, else_part) }}
Here, we have an optional match of “else” followed by a statement. The ambiguity is that if an “else” is present, it is not clear whether you want it parsed immediately or if you want it to be parsed by the outer “if”.
Yapps will deal with the situation by matching when the else pattern when it can. The parser will work in this case because it prefers the first matching clause, which tells Yapps to parse the “else”. That is exactly what we want!
For ambiguity cases with choices, Yapps will choose the first matching choice. However, remember that Yapps only looks at the first token to determine its decision, so (a b | a c) will result in Yapps choosing a b even when the input is a c. It only looks at the first token, a, to make its decision.
Amit J Patel, amitp@cs.stanford.edu