Subsections

3 Grammars

Each Yapps grammar has a name, a list of tokens, and a set of production rules. A grammar named X will be used to produce a parser named X and a scanner anmed XScanner. As in Python, names are case sensitive, start with a letter, and contain letters, numbers, and underscores (_).

There are three kinds of tokens in Yapps: named, inline, and ignored. As their name implies, named tokens are given a name, using the token construct: token name : regexp. In a rule, the token can be matched by using the name. Inline tokens are regular expressions that are used in rules without being declared. Ignored tokens are declared using the ignore construct: ignore: regexp. These tokens are ignored by the scanner, and are not seen by the parser. Often whitespace is an ignored token. The regular expressions used to define tokens should use the syntax defined in the re module, so some symbols may have to be backslashed.

Production rules in Yapps have a name and a pattern to match. If the rule is parameterized, the name should be followed by a list of parameter names in <<...>>. A pattern can be a simple pattern or a compound pattern. Simple patterns are the name of a named token, a regular expression in quotes (inline token), the name of a production rule (followed by arguments in <<...>>, if the rule has parameters), and single line Python statements ({{...}}). Compound patterns are sequences (A B C ...), choices ( A | B | C | ...), options ([...]), zero-or-more repetitions (...*), and one-or-more repetitions (...+). Like regular expressions, repetition operators have a higher precedence than sequences, and sequences have a higher precedence than choices.

Whenever {{...}} is used, a legal one-line Python statement should be put inside the braces. The token }} should not appear within the {{...}} section, even within a string, since Yapps does not attempt to parse the Python statement. A workaround for strings is to put two strings together ("}" "}"), or to use backslashes ("}\}"). At the end of a rule you should use a {{ return X }} statement to return a value. However, you should not use any control statements (return, continue, break) in the middle of a rule. Yapps needs to make assumptions about the control flow to generate a parser, and any changes to the control flow will confuse Yapps.

The <<...>> form can occur in two places: to define parameters to a rule and to give arguments when matching a rule. Parameters use the syntax used for Python functions, so they can include default arguments and the special forms (*args and **kwargs). Arguments use the syntax for Python function call arguments, so they can include normal arguments and keyword arguments. The token >> should not appear within the <<...>> section.

In both the statements and rule arguments, you can use names defined by the parser to refer to matched patterns. You can refer to the text matched by a named token by using the token name. You can use the value returned by a production rule by using the name of that rule. If a name X is matched more than once (such as in loops), you will have to save the earlier value(s) in a temporary variable, and then use that temporary variable in the return value. The next section has an example of a name that occurs more than once.

3.1 Left Factoring

Yapps produces ELL(1) parsers, which determine which clause to match based on the first token available. Sometimes the leftmost tokens of several clauses may be the same. The classic example is the if/then/else construct in Pascal:

rule stmt:  "if" expr "then" stmt {{ then_part = stmt }} 
                      "else" stmt {{ return ('If',expr,then_part,stmt) }}
          | "if" expr "then" stmt {{ return ('If',expr,stmt,[]) }}

(Note that we have to save the first stmt into a variable because there is another stmt that will be matched.) The left portions of the two clauses are the same, which presents a problem for the parser. The solution is left-factoring: the common parts are put together, and then a choice is made about the remaining part:

rule stmt:  "if" expr 
              "then" stmt {{ then_part = stmt }}
              {{ else_part = [] }}
              [ "else" stmt {{ else_part = stmt }} ]
            {{ return ('If', expr, then_part, else_part) }}

Unfortunately, the classic if/then/else situation is still ambiguous when you left-factor. Yapps can deal with this situation, but will report a warning; see section for details.

In general, replace rules of the form:

rule A:   a b1 {{ return E1 }}
        | a b2 {{ return E2 }}
        | c3   {{ return E3 }}
        | c4   {{ return E4 }}

with rules of the form:

rule A:   a ( b1 {{ return E1 }}
            | b2 {{ return E2 }}
            )
        | c3   {{ return E3 }}
        | c4   {{ return E4 }}

3.2 Left Recursion

A common construct in grammars is for matching a list of patterns, sometimes separated with delimiters such as commas or semicolons. In LR-based parser systems, we can parse a list with something like this:

rule sum:  NUM             {{ return NUM }}
         | sum "+" NUM     {{ return (sum, NUM) }}

Parsing 1+2+3+4 would produce the output (((1,2),3),4), which is what we want from a left-associative addition operator. Unfortunately, this grammar is left recursive, because the sum rule contains a clause that begins with sum. (The recursion occurs at the left side of the clause.)

We must restructure this grammar to be right recursive instead:

rule sum:  NUM             {{ return NUM }}
         | NUM "+" sum     {{ return (NUM, sum) }}

Unfortunately, using this grammar, 1+2+3+4 would be parsed as (1,(2,(3,4))), which no longer follows left associativity. The rule also needs to be left-factored. Instead, we write the pattern as a loop:

rule sum:       NUM {{ v = NUM }}
                ( "[+]" NUM {{ v = (v,NUM) }} )*
                {{ return v }}

In general, replace rules of the form:

rule A:  A a1 -> << E1 >> 
       | A a2 -> << E2 >>
       | b3   -> << E3 >>
       | b4   -> << E4 >>

with rules of the form:

rule A:  ( b3 {{ A = E3 }} 
         | b4 {{ A = E4 }} )
         ( a1 {{ A = E1 }}
         | a2 {{ A = E2 }} )*
         {{ return A }}

We have taken a rule that proved problematic for with recursion and turned it into a rule that works well with looping constructs.

3.3 Ambiguous Grammars

In section we saw the classic if/then/else ambiguity, which occurs because the “else …” portion of an “if …then …else …” construct is optional. Programs with nested if/then/else constructs can be ambiguous when one of the else clauses is missing:

if 1 then            if 1 then
    if 5 then            if 5 then
        x := 1;              x := 1;
    else             else
        y := 9;          y := 9;

The indentation shows that the program can be parsed in two different ways. (Of course, if we all would adopt Python’s indentation-based structuring, this would never happen!) Usually we want the parsing on the left: the “else” should be associated with the closest “if” statement. In section we “solved” the problem by using the following grammar:

rule stmt:  "if" expr 
              "then" stmt {{ then_part = stmt }}
              {{ else_part = [] }}
              [ "else" stmt {{ else_part = stmt }} ]
            {{ return ('If', expr, then_part, else_part) }}

Here, we have an optional match of “else” followed by a statement. The ambiguity is that if an “else” is present, it is not clear whether you want it parsed immediately or if you want it to be parsed by the outer “if”.

Yapps will deal with the situation by matching when the else pattern when it can. The parser will work in this case because it prefers the first matching clause, which tells Yapps to parse the “else”. That is exactly what we want!

For ambiguity cases with choices, Yapps will choose the first matching choice. However, remember that Yapps only looks at the first token to determine its decision, so (a b | a c) will result in Yapps choosing a b even when the input is a c. It only looks at the first token, a, to make its decision.

Amit J Patel, amitp@cs.stanford.edu