Next: 4. Customization Up: Yapps 1.0 Previous: 2. Examples

Subsections

3. Grammars

Each Yapps grammar has a name, a list of tokens, and a set of production rules. A grammar named X will be used to produce a parser named X and a scanner anmed XScanner. As in Python, names are case sensitive, start with a letter, and contain letters, numbers, and underscores (_).

There are three kinds of tokens in Yapps: named, inline, and ignored. As their name implies, named tokens are given a name, using the token construct: token name : regexp. In a rule, the token can be matched by using the name. Inline tokens are regular expressions that are used in rules without being declared. Ignored tokens are declared using the ignore construct: ignore: regexp. These tokens are ignored by the scanner, and are not seen by the parser. Often whitespace is an ignored token. The regular expressions used to define tokens are treated like those used in Python code, so backslashes and quotes may have to be doubled.

Production rules in Yapps ahve a name and a series of clauses, separated by bars (|): rule name : clause | …| clause. If the rule is parameterized, the name should be followed by a list of parameter names in <<...>>. Each clause is a list of patterns to match, an arrow (->), and a return value enclosed in <<...>>. A pattern can be the name of a named token, a regular expression in quotes (inline token), or the name of a production rule (followed by arguments in <<...>>, if the rule has parameters). The return value can use the text matched by a named token by using the token name. It can use the value returned by a production rule by using the name of that rule. If a name X occurs in more than one pattern, the return value can use X1, X2, etc. to refer to the occurrences of X, numbered from left to right.

Whenever <<...>> is used, a legal Python expression should be put inside the brackets. Python statements and comments are not allowed. The token >> should not appear within the <<...>> section, even within a string, since Yapps does not attempt to parse strings, parentheses, brackets, and braces within the <<...>>. A workaround for strings is to put two strings together (">" ">"), or to use backslashes (">\>").

3.1 Left Factoring

Yapps produces LL(1) parsers, which determine which clause to match based on the first token available. Sometimes the leftmost tokens of several clauses may be the same. The classic example is the if/then/else construct in Pascal:

rule stmt:  "if" expr "then" stmt "else" stmt  -> 
                << ('If',expr,stmt1,stmt1) >>
          | "if" expr "then" stmt ->
                << ('If',expr,stmt,[]) >>

The left portions of the two clauses are the same, which presents a problem for the parser. The solution is left-factoring: the common parts are put into a separate rule:

rule stmt:  "if" expr "then" stmt if_rest<<expr,stmt>>  ->  << if_rest >>
rule if_rest<<e,s>>:  "else" stmt -> << ('If',e,s,stmt) >>
                    |             -> << ('If',e,s,[]) >>

In general, replace rules of the form:

rule A:   a b1 -> << E1 >>
        | a b2 -> << E2 >>
        | c3   -> << E3 >>
        | c4   -> << E4 >>

with two rules of the form:

rule A:   a A_rest<<a>> -> << A_rest >>
        | c3   -> << E3 >>
        | c4   -> << E4 >>
rule A_rest<<a>>:   b1 -> << E1 >>
                  | b2 -> << E2 >>

For each rule, Yapps can determine which clause to match based on the first token available. Unfortunately, the classic if/then/else situation is still ambiguous when you left-factor. Yapps can deal with this situation, but will report a warning; see section 3.3 for details.

3.2 Left Recursion

A common construct in grammars is for matching a list of patterns, sometimes separated with delimiters such as commas or semicolons. In LR-based parser systems, we can parse a list with something like this:

rule sum:  NUM             -> << NUM >>
         | sum "+" NUM     -> << (sum, NUM) >>

Parsing 1+2+3+4 would produce the output (((1,2),3),4), which is what we want from a left-associative addition operator. Unfortunately, this grammar is left recursive, because the sum rule contains a clause that begins with sum. (The recursion occurs at the left side of the clause.)

We must restructure this grammar to be right recursive instead:

rule sum:  NUM             -> << NUM >>
         | NUM "+" sum     -> << (NUM, sum) >>

Unfortunately, using this grammar, 1+2+3+4 would be parsed as (1,(2,(3,4))), which no longer follows left associativity. The rule also needs to be left-factored. Instead, we write the following:

rule sum:               NUM sum_rest<<NUM>> -> << sum_rest >>
rule sum_rest<<left>>:                                 -> << left >>
                      | "+" NUM sum_rest<<(left,NUM)>> -> << sum_rest >>

Oddly enough, after left-factoring the resulting parser produces the desired left-associative parse tree. This version is also tail-recursive, so the resulting parser will be a loop instead of a recursive Python function.

In general, replace rules of the form:

rule A:  A a1 -> << E1 >> 
       | A a2 -> << E2 >>
       | b3   -> << E3 >>
       | b4   -> << E4 >>

with two rules of the form:

rule A:  b3 A_rest<<E3>> -> << A_rest >>
       | b4 A_rest<<E4>> -> << A_rest >>
rule A_rest<<A>>:  a1 A_rest<<E1>> -> << A_rest >>
                 | a2 A_rest<<E2>> -> << A_rest >>

As with left-factoring, we have taken a rule that proved problematic for Yapps (and other LL(1) parser generators) and split it into two rules. With Yapps we can use parameterized rules to pass information from one rule to the other, so using two rules instead of one does not lead to any signficant restrictions on the kinds of grammars that can be written.

3.3 Ambiguous Grammars

In section 3.1 we saw the classic if/then/else ambiguity, which occurs because the “else …” portion of an “if …then …else …” construct is optional. Programs with nested if/then/else constructs can be ambiguous when one of the else clauses is missing:

if 1 then            if 1 then
    if 5 then            if 5 then
        x := 1;              x := 1;
    else             else
        y := 9;          y := 9;

The indentation shows that the program can be parsed in two different ways. (Of course, if we all would adopt Python’s indentation-based structuring, this would never happen!) Usually we want the parsing on the left: the “else” should be associated with the closest “if” statement. In section 3.1 we “solved” the problem by using the following grammar:

rule stmt:  "if" expr "then" stmt if_rest<<expr,stmt>>  ->  << if_rest >>
rule if_rest<<e,s>>:  "else" stmt -> << ('If',e,s,stmt) >>
                    |             -> << ('If',e,s,[]) >>

Here, we have two clauses—one in which “else” is parsed and one in which nothing is parsed. The ambiguity is that if an “else” is present, it is not clear whether you want it parsed immediately or if you want it to be parsed by the outer “if”.

Yapps will deal with the situation by reporting that there is an ambiguity (warning you that the parser may not work), but it will still produce a parser. Despite the warning, the parser will work in this case because it prefers the first matching clause, which tells Yapps to parse the “else”. That is exactly what we want!

If you have the if/then/else ambiguity in your grammar, be sure to put the “else” clause before the empty clause. Yapps will produce a warning, but the parser will most likely work.

Next: 4. Customization Up: Yapps 1.0 Previous: 2. Examples

Amit J Patel, amitp@cs.stanford.edu