2. Examples

This section presents several examples that show Yapps in use. First, an introduction shows how to construct grammars and write them in Yapps form; readers already familiar with grammars and parsing can skip it. Next is a Lisp expression grammar that produces a parse tree as output; it demonstrates the use of tokens and rules, as well as returning values from rules. The third example is a calculator grammar that evaluates expressions as it parses rather than producing a parse tree; it also shows the parameterization of rules. The examples in this document use old-style (Python 1.4) regular expressions, except where noted. Yapps can also be used with new-style (Python 1.5) regular expressions; see section 5.5 for details.

2.1 Introduction to Grammars

A grammar for a natural language specifies how words can be put together to form large structures, such as phrases and sentences. A grammar for a computer language is similar in that it specifies how small components (called tokens) can be put together to form larger structures. In this section we will write a grammar for a tiny subset of English.

Simple English sentences can be described as being a noun phrase followed by a verb followed by a noun phrase. For example, in the sentence, “Jack sank the blue ship,” the word “Jack” is the first noun phrase, “sank” is the verb, and “the blue ship” is the second noun phrase. In addition we should say what a noun phrase is; for this example we shall say that a noun phrase is an optional article (a, an, the) followed by any number of adjectives followed by a noun. The tokens in our language are the articles, nouns, verbs, and adjectives. The rules in our language will tell us how to combine the tokens together to form lists of adjectives, noun phrases, and sentences:
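
    sentence:        noun_phrase verb noun_phrase
    noun_phrase:     opt_article adjective_list noun
    opt_article:        | article
    adjective_list:     | adjective adjective_list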

Notice that some things we said easily in English, such as “optional article,” are expressed by separate rules. When we said “any number of adjectives,” we made a rule, adjective_list, that makes it explicit: an adjective list is either an adjective followed by another adjective list, or nothing at all.

A minor note about recursive descent parsers, like those that Yapps produces: the parser reads from left to right, so the rules have to be written so that the parser can determine which alternative to use based on what appears first. This is why we wrote an adjective list as adjective adjective_list rather than adjective_list adjective. It is easy for Yapps to see that something starts with an adjective, but much more difficult to determine whether it starts with an adjective list.

The grammar given above is close to a Yapps grammar. We also have to specify what the tokens are, and what to return every time a rule is matched. We'll just use … for the return values, since they aren't important in this example. Return values (-> <<...>>) will be described in the next example.

parser TinyEnglish:
  ignore:          "[ \t\n\r]+"
  token noun:      "\(Jack\|spam\|ship\)"
  token verb:      "\(sank\|threw\)"
  token article:   "\(a\|an\|the\)"
  token adjective: "\(blue\|red\|green\)"

  rule sentence:       noun_phrase verb noun_phrase     ->  <<…>>
  rule noun_phrase:    opt_article adjective_list noun  ->  <<…>>
  rule opt_article:    article                          ->  <<…>>
                     |                                  ->  <<…>>
  rule adjective_list: adjective adjective_list         ->  <<…>>
                     |                                  ->  <<…>>

The tokens are specified as Python regular expressions. Since Yapps produces Python code, you can write any regular expression that Python accepts, using extra backslashes where Python's string syntax requires them. (Note: by default, Yapps uses old-style regular expressions; see section 5.5 if you want to use Python 1.5 style regular expressions.) In addition to the tokens that you want to see (which are given names), you can also specify tokens to ignore, marked by the ignore keyword. In this parser we want to ignore whitespace.

The TinyEnglish grammar shows how you define tokens and rules, but it does not specify what should happen once we’ve matched the rules. In the next example, we will take a grammar and produce a parse tree from it.

2.2 Lisp Expressions

Lisp syntax, although hated by many, has a redeeming quality: it is simple to parse. In this section we will construct a Yapps grammar to parse Lisp expressions and produce a parse tree as output.

Defining the Grammar

The syntax of Lisp is simple. It has expressions, which are identifiers, strings, numbers, and lists. A list is a left parenthesis followed by some number of expressions (separated by spaces) followed by a right parenthesis. For example, 5, "ni", and (print "1+2 = " (+ 1 2)) are Lisp expressions. Written as a grammar,

    expr:   ID | STR | NUM | list
    list:   ( seq )  
    seq:       | expr seq

In addition to having a grammar, we need to specify what value to return every time something is matched. For the tokens, which are strings, we just want to get the “value” of the token, attach its type (identifier, string, or number) in some way, and return it. For the lists, we want to construct and return a Python list.

The value to return is a Python expression enclosed in <<...>>. Within this expression, we can refer to the values returned by matching each part of the rule. In matching expr seq, we can refer to “expr” as the value returned by matching the expr part and to “seq” as the value returned by matching the seq part. Let’s take a look at the rule:

    rule seq:                    ->  << [] >>
               | expr seq        ->  << [expr] + seq >>

If the sequence is empty, then we return an empty list. If it is not empty, then it is an expression followed by the rest of the sequence. The seq rule returns a list, so we produce a new list by concatenating the one-element list [expr] with the list returned by seq.
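
To make the recursion concrete, here is a small hand-written sketch in plain Python. It is our own illustration, not code that Yapps generates; parse_seq plays the role of the seq rule, applied to a list of already-matched expression values:

    def parse_seq(exprs):
        if not exprs:                        # the empty alternative: -> << [] >>
            return []
        first, rest = exprs[0], exprs[1:]
        return [first] + parse_seq(rest)     # expr seq: -> << [expr] + seq >>

    print(parse_seq(['a', 'b', 'c']))        # prints ['a', 'b', 'c']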

In a rule, tokens return the text that was matched. We may need to convert this text to a different form. For example, if a string is matched, we want to remove the quotes and handle special forms like \n. If a number is matched, we want to convert it into a number. Also, for the Lisp parser we are going to attach type tags to the values. Let's look at the return values for each of the tokens:

    rule expr:   ID              ->  << ('id',ID) >>
               | STR             ->  << ('str',eval(STR)) >>
               | NUM             ->  << ('num',atoi(NUM)) >>

If we get an identifier, then we just add the 'id' tag and return the type and the value as a tuple. If we get a string, we want to remove the quotes and process any special backslash codes, so we run eval on the quoted string. If we get a number, we convert it to an integer with atoi and then return the number along with its type tag.
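
The conversions themselves are ordinary Python. Here is what each return expression computes on some sample token text (atoi comes from the string module in Python 1.x; this sketch uses the modern equivalent int):

    ID_text  = 'print'
    STR_text = '"1+2 = "'              # the matched text still includes quotes
    NUM_text = '487'

    print(('id', ID_text))             # ('id', 'print')
    print(('str', eval(STR_text)))     # ('str', '1+2 = ')  -- eval removes the
                                       #   quotes and processes escapes like \n
    print(('num', int(NUM_text)))      # ('num', 487)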

The complete grammar is specified as follows:

parser Lisp:
    ignore:      '[ \t\n\r]+'
    token NUM:   '[0-9]+'
    token ID:    '[-+*/!@%^&=.a-zA-Z0-9_]+' 
    token STR:   '"\([^\\"]+\|\\\\.\)*"'

    rule expr:   ID              ->  << ('id',ID) >>
               | STR             ->  << ('str',eval(STR)) >>
               | NUM             ->  << ('num',atoi(NUM)) >>
               | list            ->  << list >>
    rule list:   "(" seq ")"     ->  << seq >>
    rule seq:                    ->  << [] >>
               | expr seq        ->  << [expr] + seq >>

One thing you may have noticed is that "(" and ")" appear in the list rule. These are inline tokens: they appear in the rules without being given a name with the token keyword. Inline tokens are more convenient to use, but since they do not have a name, the text that is matched cannot be used in the return value. They are best used for short simple patterns (usually punctuation).

Another thing to notice is that the number and identifier tokens overlap. For example, “487” matches both NUM and ID. In Yapps, the scanner only tries to match tokens that are acceptable to the parser. That restriction doesn't help here, since both NUM and ID can appear in the same place in the grammar. Yapps uses two rules to pick a token when more than one matches. The first is that the longest match is preferred: “487x” will match as an ID (487x) rather than as a NUM (487) followed by an ID (x). The second is that if the two matches are the same length, the token listed first in the grammar is preferred: “487” will match as a NUM rather than an ID because NUM is listed first in the grammar. Inline tokens take precedence over any tokens you have listed.
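
To see these two rules at work, here is a small demonstration using Python's re module with new-style versions of the NUM and ID patterns. It illustrates only the matching rules; it is not how the Yapps scanner is implemented:

    import re

    NUM = re.compile('[0-9]+')
    ID  = re.compile('[-+*/!@%^&=.a-zA-Z0-9_]+')

    print(NUM.match('487x').group())   # '487'  -- NUM stops at the 'x'
    print(ID.match('487x').group())    # '487x' -- a longer match, so ID wins
    print(NUM.match('487').group())    # '487'  -- same length as ID's match,
    print(ID.match('487').group())     # '487'     so first-listed NUM wins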

Now that our grammar is defined, we can run Yapps to produce a parser, and then run the parser to produce a parse tree.

Running Yapps

The Yapps module provides a function generate that takes an input filename and writes a parser to another file. We can use this function to generate the Lisp parser from our grammar, which we assume is stored in lisp.g.

% python
Python 1.4 (Nov 17 1996)  [GCC 2.7.2.1]
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import yapps
>>> yapps.generate('lisp.g')

At this point, Yapps has written a file lisp.py that contains the parser. In that file are two classes (one scanner and one parser) and a function (called parse) that puts things together for you.

Alternatively, we can run Yapps from the command line to generate the parser file:

% python yapps.py lisp.g

After running Yapps either from within Python or from the command line, we can use the Lisp parser by calling the parse function. The first parameter should be the rule we want to match, and the second parameter should be the string to parse.

>>> import lisp
>>> lisp.parse('expr', '(+ 3 4)')
[('id', '+'), ('num', 3), ('num', 4)]
>>> lisp.parse('expr', '(print "3 = " (+ 1 2))')
[('id', 'print'), ('str', '3 = '), [('id', '+'), ('num', 1), ('num', 2)]]

The parse function is not the only way to use the parser; section 5.1 describes how to access parser objects directly.

We’ve now gone through the steps in creating a grammar, writing a grammar file for Yapps, producing a parser, and using the parser. In the next example we’ll see how rules can take parameters and also how to do computations instead of just returning a parse tree.

2.3 Calculator

A parser commonly given as an example in textbooks is one for simple arithmetic expressions, with numbers, addition, subtraction, multiplication, division, and parenthesized subexpressions. We'll write this example in Yapps, using parameterization of rules to pass partially computed values from one rule to another.

Unlike yacc, Yapps does not have any way to specify precedence rules, so we have to encode precedence in the grammar ourselves. In this grammar, an expression is a sum or difference of factors, a factor is a product or quotient of terms, and a term is a number or a parenthesized expression (note that this grammar uses the names factor and term in the reverse of their usual textbook roles):

    expr:           factor expr_tail
    expr_tail:         | "+" factor expr_tail | "-" factor expr_tail
    factor:         term factor_tail
    factor_tail:       | "*" term factor_tail | "/" term factor_tail
    term:           NUM | "(" expr ")"

In order to evaluate the expression as we go, we should carry an accumulator along while evaluating the lists of terms or factors. In the clause "+" factor expr_tail, we want to add the evaluation of factor to the accumulator and then pass the accumulator to expr_tail so that it can continue adding to or subtracting from it. The accumulator is a parameter passed to the two tail rules; each time we call a tail rule, we pass in a new value for the accumulator. The full grammar is given below:

parser Calculator:
  token END: "\'"
  token NUM: "[0-9]+"

  rule goal:           expr END                         -> << expr        >>
  rule expr:           factor expr_tail<<factor>>       -> << expr_tail   >>
  rule expr_tail<<v>>:                                  -> << v           >>
                     | "+" factor expr_tail<<v+factor>> -> << expr_tail   >>
                     | "-" factor expr_tail<<v-factor>> -> << expr_tail   >>
  rule factor:         term factor_tail<<term>>         -> << factor_tail >>
  rule factor_tail<<v>>:                                -> << v           >>
                     | "*" term factor_tail<<v*term>>   -> << factor_tail >>
                     | "/" term factor_tail<<v/term>>   -> << factor_tail >>
  rule term:           NUM                              -> << atoi(NUM)   >>
                     | "(" expr ")"                     -> << expr        >>

The top-level rule is goal, which says that we are looking for an expression followed by the end of the string. The END token is needed because without it, it isn't clear when to stop parsing. For example, the string “1+3” could be parsed either as the expression “1” followed by the string “+3”, or as the expression “1+3”. By requiring expressions to end with END, the parser is forced to parse all of “1+3”.

In the two tail rules, the accumulator is named v. Before calling the tail rule again, the accumulator is modified by adding, subtracting, multiplying by, or dividing by the first element of the list. The recursive call to the tail rule will return the value of the accumulator after all changes have been made, so we just return the value that the recursive call returned. Those of you familiar with Scheme or ML programming will recognize this as a tail-recursive accumulator-style loop. Yapps recognizes this kind of recursion, and produces a while loop instead of a recursive function; see section 5.6 for details.
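
To make this concrete, here is a minimal hand-written evaluator in modern Python. It is our own sketch, not actual Yapps output, and the tokenize helper is ours; it mirrors the grammar above, with each tail rule's accumulator v threaded through a while loop:

    import re

    def tokenize(s):
        # NUM or one of the inline tokens; anything else (whitespace) is skipped
        return re.findall(r'[0-9]+|[-+*/()]', s)

    def expr(toks):                        # rule expr, with expr_tail as a loop
        v = factor(toks)                   # expr_tail<<factor>>
        while toks and toks[0] in '+-':
            op = toks.pop(0)
            f = factor(toks)
            if op == '+':
                v = v + f                  # expr_tail<<v+factor>>
            else:
                v = v - f                  # expr_tail<<v-factor>>
        return v                           # the empty alternative: -> << v >>

    def factor(toks):                      # rule factor, with factor_tail as a loop
        v = term(toks)
        while toks and toks[0] in '*/':
            op = toks.pop(0)
            t = term(toks)
            if op == '*':
                v = v * t                  # factor_tail<<v*term>>
            else:
                v = v / t                  # factor_tail<<v/term>> (float
                                           #   division in modern Python)
        return v

    def term(toks):                        # rule term
        if toks[0] == '(':
            toks.pop(0)                    # "("
            v = expr(toks)
            toks.pop(0)                    # ")"
            return v
        return int(toks.pop(0))            # NUM -> << atoi(NUM) >>

    print(expr(tokenize('1+2*3')))         # 7
    print(expr(tokenize('(1+2)*3')))       # 9

Yapps generates this kind of loop for you from the grammar; the sketch only shows the shape of the code you would otherwise write by hand.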

The calculator example shows how to process lists of elements using parameters to rules, as well as how to handle precedence of operators. In an appendix, we will extend the calculator example to work with variables and assignment statements.

Note: It’s often important to put the END token in, so put it in unless you are sure that your grammar has some other non-ambiguous token marking the end of the program.

