Next: 6. Future Extensions Up: Yapps 1.0 Previous: 4. Customization

Subsections

5. Parser Mechanics

The parser class contains one method for each rule in the grammar. These methods return a pair, (value,index) where the value is a user-defined value and the index points to the next token that has not been used. The indices are the same as those passed to the token method in the scanner.

5.1 Parser Objects

Yapps produces as output two exception classes, a scanner class, a parser class, and a function parse that puts everything together. The parse function does not have to be used; instead, one can create a parser and scanner object and use them together for parsing.

def parse(rule, input):
    parser = X(XScanner(input))
    try: return getattr(parser, rule)()[0]
    except SyntaxError, s:
        try:
            import yapps
            yapps.print_error(input, s, parser.scanner)
        except ImportError:
            print "Syntax Error",s.msg,"on line",1+count(input[:s.pos], "\n")
    except NoMoreTokens:
        print "Ran out of input"

The parse function takes a name of a rule and an input string as input. It creates a scanner and parser object, then tries to execute the method in the parser object named rule. (The parser object contains one method for each of the rules in the grammar.) The parser method returns a pair: the value specified in the <<...>> part of the grammar, and the number of tokens that were read from the string. (The latter can be used if we wanted to parse something else from the string.) The parse function throws away the token count and just returns the first element of the tuple.

There are several situations in which the parse function would not be useful. If a different parser or scanner is being used, the token count is needed, or a exceptions are to be handled differently, a new parse function would be required. The supplied parse function can be used as a template for writing a function for your own needs. An example of a custom parse function is the generate function in Yapps.py.

5.2 Context Sensitive Scanner

Unlike most scanners, the scanner produced by Yapps can take into account the context in which tokens are needed, and try to match only good tokens. For example, in the grammar:

parser IniFile:
   token ID:   "[a-zA-Z_0-9]+"
   token VAL:  ".*"

   rule pair:  ID "[ \t]*=[ \t]*" VAL "\n"

we would like to scan lines of text and pick out a name/value pair. In a conventional scanner, the input string shell=progman.exe would be turned into a single token of type VAL. The Yapps scanner, however, knows that at the beginning of the line, an ID is expected, so it will return "shell" as a token of type ID. Later, it will return "progman.exe" as a token of type VAL.

Context sensitivity decreases the separation between scanner and parser, but it is useful in parsers like IniFile, where the tokens themselves are not unambiguous, but are unambiguous given a particular stage in the parsing process.

Unfortunately, context sensitivity can make it more difficult to detect errors in the input. For example, in parsing a Pascal-like language with “begin” and “end” as keywords, a context sensitive scanner would only match “end” as the END token if the parser is in a place that will accept the END token. If not, then the scanner would match “end” as an identifier. To disable the context sensitive scanner in Yapps, use the context-insensitive-scanner option, either from the command line or in the parser:

Parser X:
    option:  "context-insensitive-scanner"

5.3 Internal Variables

There are two internal variables that may be of use. The current rule begins scanning the input string at character position _start_. The index of the current token being matched is _pos_. The token can be retrieved by accessing the scanner object, self.scanner. For example, to retrieve the four-element tuple representing the first token of the current rule, use self.scanner.token(_pos_). A potentially useful combination of these two variables is to extract the portion of the input matched by the current rule. The second element of a token tuple is the end character position, so we want to extract the string from _start_ to self.scanner.token(_pos_)[1]:

  rule R: a b c -> 
   << self.scanner.input[_start_:self.scanner.token(_pos_)[1]] >>

5.4 Pre- and Post-Parser Code

Sometimes the parser code needs to rely on helper variables, functions, and classes. A Yapps grammar can optionally be surrounded by double percent signs, to separate the grammar from Python code.

... Python code ...
%%
... Yapps grammar ...
%%
... Python code ...

The second %% can be omitted if there is no Python code at the end, and the first %% can be omitted if there is no extra Python code at all. (To have code only at the end, both separators are required.)

5.5 Python 1.5 Regular Expressions

If you want to use “new style” Python 1.5 regular expressions add an option to your grammar before you define tokens:

parser X:
    option:   "use-new-regexps"
    token:    ...

Yapps will then generate code to use the new regular expression module (re) rather than the old one (regex).

5.6 Optimizations Performed by Yapps

Yapps does several things that simplify the generated parser code, to make it faster, shorter, and more readable. Known changes to the internal variable _pos_ are propagated to future statements instead of performing assignments to modify the variable. Also, some variables that would be assigned to, but never used, are not assigned. (The analysis of this situation is not as comprehensive as it could be.)

Yapps looks for a special case that commonly occurs when matching lists of patterns. Lists are commonly expressed with one of these two constructs:

    rule Xlist:       X Xlist -> << [X] + Xlist >>
                    |         -> << [] >>
    rule X:           ...

    rule Ylist<<r>>:  Y Ylist<<r+[Y]>> -> << Ylist >>
                    |                  -> << r >>
    rule Y:           ...

Both constructs produce the resulting list using recursion instead of looping (since looping is not available in Yapps). The second version, however, is called tail-recursive. To be a tail call, the return value in the recursive clause has to be simply the name of the last subrule matched; no computation can be performed on it before it is returned. The first clause of Ylist is a tail call because it returns Ylist, and the last subrule matched was Ylist<<...>>. To be tail recursive, it must be a tail call, and it must also be the same as the rule being defined. Since we are defining Ylist and returning Ylist, the clause is tail recursive.

Yapps will detect tail calls and make a minor optimization for them. The real win is in tail recursion: Yapps will turn the recursive function into a while loop.

Next: 6. Future Extensions Up: Yapps 1.0 Previous: 4. Customization

Amit J Patel, amitp@cs.stanford.edu