Subsections

5 Parser Mechanics

The base parser class (Parser) defines two methods, _scan and _peek, and two fields, _pos and _scanner. The generated parser inherits from the base parser, and contains one method for each rule in the grammar. To avoid name clashes, do not use names that begin with an underscore (_).

5.1 Parser Objects

Yapps produces as output two exception classes, a scanner class, a parser class, and a function parse that puts everything together. The parse function does not have to be used; instead, one can create a parser and scanner object and use them together for parsing.

    def parse(rule, text):
        P = X(XScanner(text))
        return wrap_error_reporter(P, rule)

The parse function takes a name of a rule and an input string as input. It creates a scanner and parser object, then calls wrap_error_reporter to execute the method in the parser object named rule. The wrapper function will call the appropriate parser rule and report any parsing errors to standard output.

There are several situations in which the parse function would not be useful. If a different parser or scanner is being used, or exceptions are to be handled differently, a new parse function would be required. The supplied parse function can be used as a template for writing a function for your own needs. An example of a custom parse function is the generate function in Yapps.py.

5.2 Context Sensitive Scanner

Unlike most scanners, the scanner produced by Yapps can take into account the context in which tokens are needed, and try to match only good tokens. For example, in the grammar:

parser IniFile:
   token ID:   "[a-zA-Z_0-9]+"
   token VAL:  ".*"

   rule pair:  ID "[ \t]*=[ \t]*" VAL "\n"

we would like to scan lines of text and pick out a name/value pair. In a conventional scanner, the input string shell=progman.exe would be turned into a single token of type VAL. The Yapps scanner, however, knows that at the beginning of the line, an ID is expected, so it will return “shell” as a token of type ID. Later, it will return “progman.exe” as a token of type VAL.

Context sensitivity decreases the separation between scanner and parser, but it is useful in parsers like IniFile, where the tokens themselves are not unambiguous, but are unambiguous given a particular stage in the parsing process.

Unfortunately, context sensitivity can make it more difficult to detect errors in the input. For example, in parsing a Pascal-like language with “begin” and “end” as keywords, a context sensitive scanner would only match “end” as the END token if the parser is in a place that will accept the END token. If not, then the scanner would match “end” as an identifier. To disable the context sensitive scanner in Yapps, add the context-insensitive-scanner option to the grammar:

Parser X:
    option:  "context-insensitive-scanner"

Context-insensitive scanning makes the parser look cleaner as well.

5.3 Internal Variables

There are two internal fields that may be of use. The parser object has two fields, _pos, which is the index of the current token being matched, and _scanner, which is the scanner object. The token itself can be retrieved by accessing the scanner object and calling the token method with the token index. However, if you call token before the token has been requested by the parser, it may mess up a context-sensitive scanner. (When using a context-sensitive scanner, the parser tells the scanner what the valid token types are at each point. If you call token before the parser can tell the scanner the valid token types, the scanner will attempt to match without considering the context.) A potentially useful combination of these fields is to extract the portion of the input matched by the current rule. To do this, just save the scanner state (_scanner.pos) before the text is matched and then again after the text is matched:

  rule R: 
      {{ start = self._scanner.pos }}
      a b c 
      {{ end = self._scanner.pos }}
      {{ print 'Text is', self._scanner.input[start:end] }}

5.4 Pre- and Post-Parser Code

Sometimes the parser code needs to rely on helper variables, functions, and classes. A Yapps grammar can optionally be surrounded by double percent signs, to separate the grammar from Python code.

... Python code ...
%%
... Yapps grammar ...
%%
... Python code ...

The second %% can be omitted if there is no Python code at the end, and the first %% can be omitted if there is no extra Python code at all. (To have code only at the end, both separators are required.)

If the second %% is omitted, Yapps will insert testing code that allows you to use the generated parser to parse a file.

The extended calculator example in the Yapps examples subdirectory includes both pre-parser and post-parser code.

5.5 Representation of Grammars

For each kind of pattern there is a class derived from Pattern. Yapps has classes for Terminal, NonTerminal, Sequence, Choice, Option, Plus, Star, and Eval. Each of these classes has the following interface:

setup(gen): Set accepts-ϵ, and call gen.changed() if it changed. This function can change the flag from false to true but not from true to false.
update((gen)): Set FIRST and FOLLOW, and call gen.changed() if either changed. This function can add to the sets but not remove from them.
output(gen, indent): Generate code for matching this rule, using indent as the current indentation level. Writes are performed using gen.write.
used(vars): Given a list of variables vars, return two lists: one containing the variables that are used, and one containing the variables that are assigned. This function is used for optimizing the resulting code.

Both setup and update monotonically increase the variables they modify. Since the variables can only increase a finite number of times, we can repeatedly call the function until the variable stabilized. The used function is not currently implemented.

With each pattern in the grammar Yapps associates three pieces of information: the FIRST set, the FOLLOW set, and the accepts-ϵ flag.

The FIRST set contains the tokens that can appear as we start matching the pattern. The FOLLOW set contains the tokens that can appear immediately after we match the pattern. The accepts-ϵ flag is true if the pattern can match no tokens. In this case, FIRST will contain all the elements in FOLLOW. The FOLLOW set is not needed when accepts-ϵ is false, and may not be accurate in those cases.

Yapps does not compute these sets precisely. Its approximation can miss certain cases, such as this one:

  rule C: ( A* | B )
  rule B: C [A]

Yapps will calculate C’s FOLLOW set to include A. However, C will always match all the A’s, so A will never follow it. Yapps 2.0 does not properly handle this construct, but if it seems important, I may add support for it in a future version.

Yapps also cannot handle constructs that depend on the calling sequence. For example:

  rule R: U | 'b'
  rule S:   | 'c'
  rule T: S 'b'
  rule U: S 'a'

The FOLLOW set for S includes a and b. Since S can be empty, the FIRST set for S should include a, b, and c. However, when parsing R, if the lookahead is b we should not parse U. That’s because in U, S is followed by a and not b. Therefore in R, we should choose rule U only if there is an a or c, but not if there is a b. Yapps and many other LL(1) systems do not distinguish S b and S a, making S’s FOLLOW set a, b, and making R always try to match U. In this case we can solve the problem by changing R to 'b' | U but it may not always be possible to solve all such problems in this way.

Amit J Patel, amitp@cs.stanford.edu