ANTLR Tree Parsers

Or, The Entity Formerly Known As SORCERER

ANTLR 2.xx helps you build intermediate form trees (ASTs) by augmenting a grammar with tree operators, rewrite rules, and actions. ANTLR also allows you to specify the grammatical structure of ASTs, thus, supporting the manipulation or simple walking of trees to produce translations.

Formerly, a separate tool, SORCERER, was used to generate tree parsers, but ANTLR has taken over this role. ANTLR now builds recognizers for streams of characters, tokens, or tree nodes.

What's a tree parser?

Parsing is the application of grammatical structure to a stream of input symbols. ANTLR takes this further than most tools and considers a tree to be a stream of nodes, albeit in two dimensions. In fact, the only real difference in ANTLR's code generation for token stream parsing versus tree parsing lies in the testing of lookahead, rule-method definition headers, and the introduction of a two-dimensional tree structure code-generation template.

What kinds of trees can be parsed?

ANTLR tree parsers can walk any tree that implements the AST interface, which imposes a child-sibling like structure to whatever tree data-structure you may have. The important navigation methods are:

getFirstChild: Return a reference to the first child of the sibling list.
getNextSibling: Return a reference to the next child in the list of siblings.

Each AST node is considered to have a list of children, some text, and a "token type". Trees are self-similar in that a tree node is also a tree. An AST is defined completely as:


/** Minimal AST node interface used by ANTLR AST generation
 * and tree-walker.
 */
public interface AST {
    /** Add a (rightmost) child to this node */
    public void addChild(AST c);
    public boolean equals(AST t);
    public boolean equalsList(AST t);
    public boolean equalsListPartial(AST t);
    public boolean equalsTree(AST t);
    public boolean equalsTreePartial(AST t);
    public ASTEnumeration findAll(AST tree);
    public ASTEnumeration findAllPartial(AST subtree);
    /** Get the first child of this node; null if no children */
    public AST getFirstChild();
    /** Getthe next sibling in line after this one */
    public AST getNextSibling();
    /** Get the token text for this node */
    public String getText();
    /** Get the token type for this node */
    public int getType();
    /** Get number of children of this node; if leaf, returns 0 */
    public int getNumberOfChildren();
    public void initialize(int t, String txt);
    public void initialize(AST t);
    public void initialize(Token t);
    /** Set the first child of a node. */
    public void setFirstChild(AST c);
    /** Set the next sibling after this one. */
    public void setNextSibling(AST n);
    /** Set the token text for this node */
    public void setText(String text);
    /** Set the token type for this node */
    public void setType(int ttype);
    public String toString();
    public String toStringList();
    public String toStringTree();
}

Tree grammar rules

As with the SORCERER tool of PCCTS 1.33 and the ANTLR token grammars, tree grammars are collections of EBNF rules embedded with actions, semantic predicates, and syntactic predicates.


rule:	alternative1
    |	alternative2
   ...
    |	alternativen
    ;

Each alternative production is composed of a list of elements where an element can be one of the items in a regular ANTLR grammar with the addition of the tree pattern element, which has the form:


#( root-token child1 child2 ... childn )

For example, the following tree pattern matches a simple PLUS-rooted tree with two INT children:=


#( PLUS INT INT )

The root of a tree pattern must be a token reference, but the children elements can even be subrules. For example, a common structure is an if-then-else tree where the else-clause statement subtree is optional:


#( IF expr stat (stat)? )

An important thing to remember when specifying tree patterns and tree grammars in general is that sufficient matches are done not exact matches. As long as the tree satistfies the pattern, a match is reported, regardless of how much is left unparsed. For example, #( A B ) will report a match for any larger tree with the same structure such as #( A #(B C) D).

Syntactic predicates

ANTLR tree parsers use only a single symbol of lookahead, which is normally not a problem as intermediate forms are explicitly designed to be easy to walk. However, there is occasionally the need to distinguish between similar tree structures. Syntactic predicates can be used to overcome the limitations of limited fixed lookahead. For example, distinguishing between the unary and binary minus operator is best done by using operator nodes of differing token types, but given the same root node, a syntactic predicate can be used to distinguish between these structures:


expr:   ( #(MINUS expr expr) )=> #( MINUS expr expr )
    |   #( MINUS expr )
   ...
    ;

The order of evaluation is very important as the second alternative is a "subset" of the first alternative.

Semantic predicates

Semantic predicates at the start of an alternative are simply incorporated into the alternative prediction expressions as with a regular grammar. Semantic predicates in the middle of productions throw exceptions when they evaluate to false just like a regular grammar.

An Example Tree Walker

Consider how you would build a simple calculator. One approach would be to build a parser that computed expression values as it recognized the input. For the purposes of illustration, we will build a parser that constructs a tree intermediate representation of the input expression and a tree parser that walks the intermediate representation, computing the result.

Our recognizer, CalcParser, is defined via the following grammar.


class CalcParser extends Parser;
options {
    buildAST = true;   // uses CommonAST by default
}

expr:   mexpr (PLUS^ mexpr)* SEMI!
    ;

mexpr
    :   atom (STAR^ atom)*
    ;

atom:   INT
    ;

The PLUS and STAR tokens are considered operators and, hence, subtree roots; they are annotated with the '^' character. The SEMI token reference is suffixed with the '!' character to indicate it should not be included in the tree.

The scanner for this calculator is defined as follows:


class CalcLexer extends Lexer;

WS	:	(' '
	|	'\t'
	|	'\n'
	|	'\r')
		{ _ttype = Token.SKIP; }
	;

LPAREN:	'('
	;

RPAREN:	')'
	;

STAR:	'*'
	;

PLUS:	'+'
	;

SEMI:	';'
	;

INT	:	('0'..'9')+
	;

The trees generated by this recognizer are simple expression trees. For example, input "3*4+5" results in a tree of the form #( + ( * 3 4 ) 5 ). In order to build a tree walker for trees of this form, you have to describe its structure recursively to ANTLR:


class CalcTreeWalker extends TreeParser;

expr	:	#(PLUS expr expr)
	|	#(STAR expr expr)
	|	INT
	;

Once the structure has been specified, you are free to embed actions to compute the appropriate result. An easy way to accomplish this is to have the expr rule return an integer result and then have each alternative compute the subresult for each subtree. The following tree grammar and actions produces the desired effect:


class CalcTreeWalker extends TreeParser;

expr returns [int r]
{
	int a,b;
	r=0;
}
	:	#(PLUS a=expr b=expr) {r = a+b;}
	|	#(STAR a=expr b=expr) {r = a*b;}
	|	i:INT		      {r = Integer.parseInt(i.getText());}
	;

Notice that no precedence specification is necessary when computing the result of an expression--the structure of the tree encodes this information. That is why intermediate trees are much more than copies of the input in tree form. The input symbols are indeed stored as nodes in the tree, but the structure of the input is encoded as the relationship of those nodes.

The code needed to launch the parser and tree walker is:


import java.io.*;
import antlr.CommonAST;
import antlr.collections.AST;

class Calc {
    public static void main(String[] args) {
        try {
            CalcLexer lexer =
                new CalcLexer(new DataInputStream(System.in));
            CalcParser parser = new CalcParser(lexer);
            // Parse the input expression
            parser.expr();
            CommonAST t = (CommonAST)parser.getAST();
            // Print the resulting tree out in LISP notation
            System.out.println(t.toStringList());
            CalcTreeWalker walker = new CalcTreeWalker();
            // Traverse the tree created by the parser
            int r = walker.expr(t);
            System.out.println("value is "+r);
        } catch(Exception e) {
            System.err.println("exception: "+e);
        }
    }
}

Transformations

While tree parsers are useful to examine trees or generate output from a tree, they must be augmented to handle tree transformations. ANTLR tree parsers support the buildAST option just like regular parsers; this is analogous to the transform mode of SORCERER. Without programmer intervention, the tree parser will automatically copy the input tree to a result tree. Each rule has an implicit (automatically defined) result tree; the result tree of the start symbol can be obtained from the tree parser via the getAST method. The various alternatives and grammar elements may be annotated with "!" to indicate that they should not be automatically linked into the output tree. Portions of, or entire, subtrees may be rewritten.

Actions embedded within the rules can set the result tree based upon tests and tree constructions. See the section on grammar action translations.

An Example Tree Transformation

Revisiting the simple Calc example from above, we can perform a few tree transformations instead of computing the expression value. The action in the following tree grammar optimizes away the addition identity operation (addition to zero).


class CalcTreeWalker extends TreeParser;
options{
    buildAST = true;	// "transform" mode
}

expr:!  #(PLUS left:expr right:expr)
        // '!' turns off auto transform
        {
            // x+0 = x
            if ( #right.getType()==INT &&
                 Integer.parseInt(#right.getText())==0 )
            {
                #expr = #left;
            }
            // 0+x = x
            else if ( #left.getType()==INT &&
                      Integer.parseInt(#left.getText())==0 )
            {
                #expr = #right;
            }
            // x+y
            else {
                #expr = #(PLUS, left, right);
            }
        }
    |   #(STAR expr expr)  // use auto transformation
    |   i:INT
    ;

The code to launch the parser and tree transformer is:


import java.io.*;
import antlr.CommonAST;
import antlr.collections.AST;

class Calc {
    public static void main(String[] args) {
        try {
            CalcLexer lexer =
                new CalcLexer(new DataInputStream(System.in));
            CalcParser parser = new CalcParser(lexer);
            // Parse the input expression
            parser.expr();
            CommonAST t = (CommonAST)parser.getAST();
            // Print the resulting tree out in LISP notation
            System.out.println(t.toLispString());

            CalcTreeWalker walker = new CalcTreeWalker();
            // Traverse the tree created by the parser
            walker.expr(t);
            // Get the result tree from the walker
            t = (CommonAST)walker.getAST();
            System.out.println(t.toLispString());
        } catch(Exception e) {
            System.err.println("exception: "+e);
        }
    }
}

Examining/Debugging ASTs

Often when developing a tree parser, you will get parse errors. Unfortunately, your trees are usually very large, making it difficult to determine where your AST structure error is. To help the situation (I found it VERY useful when building the Java tree parser), I created an ASTFrame class (a JFrame) that you can use to view your ASTs in a Swing tree view. It does not copy the tree, but uses a TreeModel. Run antlr.debug.misc.ASTFrame as an application to test it out or see the new Java grammar Main.java. I am not sure it will live in the same package as I'm not sure how debugging etc... will shake out with future ANTLR versions. Here is a simple example usage:

public static void main(String args[]) {
  // Create the tree nodes
  ASTFactory factory = new ASTFactory();
  CommonAST r = (CommonAST)factory.create(0, "ROOT");
  r.addChild((CommonAST)factory.create(0, "C1"));
  r.addChild((CommonAST)factory.create(0, "C2"));
  r.addChild((CommonAST)factory.create(0, "C3"));

  ASTFrame frame = new ASTFrame("AST JTree Example", r);
  frame.setVisible(true);
}

Version: $Id: //depot/code/org.antlr/release/antlr-2.7.6/doc/sor.html#1 $