ANTLR Specification: Vocabularies

Token Vocabularies

Every grammar specifies language structure with rules (substructures) and vocabulary symbols. These symbols are equated with integer "token types" for efficient comparison at run-time. The files that define this mapping from symbol to token type are fundamental to the execution of ANTLR and ANTLR-generated parsers. This document describes the files used and generated by ANTLR plus the options used to control the vocabularies.

Introduction

A parser grammar refers to the tokens in its vocabulary by symbol. At parse time, each symbol corresponds to Token objects produced by the lexer or another token stream. The parser compares the unique integer token type assigned to each symbol against the token type stored in the token objects. If the parser is looking for token type 23 but finds that the first lookahead token's type, LT(1).getType(), is not 23, the parser throws MismatchedTokenException.
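The comparison described above can be sketched in plain Java. This is only an illustration: SketchToken, SketchMismatchException, and MatchSketch are hypothetical stand-ins rather than ANTLR runtime classes, and the exception is unchecked here purely to keep the sketch short.

```java
// Hypothetical stand-ins for illustration; these are not the ANTLR runtime classes.
final class SketchToken {
    private final int type;
    private final String text;

    SketchToken(int type, String text) {
        this.type = type;
        this.text = text;
    }

    int getType() { return type; }
    String getText() { return text; }
}

// Unchecked only to keep the sketch short (an assumption of this example).
final class SketchMismatchException extends RuntimeException {
    SketchMismatchException(int expected, SketchToken found) {
        super("expecting token type " + expected
              + ", found type " + found.getType()
              + " ('" + found.getText() + "')");
    }
}

final class MatchSketch {
    // Compare the expected token type against the lookahead token's type.
    static void match(int expected, SketchToken lookahead) {
        if (lookahead.getType() != expected) {
            throw new SketchMismatchException(expected, lookahead);
        }
        // A real parser would consume the token at this point.
    }
}
```

Calling MatchSketch.match(23, new SketchToken(23, "x")) returns normally, while passing a token of any other type raises the sketch exception. Only integers are compared, which is why the symbol-to-type mapping must agree between the lexer and the parser.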

A grammar may have an import vocabulary and always has an export vocabulary, which can be referenced by other grammars. Imported vocabularies are never modified and represent the "initial condition" of the vocabulary. Do not confuse importVocabular

The following represent the most common questions:

How does ANTLR decide which vocabulary symbol gets what token type?

Each grammar has a token manager that manages a grammar's export vocabulary. The token manager can be preloaded with symbol / token type pairs by using the grammar importVocab option. The option forces ANTLR to look for a file with mappings that look like:

PLUS=44

Without the importVocab option, the grammar's token manager is empty (with one caveat you will see later).

Any token referenced in your grammar that does not have a predefined token type is assigned a type in the order encountered. For example, in the following grammar, tokens A and B will be 4 and 5, respectively:

class P extends Parser;
a : A B ;

Vocabulary file names are of the form: NameTokenTypes.txt.

Why do token types start at 4?

Because ANTLR needs some special token types during analysis. User-defined token types must begin after 3.
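As a rough model of the assignment rule, the following Java sketch hands out token types in order of first reference, starting at 4. TokenManagerSketch and its methods are hypothetical names, not ANTLR's internal token manager, and the choice to continue numbering after the highest preloaded type is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a token manager; not ANTLR's internal class.
final class TokenManagerSketch {
    // Types up to 3 are reserved, so user-defined types begin at 4.
    static final int MIN_USER_TYPE = 4;

    private final Map<String, Integer> types = new LinkedHashMap<>();
    private int nextType = MIN_USER_TYPE;

    // Preload a symbol / token type pair, as an imported vocabulary would.
    // Assumption: new types continue after the highest preloaded type.
    void preload(String symbol, int type) {
        types.put(symbol, type);
        nextType = Math.max(nextType, type + 1);
    }

    // Return the type for a symbol, assigning the next free type on first reference.
    int reference(String symbol) {
        Integer existing = types.get(symbol);
        if (existing != null) {
            return existing;
        }
        int assigned = nextType++;
        types.put(symbol, assigned);
        return assigned;
    }
}
```

With an empty manager, referencing A and then B yields 4 and 5, matching the grammar P above. After preload("PLUS", 44), a reference to PLUS returns 44 rather than a freshly assigned type.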

What files associated with vocabulary does ANTLR generate?

ANTLR generates VTokenTypes.txt and VTokenTypes.java for vocabulary V, where V is either the name of the grammar or the name specified in an exportVocab=V option. The text file is a sort of "freeze-dried" token manager: it holds the persistent state ANTLR needs so that a grammar in a different file can see a grammar's vocabulary, including string literals. The Java file is an interface containing the token type constant definitions. Generated parsers implement one of these interfaces to obtain the appropriate token type definitions.

How does ANTLR synchronize the symbol-type mappings between grammars in the same file and in different files?

The export vocabulary for one grammar must become the import vocabulary for another or the two grammars must share a common import vocabulary.

Imagine a parser P in p.g:

// yields PTokenTypes.txt
class P extends Parser;
// options {exportVocab=P;} ---> default!
decl : "int" ID ;

and a lexer L in l.g

class L extends Lexer;
options {
  importVocab=P; // reads PTokenTypes.txt
}
ID : ('a'..'z')+ ;

ANTLR generates LTokenTypes.txt and LTokenTypes.java even though L is primed with values from P's vocabulary.

Grammars in different files that must share the same token type space should use the importVocab option to preload the same vocabulary.

If these grammars are in the same file, ANTLR behaves in exactly the same way. However, you can get the two grammars to share the vocabulary (allowing them both to contribute to the same token space) by setting their export vocabularies to the same vocabulary name. For example, with P and L in one file, you can do the following:

// yields PTokenTypes.txt
class P extends Parser;
// options {exportVocab=P;} ---> default!
decl : "int" ID ;

class L extends Lexer;
options {
  exportVocab=P; // shares vocab P
}
ID : ('a'..'z')+ ;

If you leave off the vocab options from L, it will choose to share the first export vocabulary in the file; in this case, it will share P's vocabulary.

// yields PTokenTypes.txt
class P extends Parser;
decl : "int" ID ;

// shares P's vocab
class L extends Lexer;
ID : ('a'..'z')+ ;

The token type mapping file looks like this:

P    // exported token vocab name
LITERAL_int="int"=4
ID=5
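To make the file layout concrete, here is a small Java sketch that reads text in the form shown above into a symbol-to-type map. VocabFileSketch is a hypothetical helper, not the loader ANTLR itself uses, and it handles only the two line forms shown (SYMBOL=type and SYMBOL="literal"=type). It assumes that literals do not contain "//".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the mapping file layout shown above; not ANTLR's own loader.
final class VocabFileSketch {
    // Parse file contents into symbol -> token type, skipping the vocabulary name line.
    static Map<String, Integer> parse(String contents) {
        Map<String, Integer> types = new LinkedHashMap<>();
        boolean sawName = false;
        for (String raw : contents.split("\n")) {
            String line = stripComment(raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!sawName) {
                sawName = true; // the first non-empty line names the vocabulary
                continue;
            }
            int firstEq = line.indexOf('=');
            int lastEq = line.lastIndexOf('=');
            if (firstEq < 0) {
                continue; // not a mapping line
            }
            // The symbol precedes the first '='; the type follows the last '='.
            String symbol = line.substring(0, firstEq).trim();
            int type = Integer.parseInt(line.substring(lastEq + 1).trim());
            types.put(symbol, type);
        }
        return types;
    }

    // Assumption: "//" never appears inside a literal.
    private static String stripComment(String line) {
        int i = line.indexOf("//");
        return i < 0 ? line : line.substring(0, i);
    }
}
```

Applied to the three lines above, the sketch produces a map with LITERAL_int mapped to 4 and ID mapped to 5.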

Grammar Inheritance and Vocabularies

Grammars that extend supergrammars inherit rules, actions, and options, but what token vocabulary does the subgrammar use? ANTLR sees the subgrammar as if you had cut and pasted all of the nonoverridden rules of the supergrammar into the subgrammar, like an include. Therefore, the set of tokens in the subgrammar is the union of the tokens defined in the subgrammar and in the supergrammar. All grammars export a vocabulary file, so the subgrammar will export and use a different vocabulary than the supergrammar. The subgrammar always imports the vocabulary of the supergrammar unless you override it with an importVocab option in the subgrammar.

A grammar Q that extends P primes its vocabulary with P's vocabulary as if Q had specified option importVocab=P. For example, the following grammar has two token symbols.

class P extends Parser;
a : A Z ;

The subgrammar, Q, initially has the same vocabulary as the supergrammar, but may add additional symbols.

class Q extends P;
f : B ;

In this case, Q defines one more symbol, B, yielding a vocabulary for Q of {A,Z,B}.

The vocabulary of a subgrammar is always a superset of the supergrammar's vocabulary. Note that overriding rules does not affect the initial vocabulary.
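The superset relationship can be modeled with a short Java sketch. InheritedVocabSketch is a hypothetical helper, and the numbering of new symbols after the highest inherited type is an assumption of the sketch. It copies the supergrammar's vocabulary, which is never modified, and then adds any symbols the subgrammar introduces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a subgrammar priming its vocabulary from the supergrammar.
final class InheritedVocabSketch {
    // Build the subgrammar vocabulary: a copy of the supergrammar's plus new symbols.
    static Map<String, Integer> extend(Map<String, Integer> superVocab, String... newSymbols) {
        // Copy so that the imported vocabulary itself is left untouched.
        Map<String, Integer> sub = new LinkedHashMap<>(superVocab);
        int next = 4; // user-defined token types begin at 4
        for (int type : superVocab.values()) {
            next = Math.max(next, type + 1);
        }
        for (String symbol : newSymbols) {
            if (!sub.containsKey(symbol)) {
                sub.put(symbol, next++);
            }
        }
        return sub;
    }
}
```

Starting from P's symbols A=4 and Z=5, extend(p, "B") gives a vocabulary containing A, Z, and B, while the map for P still holds only A and Z.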

If your subgrammar requires new lexical structures, unused by the supergrammar, you probably need to have the subparser use a sublexer. Override the initial vocabulary with an importVocab option that specifies the vocabulary of the sublexer. For example, assume parser P uses PL as a lexer. Without an importVocab override, Q's vocabulary would use P's vocab and, consequently, PL's vocabulary. If you would like Q to use token types from another lexer, say QL, do the following:

class Q extends P;
options {
  importVocab=QL;
}
f : B ;

Q's vocab will now be the same as, or a superset of, QL's vocabulary.

Recognizer Generation Order

If all of your grammars are in one file, you do not have to worry about which grammar file ANTLR should process first. However, you still need to worry about the order in which ANTLR sees the grammars within the file. If you try to import a vocabulary that will be exported by a grammar later in the file, ANTLR will complain that it cannot load the file. The following grammar file will cause ANTLR to fail:

class P extends Parser;
options {
  importVocab=L;
}

a : "int" ID;

class L extends Lexer;
ID : 'a';

ANTLR will complain that it cannot find LTokenTypes.txt because it has not seen grammar L yet in the grammar file. On the other hand, if you happened to have LTokenTypes.txt lying around (from a previous run of ANTLR on the grammar file when P did not exist?), ANTLR will load it for P and then overwrite it again for L. ANTLR must assume that you want to load a vocabulary generated from another file, as it cannot know which grammars come later, even in the same file.

In general, if you want grammar B to use token types from grammar A (regardless of grammar type), then you must run ANTLR on grammar A first. So, for example, a tree grammar that uses the vocabulary of the parser grammar should be run after ANTLR has generated the parser.

When you want a parser and lexer, for example, to share the same vocabulary space, all you have to do is place them in the same file with their export vocabs pointing at the same place. If they are in separate files, have the parser's import vocab set to the lexer's export vocab unless the parser is contributing lots of literals. In this case, reverse the import/export relationship so the lexer uses the export vocabulary of the parser.

Tricky Vocabulary Stuff

What if your grammars are in separate files and you still want them to share all or part of a token space? There are two solutions: (1) have the grammars import the same vocabulary or (2) have the grammars all inherit from the same base grammar that contains the common token space.

The first solution applies when you have two lexers and two parsers that must parse radically different portions of the input. The example in examples/java/multiLexer of the ANTLR 2.6.0 distribution is such a situation. The javadoc comments are parsed with a different lexer/parser than the regular Java portion of the input. The "*/" terminating comment lexical structure is necessarily recognized by the javadoc lexer, but it is natural to have the Java parser enclose the launch of the javadoc parser with open/close token references:

javadoc
  : JAVADOC_OPEN
    {
    DemoJavaDocParser jdocparser =
      new DemoJavaDocParser(getInputState());
    jdocparser.content();
    }
    JAVADOC_CLOSE
  ;

The problem is that the javadoc lexer defines JAVADOC_CLOSE and hence defines its token type. Unfortunately, the vocabulary of the Java parser is based upon the Java lexer, not the javadoc lexer. To get the javadoc lexer and Java lexer to both see JAVADOC_CLOSE (and have the same token type), have both lexers import a vocabulary file that contains this token type definition. Here are the heads of DemoJavaLexer and DemoJavaDocLexer:

class DemoJavaLexer extends Lexer;
options {
  importVocab = Common;
}
...

class DemoJavaDocLexer extends Lexer;
options {
  importVocab = Common;
}
...

CommonTokenTypes.txt contains:

Common // name of the vocab
JAVADOC_CLOSE=4

The second solution to vocabulary sharing applies when you have, say, one parser and three different lexers (e.g., for various flavors of C). If you only want one parser for space efficiency, then the parser must see the vocabulary of all three lexers and prune out the unwanted structures grammatically (probably with semantic predicates). Given CLexer, GCCLexer, and MSCLexer, make CLexer the supergrammar and have CLexer define the union of all tokens. For example, if MSCLexer needs "_int32", then reserve a token type visible to all lexers in CLexer:

tokens {
  INT32;
}

Then, in MSCLexer, you can attach a literal to it:

tokens {
  INT32="_int32";
}

In this manner, the lexers will all share the same token space allowing you to have a single parser recognize input for multiple C variants.

Version: $Id: //depot/code/org.antlr/release/antlr-2.7.6/doc/vocab.html#1 $