Implementing a scripting language with Antlr (Part 2: Parser)

7

In Part 1 I focused on the Lexer that tokenizes the character stream and produces a stream of tokens that our parser can operate on. In this post I’ll focus instead on building the parser that actually parses the language.

When using Antlr, the parser is defined similarly to the lexer and is often included in the same file. Often, you’ll see the parser defined first and the lexer defined last. I prefer to do it the opposite way as the file then builds on already built constructs.

The parser definition consists of a set of rules. Each rule specifies a non-terminal (which must start with a lowercase letter) and a set of tokens or other non-terminals that can replace the specified non-terminal symbol. Each rule will be translated by Antlr into a method in the generated code. Here’s the most basic translation of the grammar described in Part 1.

class ScriptParser extends Parser;

	options {
		k=2;     // Need lookahead of two for
                           // props without keys (to check for the =)
		buildAST=true; 	  // Automatically build the AST while parsing
	}

	script        : (block)* EOF;
	block         : (STRING LBRACE (command)* RBRACE);
	command       : (STRING (property)* SEMI);
	property      : ( (STRING EQUALS)? STRING );

Similar to the Lexer, there is an options block. In the lexer, a lookahead of one character was sufficient. Here, k refers to the number of tokens used in the lookahead instead and we must use a lookahead of 2 because of the two kinds of properties. Specifically, when we see the first string in the property rule, we need to look ahead an extra token to know whether we are seeing a=b or just a.

Let’s generate the parser. Once again we run java antlr.Tool parser.g. Because we have our lexer and parser both defined in the same file, we’ll see the Lexer get regenerated and a new file ScriptParser.java will be generated as well.

We can then write a dumb little test class to run the parser on a grammar file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import antlr.CommonAST;
import antlr.DumpASTVisitor;

public class TestParser {

    public static void main(String[] args) throws Exception {
        System.out.println("Parsing: " + args[0]);

        Reader reader = new BufferedReader(new FileReader(args[0]));
        ScriptLexer lexer = new ScriptLexer(reader);
        ScriptParser parser = new ScriptParser(lexer);
        parser.script();

        CommonAST ast = (CommonAST)parser.getAST();
        DumpASTVisitor visitor = new DumpASTVisitor();
        visitor.visit(ast);
    }
}

I used the provided helper class DumpASTVisitor to dump the AST to the screen. Notice that to do the parsing we call parser.script(). This indicates that the parser should begin executing the rules of the parser (using the lexer provided in the constructor) starting with the the script rule, which is our start rule. Antlr considers a rule that does not get called by any other rule to be a start rule.

Let’s run this test class on our test.script file:

Parsing: src/test.script
morning [4]
{ [7]
wake_up [4]
; [10]
eat [4]
bacon [4]
; [10]
brush_teeth [4]
; [10]
wash [4]
hair [4]
= [9]
true [4]
minutes [4]
= [9]
10 [4]
; [10]
} [8]
evening [4]
{ [7]
read [4]
book [4]
= [9]
"How To Be a Ninja" [4]
; [10]
brush_teeth [4]
; [10]
sleep [4]
hours [4]
= [9]
8 [4]
; [10]
} [8]

What we see here is an AST that is just a flat list of tokens. So, while Antlr built the tree for us, it didn’t provide any useful structure. To me, it seems like it would be useful for the default AST to insert a node as each rule is processed so the AST you got for free would mirror your grammar. But, that isn’t the case. So, on we go…

Antlr provides two built-in operators that allow you to indicate how the AST should be built within your parser. The ! symbol indicates that the token being matched should not be included in the AST at all. This is commonly used with punctuation tokens. The ^ symbol indicates that in the current rule, the specified token should be used as a root with all other tokens beneath it. This is very commonly used when parsing expression trees (arithmetic with +,-,/,* or logic with AND/OR, etc) where we want to make the arithmetic operator the root.

Let’s start by adding the ! marker to leave out our punctuation. We’ll change just these lines by adding ! to punctuation tokens:

	script        : (block)* EOF!;
	block         : (STRING LBRACE! (command)* RBRACE!);
	command       : (STRING (property)* SEMI!);
	property      : ( (STRING EQUALS!)? STRING );

Then we can regenerate our parser and re-run our test:

Parsing: src/test.script
morning [4]
wake_up [4]
eat [4]
bacon [4]
brush_teeth [4]
wash [4]
hair [4]
true [4]
minutes [4]
10 [4]
evening [4]
read [4]
book [4]
"How To Be a Ninja" [4]
brush_teeth [4]
sleep [4]
hours [4]
8 [4]

Now we’ve lost all the punctuation tokens! Next up is to create a more tree-like structure. What we really want is to create a tree where the root is the script itself and an internal node exists for each block, command, and property.

The ^ operator is useful when the token you want as a node in the tree actually exists in your parser rule. But when it doesn’t, you need to create an “imaginary token” and insert that as a root for your rule. This is a common idiom in Antlr and involves some black magic that is explained a bit here. Here is the modified parser with the additional code:

class ScriptParser extends Parser;

	options {
		k=2; 		    // Need lookahead of two for props without keys (to check for the =)
		buildAST=true; 	// Automatically build the AST while parsing
	}

	tokens {
		SCRIPT;		// Imaginary token inserted at the root of the script
		BLOCK;		// Imaginary token inserted at the root of a block
		COMMAND;     // Imaginary token inserted at the root of a command
		PROPERTY;     // Imaginary token inserted at the root of a property
	}

	script : (block)* EOF!
		{#script = #([SCRIPT, "SCRIPT"], #script);};

	block : (STRING LBRACE! (command)* RBRACE!)
		{#block = #([BLOCK, "BLOCK"], #block);};

	command : (STRING (property)* SEMI!)
		{#command = #([COMMAND, "COMMAND"], #command);};

	property : ( (STRING EQUALS!)? STRING)
		{#property = #([PROPERTY, "PROPERTY"], #property);};

First, you’ll see a new TOKENS block. This block is usually used to declare string literal tokens as a shortcut in the parser. In this case we declare the token without declaring a string for it; this creates an imaginary token (one that has no representation in the token stream coming from the lexer).

Then you’ll see that each rule has a new block of magic code underneath it. From what I gather, the #property (in the last rule) specifies the AST produced by the property rule we’re in. We are then redefining that rule with a new AST, specified using the #(A, B, …) notation which indicates that we should a node A with children B, … In this case, we construct a new node of type PROPERTY (our imaginary token) and add as children all of the tokens that were collected into #property.

So, we then re-generate the parser and re-run our TestParser test to get the following hierarchical output:

Parsing: src/test.script
SCRIPT [11]
   BLOCK [12]
      morning [4]
      COMMAND [13]
         wake_up [4]
      COMMAND [13]
         eat [4]
         PROPERTY [14]
            bacon [4]
      COMMAND [13]
         brush_teeth [4]
      COMMAND [13]
         wash [4]
         PROPERTY [14]
            hair [4]
            true [4]
         PROPERTY [14]
            minutes [4]
            10 [4]
   BLOCK [12]
      evening [4]
      COMMAND [13]
         read [4]
         PROPERTY [14]
            book [4]
            "How To Be a Ninja" [4]
      COMMAND [13]
         brush_teeth [4]
      COMMAND [13]
         sleep [4]
         PROPERTY [14]
            hours [4]
            8 [4]

So (you may be asking), what good is an AST when I’m writing my program? Good freaking question. The AST is an object model that you can use to drive an interpreter, collect information or do many useful things with. However, writing code to walk through an AST is boring (and error prone and just gross). Some better options are:

  1. Write an AST visitor – Antlr has a visitor framework built in to support this
  2. Construct and return your own object structure within the parser – we haven’t used this capability but it is possible to pass objects into the parser rules and return objects from the parser rules. This allows you to collect information during the parse itself and return your own objects from the parser.
  3. Write a tree grammar – another option is to write what Antlr calls a tree grammar that specifies how to walk the AST and perform operations or construct objects.

I’ve done both 2 and 3 in different situations. #2 is attractive as it builds your objects write into the parser and you can effectively build a custom AST with your own objects (and your own behavior). I find this approach maps pretty well.

But I have to say that #3 seems like a better approach because it decouples the AST building from how an AST is used. If you focus on just building a valid AST in your parser, you can then decouple any logic from the parser and move it into a tree grammar. That gives you the freedom to write multiple tree grammars for different purposes (one for printing, one for interpreting, one for stats generation, etc). If you have multiple actions to take against your AST, this feels like the right choice to me.

So, we’ll pause again at this point and save the tree grammar for Part 3. Here are the files used in this post:

Links to the rest of the series:

Comments

7 Responses to “Implementing a scripting language with Antlr (Part 2: Parser)”
  1. Jono says:

    Nice tutorial. I am looking at in, in fact to work with parsing Java source. I just want to count the number of function calls, if statement, etc. You show a way to print tabbed output with the rule names. I am wondering if there is a way to get those rule names printed without having to write something explicitly into each rule.

  2. M. Shekhar says:

    Hi Alex,
    I really liked your article. However I found a couple of issues which I thought you could help me out with. I’m still new at Antlr, so pl. bear with me.

    When I entered any of the ‘{‘, ‘}’, ‘=’ or ‘;’ tokens at the end of any block in the test script, it does NOT throw any syntax errors and I found really it puzzling. For e.g.

    # What to do in the morning
    morning {
    wake_up;
    eat bacon;
    brush_teeth;
    wash hair=true minutes=10;
    } ; // NO ERROR HERE THROWN HERE

    # What to do in the evening
    evening {
    read book=”How To Be a Ninja”;
    brush_teeth;
    sleep hours=8;
    }

    But if I replace the above ‘;’ with characters other than those defined in the “// Punctuation” section in the lexer definition, then antlr does throw an error. Could you pl. point out to me what the problem is, and how would you resolve it?

    Thanks a bunch,
    Manju

  3. Alex says:

    Hi Manju,

    Thanks for pointing this out! As it turns out, I forgot something critical in the parser definition.

    But first, the reason you’re seeing an error with characters other than those in the punctuation section is that these are not valid according to the lexer, so the lexer is throwing an error.

    In the case of the extra semicolon, the lexer does understand the token, so no error is thrown (by the lexer). If you look at the output of the TestParser program with your example script, you’ll notice that it prints only the first block, which was my first big clue to my mistake. The key is that I didn’t tell Antlr to parse the *whole* input – so instead Antlr just consumed input and matched it as asked until it was no longer valid, in which case it just stopped, leaving a slew of tokens unparsed.

    Obviously, this is not what we want. To fix this, we need to add the pre-defined EOF token to the end of our script rule – this tells Antlr that we expect to match the entire input stream and end with EOF:

    script : (block)* EOF

    Then if we run with your script we will see the expected error:

    Parsing: test/puredanger/parser/comment.script
    line 7:5: expecting EOF, found ‘;’

    Thanks again and I’ll update the article appropriately.

  4. M Shekhar says:

    Hi Alex, I’m glad I brought it to your attention. Thanks for the explanation. Also, I had a couple of requests. Perhaps they might need seperate articles of their own.

    Firstly, I was wondering if you could discuss how to properly handle errors and report them(line no., module, etc). instead of the cryptic ‘exception thrown’ kind of errors. I’ve seen a section on error recovery in the antlr’s documentation, but it’s such a mind-bender :-) I’m still new at all this, so pl. excuse my ignorance.

    Secondly, what’s the best way to parse included modules, e.g. if a script includes another script(a la C includes), I would like to parse
    the included file first. Any links to articles on this subject would be great too.

    Thanks again
    Manju
    -manju

  5. Alex says:

    Regarding error handling, I posted about recovering line and column numbers in an AST earlier. Beyond that, I have not done much yet with error recovery in antlr.

    Antlr does support the integration of multiple grammars and even grammar extension through the notion of “vocabularies”. You might check out the antlr doc on Vocabularies for some more info.

  6. M Shekhar says:

    Thanks Alex, that helps.

  7. M Shekhar says:

    Alex, I took at look at Vocubularies and seems like that is not what I’m looking for. My problem is similar to this thread that someone had started sometime back, but apparently had no responses.

    http://www.antlr.org:8080/pipermail/antlr-interest/2005-September/013603.html

    Let me know, if you have any pointers to this problem.

    Thanks,
    manju

Speak Your Mind

Tell us what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!