Implementing a scripting language with Antlr (Part 2: Parser)

In Part 1 I focused on the Lexer that tokenizes the character stream and produces a stream of tokens that our parser can operate on. In this post I’ll focus instead on building the parser that actually parses the language.

When using Antlr, the parser is defined similarly to the lexer and is often included in the same file. Often, you’ll see the parser defined first and the lexer defined last. I prefer to do it the opposite way as the file then builds on already built constructs.

The parser definition consists of a set of rules. Each rule specifies a non-terminal (which must start with a lowercase letter) and a set of tokens or other non-terminals that can replace the specified non-terminal symbol. Each rule will be translated by Antlr into a method in the generated code. Here’s the most basic translation of the grammar described in Part 1.

class ScriptParser extends Parser;

	options { 
		k=2;     // Need lookahead of two for 
                           // props without keys (to check for the =)
		buildAST=true; 	  // Automatically build the AST while parsing
	}
	
	script        : (block)* EOF; 
	block         : (STRING LBRACE (command)* RBRACE);
	command       : (STRING (property)* SEMI);	
	property      : ( (STRING EQUALS)? STRING );

Similar to the Lexer, there is an options block. In the lexer, a lookahead of one character was sufficient. Here, k refers to the number of tokens used in the lookahead instead and we must use a lookahead of 2 because of the two kinds of properties. Specifically, when we see the first string in the property rule, we need to look ahead an extra token to know whether we are seeing a=b or just a.

Let’s generate the parser. Once again we run java antlr.Tool parser.g. Because we have our lexer and parser both defined in the same file, we’ll see the Lexer get regenerated and a new file ScriptParser.java will be generated as well.

We can then write a dumb little test class to run the parser on a grammar file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import antlr.CommonAST;
import antlr.DumpASTVisitor;

public class TestParser {

    public static void main(String[] args) throws Exception {
        System.out.println("Parsing: " + args[0]);
        
        Reader reader = new BufferedReader(new FileReader(args[0]));
        ScriptLexer lexer = new ScriptLexer(reader);
        ScriptParser parser = new ScriptParser(lexer);
        parser.script();

        CommonAST ast = (CommonAST)parser.getAST();
        DumpASTVisitor visitor = new DumpASTVisitor();
        visitor.visit(ast);
    }
}

I used the provided helper class DumpASTVisitor to dump the AST to the screen. Notice that to do the parsing we call parser.script(). This indicates that the parser should begin executing the rules of the parser (using the lexer provided in the constructor) starting with the the script rule, which is our start rule. Antlr considers a rule that does not get called by any other rule to be a start rule.

Let’s run this test class on our test.script file:

Parsing: src/test.script
morning [4] 
{ [7] 
wake_up [4] 
; [10] 
eat [4] 
bacon [4] 
; [10] 
brush_teeth [4] 
; [10] 
wash [4] 
hair [4] 
= [9] 
true [4] 
minutes [4] 
= [9] 
10 [4] 
; [10] 
} [8] 
evening [4] 
{ [7] 
read [4] 
book [4] 
= [9] 
"How To Be a Ninja" [4] 
; [10] 
brush_teeth [4] 
; [10] 
sleep [4] 
hours [4] 
= [9] 
8 [4] 
; [10] 
} [8]

What we see here is an AST that is just a flat list of tokens. So, while Antlr built the tree for us, it didn’t provide any useful structure. To me, it seems like it would be useful for the default AST to insert a node as each rule is processed so the AST you got for free would mirror your grammar. But, that isn’t the case. So, on we go…

Antlr provides two built-in operators that allow you to indicate how the AST should be built within your parser. The ! symbol indicates that the token being matched should not be included in the AST at all. This is commonly used with punctuation tokens. The ^ symbol indicates that in the current rule, the specified token should be used as a root with all other tokens beneath it. This is very commonly used when parsing expression trees (arithmetic with +,-,/,* or logic with AND/OR, etc) where we want to make the arithmetic operator the root.

Let’s start by adding the ! marker to leave out our punctuation. We’ll change just these lines by adding ! to punctuation tokens:

script        : (block)* EOF!;	
	block         : (STRING LBRACE! (command)* RBRACE!);
	command       : (STRING (property)* SEMI!);	
	property      : ( (STRING EQUALS!)? STRING );

Then we can regenerate our parser and re-run our test:

Parsing: src/test.script
morning [4] 
wake_up [4] 
eat [4] 
bacon [4] 
brush_teeth [4] 
wash [4] 
hair [4] 
true [4] 
minutes [4] 
10 [4] 
evening [4] 
read [4] 
book [4] 
"How To Be a Ninja" [4] 
brush_teeth [4] 
sleep [4] 
hours [4] 
8 [4]

Now we’ve lost all the punctuation tokens! Next up is to create a more tree-like structure. What we really want is to create a tree where the root is the script itself and an internal node exists for each block, command, and property.

The ^ operator is useful when the token you want as a node in the tree actually exists in your parser rule. But when it doesn’t, you need to create an “imaginary token” and insert that as a root for your rule. This is a common idiom in Antlr and involves some black magic that is explained a bit here. Here is the modified parser with the additional code:

class ScriptParser extends Parser;

	options { 
		k=2; 		    // Need lookahead of two for props without keys (to check for the =)
		buildAST=true; 	// Automatically build the AST while parsing
	}
	
	tokens {
		SCRIPT;		// Imaginary token inserted at the root of the script
		BLOCK;		// Imaginary token inserted at the root of a block
		COMMAND;     // Imaginary token inserted at the root of a command
		PROPERTY;     // Imaginary token inserted at the root of a property
	}
	
	script : (block)* EOF!
		{#script = #([SCRIPT, "SCRIPT"], #script);};
	
	block : (STRING LBRACE! (command)* RBRACE!)
		{#block = #([BLOCK, "BLOCK"], #block);};

	command : (STRING (property)* SEMI!)
		{#command = #([COMMAND, "COMMAND"], #command);};
	
	property : ( (STRING EQUALS!)? STRING)
		{#property = #([PROPERTY, "PROPERTY"], #property);};

First, you’ll see a new TOKENS block. This block is usually used to declare string literal tokens as a shortcut in the parser. In this case we declare the token without declaring a string for it; this creates an imaginary token (one that has no representation in the token stream coming from the lexer).

Then you’ll see that each rule has a new block of magic code underneath it. From what I gather, the #property (in the last rule) specifies the AST produced by the property rule we’re in. We are then redefining that rule with a new AST, specified using the #(A, B, …) notation which indicates that we should a node A with children B, … In this case, we construct a new node of type PROPERTY (our imaginary token) and add as children all of the tokens that were collected into #property.

So, we then re-generate the parser and re-run our TestParser test to get the following hierarchical output:

Parsing: src/test.script
SCRIPT [11] 
   BLOCK [12] 
      morning [4] 
      COMMAND [13] 
         wake_up [4] 
      COMMAND [13] 
         eat [4] 
         PROPERTY [14] 
            bacon [4] 
      COMMAND [13] 
         brush_teeth [4] 
      COMMAND [13] 
         wash [4] 
         PROPERTY [14] 
            hair [4] 
            true [4] 
         PROPERTY [14] 
            minutes [4] 
            10 [4] 
   BLOCK [12] 
      evening [4] 
      COMMAND [13] 
         read [4] 
         PROPERTY [14] 
            book [4] 
            "How To Be a Ninja" [4] 
      COMMAND [13] 
         brush_teeth [4] 
      COMMAND [13] 
         sleep [4] 
         PROPERTY [14] 
            hours [4] 
            8 [4]

So (you may be asking), what good is an AST when I’m writing my program? Good freaking question. The AST is an object model that you can use to drive an interpreter, collect information or do many useful things with. However, writing code to walk through an AST is boring (and error prone and just gross). Some better options are:

Write an AST visitor – Antlr has a visitor framework built in to support this
Construct and return your own object structure within the parser – we haven’t used this capability but it is possible to pass objects into the parser rules and return objects from the parser rules. This allows you to collect information during the parse itself and return your own objects from the parser.
Write a tree grammar – another option is to write what Antlr calls a tree grammar that specifies how to walk the AST and perform operations or construct objects.

I’ve done both 2 and 3 in different situations. #2 is attractive as it builds your objects write into the parser and you can effectively build a custom AST with your own objects (and your own behavior). I find this approach maps pretty well.

But I have to say that #3 seems like a better approach because it decouples the AST building from how an AST is used. If you focus on just building a valid AST in your parser, you can then decouple any logic from the parser and move it into a tree grammar. That gives you the freedom to write multiple tree grammars for different purposes (one for printing, one for interpreting, one for stats generation, etc). If you have multiple actions to take against your AST, this feels like the right choice to me.

So, we’ll pause again at this point and save the tree grammar for Part 3. Here are the files used in this post:

Links to the rest of the series:

Pure Danger Tech

Implementing a scripting language with Antlr (Part 2: Parser)