Pure Danger Tech


navigation
home

Implementing a scripting language with Antlr (Part 3: Tree Walker)

17 Jan 2007

In Part 1 and Part 2 of this series, we focused on writing the lexer and parser for our scripting language. In this entry, we focus on writing a tree walking grammar to build custom objects from our AST. When we closed the previous entry, we had modified our parser to generate the AST we wished to build but were reviewing our options for actually building our own objects from that AST.

Before we dive into that, let’s first step back and write some simple objects that will hold the output state of this transformation. The objects are Script, which holds a series of named Blocks, each with a list of Commands, which have a name and properties. In the interest of brevity, I’m not going to show the code here (see the links at the end). I defined a toString() on each object that should reproduce a valid script (we’ll use that for testing later).

Tree grammar rules are written with a lisp-ish notation #(node child1 child2 …). This will match a node (based on token) if it has children child1, child2, etc. Deeper structures can be embedded within this.

Let’s build this tree grammar from the top down. First we’ll start with the script node itself. We’ll look for a SCRIPT token with child tokens that are BLOCKs.

class ScriptWalker extends TreeParser;

	script returns [Script script]
	{
	    script = new Script();
	}
	    :  #( SCRIPT (block[script] )* );

This tree parser has a single rule (“script”) which matches the SCRIPT node, then matches any number of “block” sub-rules. We see some new notation here as well. The script rule has a returns [Script script] section that defines the return type of this rule (and the generated Java method). The [ ] are used instead of ( ) as the ( ) is used as a grouping construct in the rule definitions instead. Then we see a code block or action in { } that initializes the Script instance. Within the rule itself, we see a reference to the block rule and the script variable is passed as an argument to the rule. The script variable will collect Block objects as they are found.

Let’s continue defining the tree grammar by defining the block rule:

block [Script script] 
	{
	    Block block = null;
	}
	    :  #(BLOCK
		     (
		        b:STRING  {  block = new Block(b.getText()); } 
			(command[block])*
		    )
		)
		{ script.addBlock(block); }
		;

Here the block rule takes a parameter in [ ], which is the Script to add to. We look for a BLOCK token with a child block name (the STRING token) and any number of commands, as defined in a separate command rule.

Let’s finish it out with the command and property rules:

command [Block block]
	{
	    Command command = null;
	}
	    :  #(COMMAND
		    (
		        c:STRING 
                        { 
			    command = new Command(c.getText()); 
			    block.addCommand(command);
			}
			(property[command])*
		    )
		);
	
	property [Command command] 
		: #( PROPERTY
			(
		    	    p1:STRING (p2:STRING)?
			    {
				if(p2 == null) {
				    command.addProperty(p1.getText());
				} else {
			   	    command.addProperty(p1.getText(), p2.getText());
				}
			    }
			)
		);

These should now be pretty familiar. One note is on the property rule – the second STRING child node is optional (only there if using the key=value syntax) and that’s indicated with the ?, just as in the other grammars.

One item of interest here is what happens if the string is a quoted string. In that case, the quotes will be included in the token. So, we need to strip the quotes if they exist. To do this, you can add an arbitrary chunk of code after the definition of the ScriptWalker that defines a helper method:

class ScriptWalker extends TreeParser;

{
    private String removeQuotes(String value) {
        if(value != null && value.startsWith("\"")) {
            return value.substring(1, value.length()-1);
        }
        return value;
    }
}

And then we just wrap a call to removeQuotes() around each call to getText() in the prior functions. And that’s it! We’re all done with our ScriptWalker. Here’s a dumb test class to try parsing the file into an AST, then running the walker to generate objects, then print the objects back out as a script:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import antlr.CommonAST;

public class TestWalker {

    public static void main(String args[]) throws Exception {
        System.out.println("Walking: " + args[0]);
        
        Reader reader = new BufferedReader(new FileReader(args[0]));
        ScriptLexer lexer = new ScriptLexer(reader);
        ScriptParser parser = new ScriptParser(lexer);

        parser.script();
        CommonAST t = (CommonAST)parser.getAST();
        
        // Traverse the tree created by the parser
        ScriptWalker walker = new ScriptWalker();
        Script script = walker.script(t);
        System.out.println(script);
    }
}

I’ve defined my tree parser in the same file along with my lexer and parser, so I need to call antlr.Tool again to regenerate everything. You’ll see a new generated file ScriptWalker.java show up. Compile, then run java TestWalker test.script to try it out:

Walking: src/test.script
morning {
  wake_up;
  eat bacon;
  brush_teeth;
  wash hair=true minutes=10;
}

evening {
  read book="How To Be a Ninja";
  brush_teeth;
  sleep hours=8;
}

Success!! We can now move on to writing an interpreter for this generic scripting language. For the complete lexer, parser, and tree parser, see script.g in the file list below. Everything is in one file and even with comments, it’s not too long.

For a language as simple as this, you might wonder why you would go to the trouble of learning and using a tool like Antlr or JavaCC. For one reason, that code is tedious, error-prone, and very likely to be a maintenance nightmare. For another, languages often start simple and increase in complexity with new requirements. Generally speaking, a simple change in the language will result in a simple change in the grammar. With a hand-written parser, often a simple change in the language can be a daunting task.

Hopefully you had fun working through this example with me. I’m considering writing a follow-up series to show the actual interpreter code as well. If you would be interested in such a series, drop me a note by email (see upper-right corner) or comment. Thanks for reading!

Links to the rest of the series: