Wednesday, February 14, 2007

Recursive descent parsers

Many of us have occasional, rather than frequent, need for parsing a complicated text format (such as a programming language - I'm talking about things more complicated than, say, XML or comma separated values). So often the first step is trying to remember how parser generators work (a parser generator being a tool which takes a grammar and produces a parser, such as byacc, ANTLR, or SableCC). The latest one through this process is Martin Fowler. Through most of my career I've occasionally struggled with ANTLR and yacc, and only recently have come to the conclusion that I'm best off where many of us started: with a hand-written recursive descent parser.

If I lose you with any of the parser jargon, you might need to look at the Dragon Book, but I'm trying to keep it to a minimum.

Anyway, recursive descent is good because:

* You can accomplish everything in your Integrated Development Environment, or pre-existing ant build file. Parser generators, as with other kinds of generated code, impose extra steps every time you modify the grammar (extra manual steps and/or extra build scripting).

* Can accept a wide range of grammars. The only catch is that you need to eliminate left recursion, but in my
experience the solution of a while or do-while loop ends up being even more elegant than the left-recursive grammar you started with.

* Very nice to unit test. Your unit test can call into any production of the grammar. With most parser generators, you can only easily call the top-level production, and then you need to dig around in the tree it returns to find what you are supposed to assert on.

* Easy to understand and debug.

* Gives you complete flexibility to decide how you want to handle trees/actions/etc (often, building up domain/command objects from within the grammar will work well, with no extra tree layer. Depends on the problem I suspect).

* Provides refactoring and navigation tools for the grammar for free. What productions refer to this rule? Just hit "find usages". How do I make a separate rule from part of this rule? Just hit extract method (and you know it will still work - the ANTLR/yacc transformation which looks like extract method is not correctness-preserving in my bitter experience).

Anyway, if you are still with me and want to see how this works, the next step is probably to look at the SQL parser that I wrote this way for Mayfly: the tests (some of them) are at ParserTest and the parser is in Parser.

No comments: