Wednesday, February 14, 2007

Recursive descent parsers

Many of us have occasional, rather than frequent, need for parsing a complicated text format (such as a programming language - I'm talking about things more complicated than, say, XML or comma separated values). So often the first step is trying to remember how parser generators work (a parser generator being a tool which takes a grammar and produces a parser, such as byacc, ANTLR, or SableCC). The latest one through this process is Martin Fowler. Through most of my career I've occasionally struggled with ANTLR and yacc, and only recently have come to the conclusion that I'm best off where many of us started: with a hand-written recursive descent parser.

If I lose you with any of the parser jargon, you might need to look at the Dragon Book, but I'm trying to keep it to a minimum.

Anyway, recursive descent is good because:

* You can accomplish everything in your Integrated Development Environment, or pre-existing ant build file. Parser generators, as with other kinds of generated code, impose extra steps every time you modify the grammar (extra manual steps and/or extra build scripting).

* Accepts a wide range of grammars. The only catch is that you need to eliminate left recursion, but in my experience the resulting solution - a while or do-while loop - ends up being even more elegant than the left-recursive grammar you started with.

* Very nice to unit test. Your unit test can call into any production of the grammar. With most parser generators, you can only easily call the top-level production, and then you need to dig around in the tree it returns to find what you are supposed to assert on.

* Easy to understand and debug.

* Gives you complete flexibility to decide how you want to handle trees/actions/etc (often, building domain/command objects directly from within the grammar works well, with no extra tree layer; it depends on the problem, I suspect).

* Provides refactoring and navigation tools for the grammar for free. What productions refer to this rule? Just hit "find usages". How do I make a separate rule from part of this rule? Just hit extract method (and you know it will still work - the ANTLR/yacc transformation which looks like extract method is not correctness-preserving in my bitter experience).
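To make the left-recursion point concrete, here is a minimal sketch (my own toy example, not Mayfly's actual parser): the left-recursive rule `expr := expr '-' term | term` becomes `expr := term ('-' term)*`, and the repetition turns into a while loop in the hand-written parser.

```java
import java.util.ArrayList;
import java.util.List;

public class ExprParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    public ExprParser(String input) {
        for (String t : input.split("\\s+")) tokens.add(t);
    }

    // expr := term ( '-' term )*
    // The while loop gives left associativity, exactly what the
    // left-recursive grammar expressed.
    public int expr() {
        int value = term();
        while (pos < tokens.size() && tokens.get(pos).equals("-")) {
            pos++;               // consume '-'
            value -= term();     // left-associative: (a - b) - c
        }
        return value;
    }

    // term := NUMBER
    public int term() {
        return Integer.parseInt(tokens.get(pos++));
    }

    public static void main(String[] args) {
        // A unit test can call any production directly, not just the
        // top level: here expr(), but term() is just as testable.
        System.out.println(new ExprParser("10 - 3 - 2").expr()); // prints 5, not 9
    }
}
```

Note how the loop keeps subtraction left-associative: a naive right-recursive rewrite (`expr := term '-' expr`) would compute 10 - (3 - 2) = 9 instead.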

Anyway, if you are still with me and want to see how this works, the next step is probably to look at the SQL parser that I wrote this way for Mayfly: the tests (some of them) are at ParserTest and the parser is in Parser.

Saturday, February 10, 2007

Mayfly starts to get options, sooner than I thought it would

OK, a few weeks ago I mused about whether Mayfly was going to need some options (in that case, to set how it handles data types). Well, most of the reasons I imagine for options still remain in the future: "some day we may need/want this". But I did start adding an option to Mayfly. What was the first option? Some profound difference of philosophy about the data model which SQL should present? Maybe the ever-popular "should SQL be more relational?" or the subtle and deep issues around handling of SQL NULL?

No, nothing like that.

It is for case sensitive table names.

The SQL standard, and all databases I know except one, say that table names are case insensitive.

I say, "all databases except one". MIFOS of course is using that one (MySQL). And it is worse than "MySQL makes table names case sensitive". MySQL makes table names case sensitive only if file names are case sensitive (typically Unix). The practical effect of this is that half our team has case insensitive filenames and half has case sensitive ones, and the first group is often accidentally breaking the build (but only for the second group).

I thought a bit about various solutions, and there's a lot to be said for just having Mayfly run in case sensitive mode (on all platforms). So yeah, the CVS version of Mayfly has a method in Database called tableNamesCaseSensitive. Give me a few more days and most of Mayfly should honor it (large parts already do).
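In case it helps to picture what such an option does, here is a hypothetical sketch (class and method names are my invention, not Mayfly's real internals) of how a case sensitivity flag might change table name lookup:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TableStore {
    private final boolean caseSensitive;
    // Maps lookup key -> table name as originally declared.
    private final Map<String, String> tables = new HashMap<>();

    public TableStore(boolean tableNamesCaseSensitive) {
        this.caseSensitive = tableNamesCaseSensitive;
    }

    // In case insensitive mode, fold the name before using it as a key,
    // so FOO, Foo, and foo all find the same table.
    private String key(String name) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    public void createTable(String name) {
        tables.put(key(name), name);
    }

    public boolean hasTable(String name) {
        return tables.containsKey(key(name));
    }
}
```

With the flag off (the SQL-standard behavior), `createTable("foo")` followed by `hasTable("FOO")` succeeds; with it on (the MySQL-on-Unix behavior), the same lookup fails - which is exactly the mismatch that bites half the team.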

Yet another example of how it is hard to anticipate what will really be important (along with familiar cases like prioritizing software features only once you see what users look for and miss when they try to use a prototype, or only doing performance tuning once you have measured where the bottlenecks are).