Thursday, May 22, 2008

Parse XML with saxophone

Programmers often want to parse an XML document and perform some actions as we do (for example, build up an in-memory data structure, write data to a database, print output to the console, etc).

For the most part, there have only been two well-known ways of doing this. The first is to read the XML document into a DOM, which is an in-memory tree representing the document. Then you walk the DOM tree doing your own processing. This is usually a pretty convenient way to go. But it has several downsides, the most obvious and probably biggest of which is that if the XML document is too big to fit in memory it won't work.

The second approach is a streaming API such as SAX or xmlpull. SAX calls you whereas you call xmlpull, but either way you are getting a low-level stream of events (start tag, end tag, and text being the most important). For simple tasks, this isn't so bad. But when you have to assemble data from a few different parts of the document, you need to set variables to keep track of what you've gotten and what you are waiting for, and it is possible to make yourself a pretty big pile of spaghetti. My co-workers and I had such a problem recently, and our answer was a small library, which we call saxophone. Saxophone sits on top of SAX, and converts the raw stream of events into calls to handlers, and lets the handlers return data which get passed to other handlers. The current implementation is in ruby, although the ideas should port to other languages if anyone has such a need.

Here's a simple example. We want to parse a web feed and print out each of the titles. That is, there will be, buried somewhere in the XML, something like <title>My Dog has Fleas</title> and we want to print out "My Dog has Fleas" and likewise for every other occurrence of a title tag.

The full example can be found in the examples directory of saxophone, but the key part is:


parser = Saxophone.new(
:title => lambda { | element | puts element.all_text() }
)


This is saying that any time you see a "title" tag, call this handler, and provide it all the text directly under title via the all_text() method.

This example doesn't demonstrate everything. Handlers can return values, which are then available to higher handlers. Handlers also have access to attributes directly.

If people want to hear more about it, I can write some more documentation and examples for some of these other features. But the key thing here is that we have a fairly concise way to (a) match the parts of the XML that we care about, and (b) synthesize the results of those matches into larger data structures (but only as far as we want - after we have gotten everything we need to, for example, write a database row, we can return that memory to the system). This is all done in a streaming fashion. That is, we don't need to store the whole document, or the whole result, in memory.

The whole thing kind of reminds me of XSLT or XQuery. In know in XSLT for sure, and probably XQuery, there are some tasks that turn out to be really awkward, and that's probably true of saxophone as well. But saxophone seems like a good match for a few of the problems we've had. And of course, having a library within your regular programming language (in this case Ruby) can also be a plus over a special-purpose language.

Saxophone is free software, available here as part of the pivotalrb project on rubyforge, which contains a variety of open source code released by the ruby/java consulting firm Pivotal. I and my pair wrote saxophone as part of a Pivotal project.

1 comment:

Anonymous said...

You may also want to look at vtd-xml,the latest and most advanced xml technology

http://vtd-xml.sf.net