Friday, July 20, 2012

Parsing large XML files

Every once in a while, we need to parse large XML files. Here "large" means that the file won't fit in memory, so we can't just suck it in using nokogiri (or our favorite in-memory XML library). SAX is fine as a low level parser to hand tell you where the tags start and end, but trying to do any significant processing will turn into spaghetti unless you have a bit of a framework. The last time I visited this topic, I ended up writing a library, saxophone, which invoked callbacks when it encountered certain named tags. Saxophone is sitting in an obscure git repository; I could put it up as a gem if someone wants it; the big question is whether there is something better out there. The wasabi WSDL parser has been trying their own mini-framework (partially special purpose) described at this issue. But probably the best I've seen so far is sax-machine (specifically, the lazy option thereto). I haven't spent much time playing with it (at least not yet), but it seems like a better starting point than starting from scratch with a new gem. If you do end up writing code directly on top of SAX, just remember this: keep a stack of start tags and end tags. Following this idiom might cut down on the buggy spaghetti that I've seen when I've tried to do without something like saxophone or sax-machine. Update: I fixed the above link to the wasabi issue, which had changed. Not sure how long-lived any of these links are going to be, but here's another one: lib/wasabi/sax_parser.rb from the sax-parser branch. The key is the stack (pushed on start tag, popped on end tag) and the matchers.