Friday, July 20, 2012

Parsing large XML files

Every once in a while, we need to parse large XML files. Here "large" means that the file won't fit in memory, so we can't just suck it in using nokogiri (or our favorite in-memory XML library). SAX is fine as a low-level parser to tell you where the tags start and end, but trying to do any significant processing will turn into spaghetti unless you have a bit of a framework.

The last time I visited this topic, I ended up writing a library, saxophone, which invoked callbacks when it encountered certain named tags. Saxophone is sitting in an obscure git repository; I could put it up as a gem if someone wants it. The big question is whether there is something better out there. The wasabi WSDL parser has been trying its own mini-framework (partially special purpose), described at this issue. But probably the best I've seen so far is sax-machine (specifically, its lazy option). I haven't spent much time playing with it (at least not yet), but it seems like a better starting point than starting from scratch with a new gem.

If you do end up writing code directly on top of SAX, just remember this: keep a stack of the tags you are currently inside, pushing on each start tag and popping on each end tag. Following this idiom might cut down on the buggy spaghetti that I've seen when I've tried to do without something like saxophone or sax-machine.

Update: I fixed the above link to the wasabi issue, which had changed. Not sure how long-lived any of these links are going to be, but here's another one: lib/wasabi/sax_parser.rb from the sax-parser branch. The key is the stack (pushed on start tag, popped on end tag) and the matchers.
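Here is a minimal sketch of the stack idiom, not the code from saxophone or wasabi. To keep it runnable without extra gems it uses REXML's stream parser from the Ruby distribution instead of nokogiri's SAX interface; the StackListener class, its on method, and the vendors/vendor/name document are all made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: tracks the currently open tags on a stack and
# invokes a callback when the end tag of a registered path is reached.
class StackListener
  include REXML::StreamListener

  def initialize
    @stack = []      # names of the elements we are currently inside
    @buffers = []    # one text buffer per open element, parallel to @stack
    @callbacks = {}  # "a/b/c" path => block
  end

  # Register a block to be called with the text of each element at +path+.
  def on(path, &block)
    @callbacks[path] = block
  end

  # Start tag: push.
  def tag_start(name, _attrs)
    @stack.push(name)
    @buffers.push(+'')
  end

  # Text may arrive in several chunks; append to the innermost open element.
  def text(chunk)
    @buffers.last << chunk unless @buffers.empty?
  end

  # End tag: look up the matcher for the current path, then pop.
  def tag_end(_name)
    path = @stack.join('/')
    body = @buffers.pop
    @stack.pop
    callback = @callbacks[path]
    callback.call(body.strip) if callback
  end
end

names = []
listener = StackListener.new
listener.on('vendors/vendor/name') { |text| names << text }

xml = '<vendors><vendor><name>Acme</name></vendor>' \
      '<vendor><name>Globex</name></vendor></vendors>'
REXML::Document.parse_stream(xml, listener)
```

For a file that really is too large for memory, the same listener can be fed from an open File instead of a string, so that only the stack and the current text buffers are held at any one time.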

Wednesday, December 07, 2011

count.count

Sometimes you pick a programming idiom because it is what you are familiar with, because you think it is expected, or because it expresses clearly what the code you are writing is trying to do. Other times, it is just too hard to resist. Lately at work at least two of us have seen .count.count in our rails3 code, and at first we were sure it must be a typo. The real story is more fun than that; see the nerdfeed blog for more.

Thursday, July 28, 2011

Using active record in rails migrations

Most rails developers have probably run into this problem sooner or later: if your migrations refer to active record classes and the active record classes change out from under the migration, old migrations won't work as desired any more. Whether this is a big problem or a minor annoyance depends on how often you run migrations, how many databases you have (typically one for each developer and one or more you deploy to), etc., but I've seen the problem even over the course of three developer machines and a day or two, as some refactoring made people unable to update their code and then run an only-slightly-older migration.

One solution, advocated in the "Data migrations" section of Code review: Ruby and Rails idioms, is just to fall back to writing migrations in SQL, bypassing active record (with the exception of the low-level parts of active record which connect to the database). This has two problems. The first is that active record doesn't help you a lot with this kind of low-level SQL construction. The example in that blog post uses string interpolation to construct SQL, which they can get away with in that example (because the columns are integers) but which blows up as soon as the quoting isn't correctly handled (in a migration, this is probably just a bug rather than a security hole, but search "SQL injection" if you are unfamiliar with the problems). The second problem is that active record just is a more expressive way to manipulate data. How many people use script/console rather than script/dbconsole to look around the database?
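To make the quoting problem concrete, here is a small sketch in plain Ruby. The vendors table, its name column, and the quote_sql_string helper are made up for illustration; the helper only doubles single quotes, which is the standard SQL way to escape them, and in a real migration I would expect you to call the connection's own quote method (ActiveRecord::Base.connection.quote) rather than roll your own.

```ruby
# Hypothetical helper for illustration only: wrap a string in single quotes,
# doubling any single quotes inside it.
def quote_sql_string(value)
  "'" + value.gsub("'", "''") + "'"
end

name = "O'Reilly"

# Naive interpolation: the apostrophe ends the SQL string literal early,
# so the database sees a syntax error (or worse).
naive = "UPDATE vendors SET name = '#{name}' WHERE id = 1"

# With quoting, the apostrophe is escaped and the statement is well formed.
quoted = "UPDATE vendors SET name = #{quote_sql_string(name)} WHERE id = 1"

puts naive
puts quoted
```

With integer columns the naive form happens to work, which is why the problem is easy to miss until the first string value with an apostrophe comes along.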

My recommended solution, also advocated in How to use models in your migrations (without killing kittens), is to define the classes within the migration. There's an example in that blog post, but the short summary is that if, for example, your migration wants to refer to Vendor, you put "class Vendor < ActiveRecord::Base; end" within the migration class. In some cases you might need to define a few has_many or belongs_to relationships (make sure to set class_name to refer to the migration-specific class), but the interesting (and, to me, surprising) thing is that in practice you don't need a whole lot of them. To give a few examples of what this gets you, think of things like calling find_or_create_by_name to skip creating a record if it already exists, or looking up an object by name and then using its ID in a subsequent SQL statement. If you are thinking "but I can do that in SQL", then I'm not sure I should try to convince you. But if you are thinking "yeah, that is easier / more-concise / more-readable in active record", then defining your classes in the migration gives you that, and also lets you run migrations even after your code has continued to evolve.
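The reason this works is ordinary Ruby constant lookup: inside the migration class, Vendor resolves to the class nested in the migration before Ruby ever looks at the top-level one from app/models. The sketch below shows just that mechanism in plain Ruby so it runs without rails; AddDefaultVendor and the source method are made-up stand-ins, and in a real migration the outer class would inherit from ActiveRecord::Migration and the inner one from ActiveRecord::Base.

```ruby
# Stand-in for app/models/vendor.rb, which keeps changing as the app evolves.
class Vendor
  def self.source
    'app model'
  end
end

# Stand-in for a migration class (hypothetical name).
class AddDefaultVendor
  # Migration-local model. In rails this would be the one-liner
  #   class Vendor < ActiveRecord::Base; end
  class Vendor
    def self.source
      'migration-local model'
    end
  end

  def self.up
    # Lexical scope wins: this is AddDefaultVendor::Vendor, not ::Vendor.
    Vendor.source
  end
end

puts AddDefaultVendor.up
puts Vendor.source
```

Because the migration only ever sees its own bare Vendor, later validations, callbacks, or renames in the application's model can't break it.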