Tuesday, December 19, 2006

Database upgrades

Well, my project for the last few weeks has been database upgrades. This concept has been popularized by rails and to a certain extent the Database refactoring book, but it is something that just about any project could do, without a whole lot of mechanism. It only took us (myself and Stelligent) a few days to implement an automated upgrade scheme.

I guess the best way to describe the design is just to point to the Wiki page.

Some of the key decisions were/are:

(1) sql vs java. Is an upgrade script just a .sql file, or is it written in a real programming language like Java/ruby/perl/etc? We went with the .sql approach because it seems like the Simplest Thing Which Could Possibly work, and it seems like introducing a programming language could get needlessly complicated and/or hard to understand. Having said that, I do realize that we may someday hit a situation where it is really awkward to process data via SQL. But I'd rather cross that bridge when we come to it, given how many upgrades so far have been very simple (insert default value into new column, create new table with no rows, etc).

(2) Whether to provide upgrade scripts for every checkin which changes the database schema, or on coarser boundaries like releases? The latter is
sufficient if the goal of this exercise is merely to allow production users to upgrade. Requiring developers to blow away their test data can be done, but the real question I have there is how is a developer supposed to know when they need to do this. I've been on projects where a significant amount of the chatter in the team room is "is anyone else seeing a failure x?" "rebuild your database". This doesn't really seem great even if everyone is in the same room, and on a distributed project like MIFOS, it seems to me quite important to at least detect an out of date database on developer machines (and once you've done that, might as well just do the upgrades).

(3) Whether to provide downgrade scripts. The purpose, I guess, is to let someone try out a new version of the software knowing they can always go back if it didn't work out. Given all we are trying to do in terms of having tests, continuous integration, etc, to make upgrades go smoothly without needing to roll back, I'm not sure how much effort to put into downgrades. On the other hand, it may be hubris to think downgrades will never be needed.

(4) What tests to build. The interesting tests in the MIFOS case are in LatestTest. There are tests of the upgrade scripts (checking against latest-*.sql), the downgrade scripts (checking each one undoes the corresponding upgrade script). We're also thinking about some tests which would test that the upgrade scripts properly upgrade existing user-supplied data (as opposed to just schemas and data supplied with MIFOS). But this post is getting long already, so I'll leave that for another time (or for people to ask about).

Thursday, December 14, 2006

MIFOS programmer intern position in Washington, DC

If you are a programmer looking for an internship in Washington, DC, check out Software Development Intern, Mifos. Much of the time would be spent pairing with me and/or working on some of the things I write about here.

SQL dumps and topological sorts

A few weeks ago I wrote a SQL dump utility for Mayfly. That is, take a database and dump it out as SQL statements, capable of re-creating the database. I played around with it for a while, thought I had it in a fairly complete state, and didn't think about it much until I was ready to use it (for some database upgrade work which I should write about too, but not now).

Turns out I forgot a key step: try out the dumper on some reasonably large and/or real world data set. In this case, the data set is the MIFOS SQL schema (and master data checked in, which has various records which the application itself needs). And the test is a fairly simple, automated one: dump out the data and try to reload it. This test worked fine with some small SQL scripts I wrote for the Mayfly test suite, but when I dumped the MIFOS schema/data, it failed because the foreign keys were out of order. That is, in the schema, a foreign key has to refer to a table which already exists (constraining the order in which we dump CREATE TABLE statements), and a row has to refer to a row which already exists (constraining the order in which we dump INSERT statements).

Sounds simple, right? Just sort the tables, using a java.util.Comparator which returns -1 or 1 based on whether there is a foreign key between the two tables, right? I'm slightly embarrassed to admit that I actually implemented this, found that it worked on a few small test cases, noticed it failed on MIFOS ("gee, that's funny, I'll need to look into that"), and proceeded to other problems (my immediate need did not demand that the SQL file be reloadable into a database, just that it represent what is in the database).

So here's an exercise for the reader: what was wrong with my implementation? (feel free to answer in the comments section if you want).

Anyway, after some research on Wikipedia, I realized that what I needed was a topological sort (and, fortunately, that I didn't also need something like the Floyd-Warshall algorithm).

Are there libre Java topological sort libraries out there? I didn't look very hard, but didn't see any (on apache commons, I saw something in the "commons sandbox", but it wasn't clear that it was ever finished, or that it even is still there).

I took an hour or two to understand the topological sort algorithm in Wikipedia, and another two hours or so to implement it, and it will probably be another hour or two to have the SQL dump utility call it, so I don't really have a pressing need for a library any more. That is, unless I find maintaining my implementation ends up being a lot of work, which could happen although I suspect I'm over the hard part in just getting to this point. It is one of those algorithms which seems scary until you understand it, but pretty simple thereafter. It's only a hundred or so lines of code and about twice that for tests (depending of course on what you count as the topological sort code and what you count just as setting up various things to send into it).

I was a little bit surprised that a topological sort was so unfamiliar to me. I guess this just means that we don't need it all that often, although when we do need it, we need it bad. By "need it bad", I mean my suspicion is that no amount of ad hoc algorithm building of the sort of "well, these two tables are out of order, let's try switching them" is likely to converge on a working, fast-enough algorithm. Not that I tried the ad hoc approach (unless you count my Comparator misstep). I wouldn't say this example undermines my faith in agile practices like evoluationary design or "running code speaks louder than words" (see for example the line "Too much talk, not enough code. Type!" in the bowling score pair programming example). But it does indicate that there are times when it pays off to stop coding long enough to go read up on how others have solved similar problems.