Thursday, December 14, 2006

SQL dumps and topological sorts

A few weeks ago I wrote a SQL dump utility for Mayfly. That is, take a database and dump it out as SQL statements, capable of re-creating the database. I played around with it for a while, thought I had it in a fairly complete state, and didn't think about it much until I was ready to use it (for some database upgrade work which I should write about too, but not now).

Turns out I forgot a key step: try out the dumper on some reasonably large and/or real world data set. In this case, the data set is the MIFOS SQL schema (and master data checked in, which has various records which the application itself needs). And the test is a fairly simple, automated one: dump out the data and try to reload it. This test worked fine with some small SQL scripts I wrote for the Mayfly test suite, but when I dumped the MIFOS schema/data, it failed because the foreign keys were out of order. That is, in the schema, a foreign key has to refer to a table which already exists (constraining the order in which we dump CREATE TABLE statements), and a row has to refer to a row which already exists (constraining the order in which we dump INSERT statements).

Sounds simple, right? Just sort the tables, using a java.util.Comparator which returns -1 or 1 based on whether there is a foreign key between the two tables, right? I'm slightly embarrassed to admit that I actually implemented this, found that it worked on a few small test cases, noticed it failed on MIFOS ("gee, that's funny, I'll need to look into that"), and proceeded to other problems (my immediate need did not demand that the SQL file be reloadable into a database, just that it represent what is in the database).

So here's an exercise for the reader: what was wrong with my implementation? (feel free to answer in the comments section if you want).

Anyway, after some research on Wikipedia, I realized that what I needed was a topological sort (and, fortunately, that I didn't also need something like the Floyd-Warshall algorithm).

Are there libre Java topological sort libraries out there? I didn't look very hard, but didn't see any (on apache commons, I saw something in the "commons sandbox", but it wasn't clear that it was ever finished, or that it even is still there).

I took an hour or two to understand the topological sort algorithm in Wikipedia, and another two hours or so to implement it, and it will probably be another hour or two to have the SQL dump utility call it, so I don't really have a pressing need for a library any more. That is, unless I find maintaining my implementation ends up being a lot of work, which could happen although I suspect I'm over the hard part in just getting to this point. It is one of those algorithms which seems scary until you understand it, but pretty simple thereafter. It's only a hundred or so lines of code and about twice that for tests (depending of course on what you count as the topological sort code and what you count just as setting up various things to send into it).

I was a little bit surprised that a topological sort was so unfamiliar to me. I guess this just means that we don't need it all that often, although when we do need it, we need it bad. By "need it bad", I mean my suspicion is that no amount of ad hoc algorithm building of the sort of "well, these two tables are out of order, let's try switching them" is likely to converge on a working, fast-enough algorithm. Not that I tried the ad hoc approach (unless you count my Comparator misstep). I wouldn't say this example undermines my faith in agile practices like evoluationary design or "running code speaks louder than words" (see for example the line "Too much talk, not enough code. Type!" in the bowling score pair programming example). But it does indicate that there are times when it pays off to stop coding long enough to go read up on how others have solved similar problems.

2 comments:

Brian said...

Actually, it's worse than that: it's possible to create cycles between tables with foreign keys, so some database schemas can't be created using only create table statements. If you create the foreign key constraints last using 'alter table add foreign key', the problem becomes much easier (but the cost is that error-checking happens late).

Jim Kingdon said...

Brian: You are right about the cycles (I try to keep my articles as short/simple as possible, which means I often leave out some details). Now that I have a topological sort, Mayfly will detect cycles and refuse to dump in that case. Someday perhaps I will have a need for it to be smart enough to create a dump which uses ALTER TABLE (or in the case of rows, UPDATE - there are examples of both in the Mayfly test suite: see SqlDumperTest).