30 January 2019

MD5: nice to see you here, old friend!

I recently assembled a new desktop computer system. In the course of moving my data to it from its predecessor, I managed to corrupt the database that my photo organizer (Shotwell) uses to manage my photo collection. I rebuilt a new database from the photos themselves, but this lost metadata like comments and edits that I'd applied. I hoped, though, that I'd be able to recover that information later from the older versions of the database tables. I managed to do this, thanks in part to one, er, key element in the database structure.

In Shotwell's database, each photo has a row in the PhotoTable, with many columns containing information about it. There's a unique ID for each photo, but the IDs generated as photos were imported into the earlier database couldn't be assumed to be the same as when I reimported them into the new database. It would clearly be a Bad Thing to apply the tags for photo #457 in the old database to photo #457 in the new database. What to do?

Looking at the PhotoTable columns, I noticed that each photo had an entry for an MD5 hash of the image. Hash functions are great and useful things. It's unlikely (and, I mean, highly, probabilistically, unlikely) that I'm going to encounter two different images in my collection that yield the same MD5 value. (Even though MD5 isn't recommended today for security-relevant applications, it's still doing its job here in distinguishing among image files that came out of my cameras, which haven't generally acted as hostile attackers.) I expect that Shotwell's code uses the stored MD5 as a quick and effective means to determine whether or not a photo has already been imported into its database. When I saw the MD5 column in the table, I realized that it also provided me with a means to find the correspondence between photo entries and their IDs in the old table with their entries in the new table. Thusly armed, SQL of this form followed:

REPLACE INTO PhotoTable ( named-columns )
SELECT named-columns
FROM old-PhotoTable src
INNER JOIN PhotoTable dest ON src.md5 == dest.md5

which took less than a second to replace corresponding metadata into a table representing about 23,000 photos. I restarted Shotwell with the resulting table, and found my edits accurately restored. I was pleased to have been able to accomplish this. I was glad to have been using an open source organizer with an accessible and documented database representation, and emerged with refreshed respect for the power and value of hash functions.

2 comments:

  1. Something I like about my Wordpress sites is that they support silly little passive social media things, like signalling "like" even if you have nothing interesting to say about the post.

    This is a delightful application for MD5 (or maybe even MD4).

    ReplyDelete
  2. Yes. I also like using it to see whether copies of (potentially large) files I’ve placed on different systems match, without having to move their bits back and forth just to be able to compare them. I suppose that in principle someone could construct a collision and insert a rogue file, but that seems an unlikely threat within my personal domain.

    ReplyDelete