Saturday, May 13, 2006

All mashed up.

I must have been looking sideways just lately -- I have been busy! -- because I was surprised this week by several stories about a new idea in databases. Coté over at Redmonk posted links to several stories on his linkroll. Daniel Druker and Robert Rich wrote an article about it in DB2 Magazine, and Bill Snyder piled on in a story he wrote for

This new idea is called Master Data Management, and if you buy the momentum stories, it's the Next Big Thing.

I don't think that the emperor is entirely naked, here, but his mother ought not to have let him leave the house dressed that way. Master Data Management is an old idea in database systems, and all our experience so far says that it's unbelievably hard to do well.

First, though, some context.

Of all the ideas that are currently burbling around in the Web 2.0 cauldron, I personally find mashups to be the most compelling. The best mashups that I have seen so far are based on the Google Maps API. People are building sites that show the locations of their first kiss, local public libraries, sex offenders living in the area and more. The idea is to use simple, standard web-based interfaces to combine data from one site with map data from another. Once you internalize this idea, you realize that there are lots of different data sources out there that you'd like to tie together.

The old-fashioned name for this discipline among database researchers is federated databases. The idea is to take a collection of databases, created and maintained by different organizations for different purposes, and combine the information that they store in interesting ways. Much research money, and some investment capital, has been plowed into this idea, with (so far) no big bang. Those efforts have not been a complete bust, but in more than a quarter century of work, no single general-purpose technique has been discovered that works well.

The problem is that the different groups who build and maintain these databases collect and store information with different assumptions. Is my first name "Mike" or "Michael"? Are the prices you publish in euros or yen? Are dates represented in American or European format? The answers make a difference if you're combining records from different sources.
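To make the date question concrete, here is a minimal sketch in Python. The helper name `parse_date` and the sample string are my own illustrations, not taken from any real system. It shows the same stored text turning into two different days depending on which convention the reader assumes.

```python
from datetime import datetime

# Hypothetical helper: the caller must state which convention the source uses,
# because the text alone does not say.
def parse_date(text, convention):
    """Parse a slash-separated date under an explicit convention."""
    formats = {"american": "%m/%d/%Y", "european": "%d/%m/%Y"}
    return datetime.strptime(text, formats[convention]).date()

raw = "05/06/2006"  # made-up sample value
as_american = parse_date(raw, "american")  # month first: 6 May 2006
as_european = parse_date(raw, "european")  # day first: 5 June 2006
```

Neither reading raises an error, which is the trouble: a mismatch like this passes silently unless each source declares its convention.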

Worse, the combined data is generally less reliable than the data in any single database. If my phone number is wrong in one database, and my age is wrong in another, then the combination is wrong in two particulars, not just one.
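A rough back-of-the-envelope version of that argument, as a sketch: assume each source gets its field right with some probability, and assume the errors are independent of one another. Both are simplifying assumptions of mine, and the 95% figure is invented for illustration. Under those assumptions, the chance that the combined record is right in every particular is the product of the individual accuracies.

```python
# Hypothetical helper: multiply per-source accuracies, assuming independent errors.
def combined_accuracy(accuracies):
    """Probability that every contributing field is correct at once."""
    result = 1.0
    for accuracy in accuracies:
        result *= accuracy
    return result

# Two sources, each assumed (for illustration only) to be right 95% of the time.
two_sources = combined_accuracy([0.95, 0.95])

# Five such sources joined together.
five_sources = combined_accuracy([0.95] * 5)
```

With two sources the combined record is fully correct only about 90% of the time, and with five the figure drops below 80%, even though no single database got any worse.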

While inaccuracies like that may seem unimportant, they can matter a great deal. One of the example stories for the success of MDM is the casino that recognized a card cheat by a match on his telephone number with a different casino's employee database. Think about it: Do you know anyone who has written your telephone number down wrong? Would you want the companies you do business with to make decisions about you based on information that may be wrong, and that you can't review and correct?

It's absolutely possible to handle these issues, especially for single companies combining data that's all under their control. The established database vendors all offer products that do this, but they require careful analysis and considerable effort on installation. The information they operate on needs curation.

Mashups are much too powerful an idea to constrain to mapping apps. We'll see more, and more interesting, examples. Some will certainly tie together legacy data from a variety of sources, including relational database systems. This is an old technique, though, with a lot of practical experience highlighting problems in the field. Web 2.0 apps that use the technique but ignore the experience are going to deliver wrong answers.

Don't believe everything you read on the internet.

