Sunday, January 28, 2007

Record Linkage and List Quality

What is Record Linkage:

Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.


Other names that mean the same thing: entity heterogeneity, entity identification, object isomerism, instance identification, merge/purge, entity reconciliation, list washing, match/consolidate and data cleaning. I like the term "record linkage" and will use it from this point forward in this blog.

It seems clear that if you want to be thorough in your record linkage efforts, you would implement a combination of deterministic and probabilistic matching methodologies. A straightforward name and address match, such as the Firstlogic (now Business Objects) approach, will usually be sufficient if you are a list vendor or mailhouse. But if you are at all serious about identity management, you will step into the deep end and implement a probabilistic matching method. I've downloaded Febrl, the open source probabilistic matching tool, but have yet to experiment with it.
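To make the deterministic-plus-probabilistic idea concrete, here is a minimal Python sketch. It is my own illustration, not how Firstlogic or any other product works. The field names, the m and u probabilities and the two thresholds are all made-up assumptions; in a real project you would estimate them from your own data. The probabilistic part follows the usual agreement/disagreement log-weight scoring from the standard model.

```python
import math

# Hypothetical m/u probabilities per field (illustrative values only).
# m = P(field agrees | the two records are the same entity)
# u = P(field agrees | the two records are different entities)
FIELD_PARAMS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.95, 0.005),
    "zip": (0.90, 0.05),
}

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())

def deterministic_match(a, b, fields=("first_name", "last_name", "address", "zip")):
    """Exact match on every listed field; empty fields never match."""
    for field in fields:
        va, vb = normalize(a.get(field, "")), normalize(b.get(field, ""))
        if not va or va != vb:
            return False
    return True

def match_weight(a, b):
    """Sum of log2 agreement/disagreement weights over the scored fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        va, vb = normalize(a.get(field, "")), normalize(b.get(field, ""))
        if not va or not vb:
            continue  # a missing value contributes no evidence either way
        if va == vb:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(a, b, upper=8.0, lower=0.0):
    """Try the deterministic rule first, then fall back to the weight score."""
    if deterministic_match(a, b):
        return "match"
    weight = match_weight(a, b)
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"  # the grey zone goes to a clerical review queue

rec_a = {"first_name": "Jose", "last_name": "Garcia", "address": "12 Main St", "zip": "90001"}
rec_b = {"first_name": "Jose", "last_name": "Garcia", "address": "12 Main Street", "zip": "90001"}
print(classify(rec_a, rec_b))
```

The two records above fail the exact match because of "St" versus "Street", which is exactly the kind of pair the weighted score is meant to catch.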

But whatever method you use or software package you buy, the quality of your record linkage always comes down to how well you have configured your rules. This is not an off-the-shelf solution; it requires work.

I bet most organizations' record linkage problems could have been avoided if enough foresight and initiative had been put into the original database systems to establish a unique key.

Well, I guess hindsight is, in fact, 20/20.

One product that definitely helps link a higher percentage of records is SSA-NAME3. As you'll see from their site, they have developed a name search tool that uses probabilistic matching but factors in how common or uncommon the name is. For example, matching two Jose Garcias in the city of Los Angeles is not the same as matching two Jose Garcias in Iceland. In Iceland they are probably the same person; in LA, most likely not.
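Here is a rough Python sketch of why name frequency matters. This is not SSA-NAME3's algorithm, which I have not seen; it is just the textbook idea of using the name's relative frequency as the chance that two unrelated records agree on it. The function name, the m value and the counts for Los Angeles and Iceland are invented for illustration.

```python
import math

def name_agreement_weight(name_count, population, m=0.95):
    """Log2 weight for two records agreeing on a name.

    u, the chance that two unrelated records share the name, is approximated
    by the name's relative frequency in the population being searched.
    m is an assumed probability that true matches agree on the name.
    """
    u = max(name_count, 1) / population
    return math.log2(m / u)

# Made-up counts, purely for illustration.
weight_la = name_agreement_weight(name_count=2000, population=4_000_000)
weight_iceland = name_agreement_weight(name_count=2, population=300_000)

print(f"Los Angeles: {weight_la:.1f}")
print(f"Iceland:     {weight_iceland:.1f}")
```

The rarer the name is in the population you are searching, the more an agreement on it should count toward a match.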