Tuesday, January 02, 2007

Matching Algorithms for CDI

It is generally understood that 2% to 5% of the records in a customer database are still undetected duplicates after a standard merge/purge routine is run. As stated in CRM Today, most merge/purge algorithms are 20 years old...

Most merge/purge processes were developed over 20 years ago and were never conceived to recognize the fluidity of movement, name change, and channels in which customers interact today. Even the most advanced de-duplication processes use character based logic and look up tables that are ill-equipped to assess the totality of a customers’ name and address permutations that accumulate through multiple customer interaction channels. These processes are easily deceived by minor variations in the name and address elements such as married/maiden, nick names, typos, and mis-keys. A typical file will contain 2 to 5% unidentified duplicate customers after a standard merge/purge process.
Instead of the common approach of using only one matching algorithm (I am guilty of this one), modern approaches use many algorithms to isolate duplicate records.
No single algorithm can efficiently and effectively power a matching technology due to the multiple culprits of customer identity and data quality error. Advanced solutions incorporate not one, but several advanced matching algorithms, each designed to group records into temporal data sets for the purpose of bringing visibility to distinct patterns of repetitious error. Once patterns of error are identified, the records can be referenced to consumer data sources, allowing for the remediation of the error and recognition of the true identity of the customer.
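
To make the "several algorithms" idea concrete, here is a minimal sketch in Python using only the standard library. It is not the method of any vendor mentioned in this post. The nickname table, the function names, and the 0.85 threshold are all assumptions chosen for illustration. One matcher is character based (good at typos and mis-keys), the other is token based (good at nicknames and reordered names), and a pair is flagged when either one scores high enough.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, expand known nicknames."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return [NICKNAMES.get(token, token) for token in cleaned.split()]


def char_similarity(a, b):
    """Character-based score in [0, 1]; tolerant of typos and mis-keys."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a, b):
    """Jaccard overlap of normalized tokens; tolerant of nicknames and word order."""
    tokens_a, tokens_b = set(normalize(a)), set(normalize(b))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_probable_duplicate(a, b, threshold=0.85):
    """Flag the pair if any single matcher is confident enough (assumed threshold)."""
    return max(char_similarity(a, b), token_similarity(a, b)) >= threshold


if __name__ == "__main__":
    pairs = [
        ("Jon Smith", "John Smith"),      # typo: caught by the character matcher
        ("Bob Smith", "Robert Smith"),    # nickname: caught by the token matcher
        ("Smith, John", "John Smith"),    # reordered: caught by the token matcher
        ("Alice Jones", "Robert Smith"),  # different people
    ]
    for left, right in pairs:
        print(left, "|", right, "|", is_probable_duplicate(left, right))
```

Neither matcher catches all three duplicate pairs on its own at this threshold, which is the point of running more than one.
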
An integral part of any CDI product is its matching capability. Match rules vary from organization to organization, so a standard template or built-in rule set usually will not suffice if you want thorough duplicate detection and handling.

Firstlogic (now Business Objects) has a sophisticated match engine. Beyond its standard match capabilities, it offers "Extended Matching", which allows the user to set up custom rules for different match scenarios. I believe that rule-based matching alone can solve 80% of your data consolidation issues.
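
As a rough illustration of what custom match rules can look like, here is a small Python sketch. It is not Firstlogic's Extended Matching or any product's API. The field names (`email`, `last_name`, `postcode`), the two rules, and their order are hypothetical and would come from your own organization's match requirements.

```python
def norm(value):
    """Lowercase and collapse whitespace; treat missing values as empty."""
    return " ".join(str(value or "").lower().split())


def same_email(a, b):
    """Rule 1 (assumed): identical, non-empty email addresses."""
    email = norm(a.get("email"))
    return bool(email) and email == norm(b.get("email"))


def same_surname_and_postcode(a, b):
    """Rule 2 (assumed): last name and postcode both present and equal."""
    for field in ("last_name", "postcode"):
        left = norm(a.get(field))
        if not left or left != norm(b.get(field)):
            return False
    return True


# Rules are tried in order; put the most trusted rule first.
RULES = [
    ("same_email", same_email),
    ("surname_postcode", same_surname_and_postcode),
]


def first_matching_rule(a, b, rules=RULES):
    """Return the name of the first rule that fires, or None if none do."""
    for name, rule in rules:
        if rule(a, b):
            return name
    return None


if __name__ == "__main__":
    jane = {"last_name": "Doe", "email": "JANE@example.com", "postcode": "90210"}
    married = {"last_name": "Smith", "email": "jane@example.com", "postcode": "10001"}
    no_email = {"last_name": "doe", "email": "", "postcode": "90210"}
    print(first_matching_rule(jane, married))
    print(first_matching_rule(jane, no_email))
```

Returning the rule name rather than a bare yes/no makes it easier to review why two records were consolidated and to tune the rule list over time.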

Companies like Purisma and Siperian offer much more sophisticated match engines that use match clusters and larger "match footprints", which ease the process. For the most part, though, if you know what you are trying to do, any rule-based matching engine will do. And remember:
"The computer is no better than the organization that feeds it." - L. Ron Hubbard
The moral: first work out all your match rules, then see whether your existing match engine software can solve your match-related problems before looking elsewhere.