Monday, January 15, 2007

Machine Learning and Artificial Intelligence

Machine learning is a topic that has come up a lot during my CDI (Customer Data Integration) research, specifically when diving into the domain of probabilistic matching. From the Initiate website:

Probabilistic matching

Probabilistic matching uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links than deterministic systems between records that have more complex typographical errors and error patterns.

Typically, probabilistic systems assign a percentage (such as 75 percent) indicating the probability of a match. Because these systems pinpoint variation and nuances to a much finer degree than a deterministic approach, they are better suited for businesses that have complex data systems with multiple databases. Due to the size of these data systems, the potential for duplicates, human error and discrepancies is far greater, making a system designed to establish links between records with complex error patterns much more effective.
Probabilistic matching lets one use match scores and percentages from a field-by-field comparison to determine a match. It outputs three categories, whose boundaries are set by the user based on the overall probability percentage:

1) Match - that can be automatically merged
2) Candidate Match - requiring manual review
3) Non Match
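
To get a feel for how the three categories fall out of a field-by-field comparison, here is a minimal Python sketch. It assumes a simple likelihood-ratio scheme in which each field has a probability of agreeing when two records are the same person (m) and when they are not (u). The field names, the m and u values, the prior, and the two thresholds are all made up for illustration; they are not taken from Initiate or any other product.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_date": (0.97, 0.003),
    "zip_code": (0.85, 0.10),
}


def match_probability(rec_a, rec_b, prior=0.01):
    """Combine field-by-field likelihood ratios into a match probability.

    `prior` is an assumed share of true matches among all compared pairs.
    """
    log_odds = math.log(prior / (1 - prior))
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a.get(field) == rec_b.get(field):
            log_odds += math.log(m / u)              # agreement raises the odds
        else:
            log_odds += math.log((1 - m) / (1 - u))  # disagreement lowers them
    return 1 / (1 + math.exp(-log_odds))


def classify(prob, upper=0.90, lower=0.20):
    """Map a probability to one of the three categories (user-set thresholds)."""
    if prob >= upper:
        return "match"        # can be merged automatically
    if prob >= lower:
        return "candidate"    # goes to manual review
    return "non-match"
```

Calling `classify(match_probability(a, b))` on two record dictionaries returns one of the three labels. Exact string equality is the crudest possible field comparison; a real system would score near-misses such as typos as partial agreement.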

The topic of machine learning and artificial intelligence enters in with the manual review process for the candidate matches. The idea is that the computer can learn from the users' decisions about what was manually determined to be a match or a non-match, and build these into its future decisions and probability scores.
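
One simple way to picture this feedback loop is to re-estimate, from the reviewed pairs, how often each field agrees among confirmed matches and among confirmed non-matches. The Python sketch below is only an illustration of that idea; the function name and data layout are hypothetical, and commercial products may well learn from reviewer decisions in a completely different way.

```python
def estimate_field_params(reviewed_pairs, fields):
    """Re-estimate per-field agreement rates from reviewer decisions.

    reviewed_pairs: list of (agreements, is_match) tuples, where
      agreements maps field name -> bool (did the field agree?) and
      is_match is the reviewer's verdict for that pair.
    Returns {field: (m, u)} where m is the agreement rate among matches
    and u is the agreement rate among non-matches.
    """
    n_match = sum(1 for _, is_match in reviewed_pairs if is_match)
    n_non = len(reviewed_pairs) - n_match
    params = {}
    for field in fields:
        agree_match = sum(1 for agr, is_match in reviewed_pairs
                          if is_match and agr[field])
        agree_non = sum(1 for agr, is_match in reviewed_pairs
                        if not is_match and agr[field])
        # Add-one smoothing keeps the estimates away from 0 and 1
        # when only a few pairs have been reviewed.
        m = (agree_match + 1) / (n_match + 2)
        u = (agree_non + 1) / (n_non + 2)
        params[field] = (m, u)
    return params
```

Each batch of reviewed candidates can be fed back through a function like this, so the scores used for the next round of matching reflect what the reviewers actually decided.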

"If an expert system--brilliantly designed, engineered and implemented--cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten."
- Oliver G. Selfridge, from The Gardens of Learning

"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever."
- Oliver G. Selfridge, in AI's Greatest Trends and Controversies

Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness.

(read full article on Machine Learning here)


Companies like Purisma and Siperian offer machine learning techniques built into their software.

In this setting, a primary goal of machine learning is to reduce the amount of manual review needed to determine matches and to continuously improve the software's ability to accurately detect and consolidate duplicate records.

An open source tool that can be used for probabilistic record matching is Febrl (Freely Extensible Biomedical Record Linkage). It is written in Python, and anyone can download it from sourceforge.net and get their feet wet with probabilistic matching.