Sunday, December 23, 2007

Record Linkage Resources

Download whitepaper on "Approximate Duplicate Detection" which opens with the following:

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple
databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system to detect approximate duplicate records in a database and provide properties that a pair-wise record matching algorithm must have in order to have a successful duplicate detection system.

Download paper on probabilistic record linkage and an automated procedure to minimize the undecided-matched pair problem. Here is an introduction to this paper:

Probabilistic record linkage allows the assembling of information from different data sources. In this article, we present a procedure in case a one-to-one relationship between records in different files is expected but is not found only by applying the probabilistic record linkage methodology. Our data were births and infant deaths from the 1998-birth cohort whose mother’s place of residence was the City of São Paulo at the time of birth. Our assumption was that pairs for which a one-to-one relationship was obtained, and a best-link was found with the highest combined weight would be considered as uniquevocally matched pairs or gold-standard and should then provide information in order to decide about pairs in which such a relationship could not be established. For example, we observed that the for the unequivocally matched pairs a clear and expected relationship between differences in dates of death and birth registration could be assessed. As a result, such a relationship was used to help solving the remaining pairs for which a one-to-one relationship could not be found. Indeed, we reduced the number of non-uniquely matched records and even though we could not establish a one-to-one relationship for every single death we reduced the number of uncertain. We suggest that future research using record linkage should use combined strategies from results from first record linkage runs before a full clerical review (the standard procedure under uncertainty) in order to most efficiently (and less costly), retrieve record matches.

Applications of Record Linkage Techniques

Here is an excerpt:

Record linkage is simply the bringing together of information from two or more records that is believed to relate to the same entity, e.g., the same individual, the same family or the same business. This may entail the linking of records within a single computer file to identify duplicate records. Alternatively, record linkage may entail the linking of records across two or more files. The challenge lies in bringing together the records for the same individual entities. Such a linkage is known as an exact match. The task is easiest when 1) the files have nearly unique identification numbers (e.g., social security numbers), 2) information is recorded in standardized formats and 3) the files are small. In the absence of nearly unique identifiers, names, addresses, dates-of-birth or other indirect identifiers are frequently used in the matching process.

record linkage description

Here is a good description of "Record Linkage" from the University of Manitoba website:

Record linkage is simply the integration of information from two independent sources. Records from the two sources that are believed to relate to the same individual are matched in such a way that they may then be treated as a single record for that individual. Records brought together in this way are said to be linked. The principles of record linkage may be applied to any field in which it is necessary to bring together information recorded about persons in different places or at different times.

Thursday, December 20, 2007

What is the Perfect Computer?

As a computer programmer, I was intrigued by the opening chapters of Dianetics, the Modern Science of Mental Health, by L. Ron Hubbard claiming that man how the "Analytical Mind" is in fact a perfect computer. It makes calculations based on the data it has. It is ALWAYS RIGHT.

But like a computer program with unfortunate bugs, the calculation may come out with a wrong conclusion, but based on data it has, the program is still "right".

There are two parts to one's mind. The Analytical and the Reactive Mind. Dianetics says that the Analytical Mind IS the Perfect Computer.

"What would you want in a computing machine? The action of the analytical mind—or analyzer—is everything anyone could want from the best computer available. It can and does do all the tricks of a computer. And over and above that, it directs the building of computers. And it is as thoroughly right as any computer ever was. The analytical mind is not just a good computer, it is a perfect computer. It never makes a mistake. It cannot err in any way so long as a human being is reasonably intact (unless something has carried away a piece of his mental equipment).

"The analytical mind is incapable of error, and it is so certain that it is incapable of error that it works out everything on the basis that it cannot make an error. If a person says, “I cannot add,” he either means that he has never been taught to add or that he has an aberration about adding. It does not mean that there is anything wrong with the analytical mind."
- L. Ron Hubbard, Dianetics, The Modern Science of Mental Health

If this were true, how is it possible that people make mistakes, wrong decisions and don't live "perfect" lives? Or more to the point, how does one prevent writing buggy computer software? The answer lies in Dianetics and specifically in the chapter on the Reactive Mind. Dianetics should be required reading for any software developer. It would not only result in less buggy software being developed, it would make for a world of less stressed out programmers!

Anyone interested in Dianetics can get a copy of this book here.

Tuesday, October 30, 2007

Format your Code for Blogger

Finally, I found a solution of how to easily format my Source Code examples for Blogger. I've seen many examples of how other blog services handle this, such as Wordpress, but I don't use Wordpress...

Here is how to easily do this, until Blogger comes up with an easier way:

Copy your source code example and paste it into this page:

Source Code Formatter for Blogger

Click on the "Format Code" button and copy the results.

Switch to "Edit Html" view in your Blogger Edit Post page and paste in the formatted code example.

Here is what a sample PL/SQL example would looks like:

FOR cur IN (select id from test)
UPDATE table SET column = "DONE" WHERE id = where;

Thanks to Greg Houston for this simple solution.

Friday, October 26, 2007

How to Use JList - for Dummies

I've built a few Java programs and am still not proficient with my tools, so I am laying out a series of simple instructions for learning and using the basic Java Swing tools. This will reference the websites where I go for direction as well as examples that I've found useful.

The first place I always start is the Java tutorials on using Swing:

How to Use Lists

I ran into a situation where I had to remove multiple elements from a JList. Here is the exact trick on how to do that. It worked perfectly:

Remove Multiple Items from a List

Thursday, October 25, 2007

Percentage of Americans who do not fill out a Change of Address with USPS

Here is an excerpt giving an estimate of how many Americans move each year that do not fill out a change of address request with the post office. These address changes would not be returned with NCOA and would require other means of address recovery such as a "Deluxe" NCOA service that utilizes other address change databases such as magazine subscription lists, utility company databases and even the "Pizza Hut" database they maintain of deliveries they've made and address changes they've maintained.

One of the fallacies surrounding consumer list compilation is that the U. S. Postal Service’s National Change of Address file is the answer to this problem. The fact is, however, that NCOA, while helpful, falls far short of the accuracy that most of us would prefer. An estimated 30 percent of the people who move in the United States never file a change-of-address form with the postal service. And even when they do, it is sometimes weeks or even months before that information is made available to list compilers or the direct marketers who use the lists.

Competent list managers have created routines that address this problem. For example, if there is a change of address on a current, reliable source (such as new billing addresses for a charge account or new utility hookups) the new address might be placed in the consumer record. This type of effort can significantly improve address quality on a mailing list – but it cannot ever be perfect.

Wednesday, October 24, 2007

Jlist or Jtable - I need a multi-column display

I'm gradually coming to grips with programming user interfaces in Java. Read a bit on the MVC (Model View Control) concept and while I see its value, it is a bit too steep a gradient at this point. I am starting off getting the basics of the Swing GUI and have been using Jgoodies with the Jigloo Windows builder. Even that is a bit steep as I want to know how to build these from the ground up and using the drag and drop tools resulted in skipping some of the basics that left me hanging, when trying to modify and customize what Jigloo provide.d

What has helped learn this the fastest has been to work out some simple application that will be useful and then laying out a UML diagram including Use Case for the requirements, Activity Diagrams and Class Diagrams. A programmer is lost in the woods without these design tools. Then I've taken one element at a time, the JDBC database access, pulling up a list and displaying this, etc.

I am up to a point where I need to display a simple list of entries from a database. The list should have 3 columns. I've only been able to get this to work with one column so far. So what I am looking into is whether it would be better to just use the JTable or figure out a "multi-column" JList.

Here are some resources to figure this out:

Code Guru
Java Sun
Dream In Code
Jlist Tutorial

One simple method of doing this would be to select your multi-column data from your data source and return this resultset to a String array.

Refresh a JTable from a JDBC connection

I have a Swing application that uses a JTable and JScrollPane to display data from an Access database via a JDBC connection.

Even though I had properly coded the SQL INSERT statement and the data was loading into my database, the JTable was not refreshing with the new record. I tried the .revalidate() and the .repaint() methods to no avail.

Finally, after reading Ben Hendry's Mastering JTable I was able to sort this out.

In my JButton ActionListener I added the TableModel.addRow method and inserted the newly inserted records into the JTable as well. Followed by a .revalidate() and the data properly displayed without even a flicker in my application.

Moral of the story - always use Model-View-Control (MVC) architecture and separate your view from your data and controls.

Saturday, August 04, 2007

Switzerland Language Assignment by Postal Code

Map of Switzerland showing language areas.

Language assignment by postal code:

Postal codeAreaLanguage
1000 LausanneFrench
2000NeuchatelFrench and German
3000BernFrench and German
9000St. GallenGerman

Thursday, July 26, 2007

Language Assignment By Postal Code

Every so often I run across a need to assign languages by postal code. Currently I need to assign languages by the Belgium postal codes. The following table shows the prominent language for each postal code of Belgium:

Capital LanguagePostal codes
Antwerp DUTCH2000-2999
Ghent DUTCH9000-9999
Leuven DUTCH1500-1999, 3000-3499
Mons FRENCH6000-6599, 7000-7999
Liege FRENCH4000-4999
Hasselt DUTCH3500-3999
Arlon FRENCH6600-6900
Namur FRENCH5000-5999
Wavre FRENCH1300-1499
Brugge DUTCH8000-8999

Here is a map showing the language breakdown of Belgium:

Here is a language assignment by postal code for Switzerland.

Monday, July 23, 2007

Salutations Used in Direct Mail

I am working on a direct mail project that requires personalized letters addressed to all manner of people from around the world - from members of Congress to Doctors, lawyers, members of parliament, etc. and these letters are being translated into multiple languages. I am searching for an already existing table of translated titles of address or salutations.

The following pages include fairly complete lists of salutations but I am hunting around for more complete and then translated tables like this. If anyone out there knows of such a table, please let me know.

Table with Forms of Address
Official Forms of Address

Sunday, June 17, 2007

Char_Instr - Another Matching Algorithm

I've using Soundex, Double Metaphone, Levenshtein Distance and other string matching algorithms, but somethings it is the simple things that work best.

Here is a function I wrote in PL/SQL utilizing the INSTR() function to traverse through one string, checking each character as to whether or not it exists in the another string.

This has been useful in checking for duplicates in mailing lists as well as identifying potential duplicates that should be reviewed. The function returns the number of characters from the second string that are the same as the first string.

I've found the use of the Soundex to be too loose and the Levenshtein Distance algorithm to be lacking in its handling of transposed characters. Use of this function will give you a count of how many chars are the same regardless of position in the string.

Function: char_instr (str1, str2)

Purpose: To check how many characters from one string

exist in another string. The return value is the number
of characters that exist in the in both strings.

FUNCTION char_instr
v_exists NUMBER := 0;
v_occurs NUMBER := 0;
v_length NUMBER;
Get the length of the str2
v_length := LENGTH(str2);
Loop through str2 and check if char exists in str1
IF( v_length > 0 ) THEN
FOR x IN 1..v_length LOOP
v_exists := INSTR(UPPER(str1), SUBSTR(UPPER(str2), x, 1));
IF(v_exists>0) THEN
v_occurs := v_occurs + 1;
RETURN v_occurs;
END char_instr;

Thursday, June 14, 2007

SQL Cheat Sheet

I've put together this page to give quick links to all the Oracle SQL tricks that I find myself continuing to search for on the net. This is a quick reference guide and will be expanded regularly.

Oracle SQL Cheat Sheet

SQL Introduction and Definitions


Create a Delimited Output in SQL Plus

How to Rename a Column in Oracle SQL

How to Delete a Column from a Table

Use of Variables in SQL Plus or PL/SQL

Format Query Results to Display Currency

Display Seconds Past Midnight

Select Only Top Results in a Query

Use of Wildcards in Oracle SQL

What is NVL and how does it work?

How to Use Decode in Oracle SQL

How to Round Off a Number in Oracle SQL

Debug JDBC Connectivity with Oracle Thin Client

Oracle String Matching Algorithm - CHAR_INSTR

If you have any other tips or suggested entries, let me know!

Saturday, June 02, 2007

Nice Bike Tom!

Check out these shots of Tom Cruise's new bike. I hear it has a Ducati engine and was custom built for him, sized to fit.

Tuesday, May 29, 2007

Standard Mail $.02-$.03 rate reduction proposed by PRC


This morning I received the following pleasant news update from the DMA regarding a temporary rate reduction proposed on all Standard USPS mail:

The Postal Regulatory Commission May 25 offered interim rate relief for Standard Regular flats and catalog mailers in response to the U.S. Postal Service Governors’ request to reconsider the recommended rates in March.

The PRC’s decision grants a temporary rate reduction of 3 cents for all Standard Mail regular flats and 2 cents for Standard Regular nonprofit flats. The temporary transitional rate relief for Standard Regular flats mailers would end Sept. 29.

click to read full story...

They would not require any software change be made but simply a blanket 3 cents per Standard Mail Flat and 2 cents per Non-Profit Standard Mail Flat to be deducted from the overall postage statement.

Mailers concerned should contact their USPS postal reps and support this PRC recommendation.

Thursday, May 24, 2007

How to Round Up in Oracle SQL

Recently I needed to provide a report that had numbers rounded to the nearest thousand. I found the ROUND command but this appeared to be only rounding off decimal points to a specified number of spaces. The syntax for this is:

SELECT ROUND(number, decimals)
FROM table;

If you use 2 for the decimals your result would look like this:
Result: 2.12
Result: 2.13

However, I needed something that would round to the nearest thousandth. So I tried a negative number for the decimals like this and found it did exactly what I wanted:
Result: 10000
Result: 11000

While this is a basic Oracle SQL function, it is not clear that you can do this, hence this post.

Thursday, May 17, 2007

USPS Rate Change - What do you think?

May 14th, 2006 will go down in history as the most confusing day in mailers history. Unfortunately many people did not install the new presorting software on development machines and get them fully tested before the big date and tried to frantically switch over on May 14th. The tech support team at Firstlogic was overwhelmed with calls - specifically in their Presort support dept. Complaints regarding the post office rejecting their mail, confusing new requirements and general upset due to the increase in postage as a result of the new rate case. Mailers are experiencing anywhere upwards of $.10 per piece increase - the highest increase in standard mail that I can remember.

I saw a post on the DM News website that the USPS claimed the new rate case implementation was going smooth - however it is quite different on the ground in speaking to mailers and software support technicians.

Per the below excerpt from DM News page, it looks as though the Postal Regulation Committee is considering reverting the Standard Flat Mail rate change due to the "rate shock" occurring:

The Governors of the USPS asked that the rates for this category of mail be reconsidered because the price increases recommended by the PRC may impose an unnecessary degree of “rate shock” on the catalog industry and small businesses particularly. The recommended increase for some catalog mailers is as much as 40 percent, which is more than double what the Postal Service had proposed.

Here is a poll to find out your opinion of the new rate case. Please vote and let me know your comments on this.

Tuesday, April 24, 2007

Is anyone teaching us Morals anymore?

Back in the day, religion had its place in society of instilling moral values in our youth. With the separation of Church and State and the resulting exclusion of any religious material from the classrooms, a void was unintentionally forged - or should I say a chasm.

In the last few years, we've heard more horrific stories of youth violence and their apparent lack of any moral sense. Look at the Columbine shootings or more recently the Virginia Tech rampage. Cho certainly had no sense of morals. Maybe this was the result of the psychiatric - mind-altering drugs he was on. He certainly would have been schooled in the generation where morals were stripped from our classrooms.

Who is teaching our children morals these days anyway?

There is a organization paving the way towards getting Moral values back into society. The Way to Happiness is an organization that promotes "Moral Values for a Modern World. This is a non-religious moral code that anyone can follow.

They have recently launched a campaign of Public Service Announcements that promote this moral code in terms anyone can understand.

Here is a link to a video where you can watch one of these PSAs promoting one of these precepts, Be Competent.

Tuesday, April 03, 2007

Why Global Data Integrity is Important

Why Global Data Integrity is Important

Saturday, March 24, 2007

Various Matching Algorithims

Implement Phonetic ("Sounds-like") Name Searches with Double Metaphone Part VI: Other Methods & Additional Resources - The Code Project - String

I have only experimented with Levenshtein Distance and Soundex algorithms. This page gives a free download of the Double Metaphone algorithm implementations as well as several other samples that are very useful.

Monday, March 19, 2007

Oracle Thin Client JDBC

I've had problems setting connections to an Oracle DB using the thin client JDBC. This post is to track how this was solved.

First you must have the oracle.jar file in your build path.

Next I used the following for the actual connection:

public static Connection getConnection() {

Connection con = null;
} catch (Exception e) {
System.out.println("Failed to load driver.");
return con;
try {
con = DriverManger.getConnection(
} catch (SQLException ee) {
return con;

I also had imported java.sql.*;

The thing that was causing a problem with this before was that I did not have the host name, port or SID listed out. This was easily located in the tnsnames.ora file in the oracle installation folder.

Wednesday, March 14, 2007

Kelly Preston, Kirstie Alley and Priscilla Presley at CCHR Awards Banquet

Citizens Commission on Human Rights acknowledged leaders in the field of human rights last month and awards were presented by Kelly Preston (John Travolta's wife), Priscilla Presley (Elvis Presley's wife) along with Kirstie Alley and Marisol Nichols.

Celebrities present human rights awards to individuals who risked their careers to warn the public about psychiatric drug risks, and urge parents to get informed.

LOS ANGELES: On Saturday, February 17th, at the annual awards banquet of the psychiatric watchdog Citizens Commission on Human Rights (CCHR), actors Kirstie Alley, Kelly Preston, Priscilla Presley, Marisol Nichols (24) and Anne Archer (Patriot Games, Fatal Attraction) awarded mental health industry whistleblowers who have risked their professional careers to warn the public about the dangers and fraudulent marketing of psychiatric drugs. Exemplifying the effort to inform parents and consumers, former pharmaceutical sales representative Gwen Olsen, and former Pennsylvania government investigator Allen Jones, were presented with CCHR’s annual Human Rights Awards for their courageous contributions to mental health reform.

Like Russell Crowe’s character in the film The Insider, Allen Jones reluctantly stepped forward as a whistleblower. Kirstie Alley presented Jones with his Human Rights Award for exposing several pharmaceutical companies who bribed Texas government officials to implement mental health treatment laws requiring the use of their brand name drugs.

Kelly Preston, Kirstie Alley and Priscilla Presley Colin and Marisol Friedberg with Anne Archer

Kelly Preston, Kirstie Alley and Priscilla Presley Colin and Marisol Friedberg with Anne Archer
(Click to enlarge)

Recently, a four-year-old Massachusetts girl died from a deadly cocktail of psychiatric drugs, calling into question the psychiatric practice of drugging small children with powerful anti-psychotic drugs. Last week, the story ran on the front page of The New York Times and was featured on Fox National News, drawing national attention to the culpability of the prescribing psychiatrist, and the growing controversy over psychiatrists drugging children and toddlers with powerful drugs.

Magnetic Card Reader Parsing of Track Data

I've been working on developing an application that will read a driver's license and parse the information and populate a data entry screen. This is useful for rapidly collecting name and address information when you have drivers licenses available.

To work out the exact format of drivers license magnetic strips, the following links were very helpful:

Drivers license format
Magnetic Strip Encoding Standards
Mag Stripe Cards

Java Magnetic Card Parser
Exeeba AAMVA Drivers License Format
CA DL Format

Many states do not conform to the above standards but generally these will work.

Wednesday, February 28, 2007

Java Case or Switch Example

How do you use the CASE function in Java?

The Switch or Case function in Java is useful for branching off to different actions based on a conditional expression. For example if you have a series of categories that can be selected and depending on which category was selected, you want the program to do a different action, the Case function is the best way to do this.

Here is the syntax

    switch( expression ) {
case value1:
case value2:
case value3:
case valuen:
statement n;

A real example of this from the Sun website:

int month = 8;
switch (month) {
case 1: System.out.println("January"); break;
case 2: System.out.println("February"); break;
case 3: System.out.println("March"); break;
case 4: System.out.println("April"); break;
case 5: System.out.println("May"); break;
case 6: System.out.println("June"); break;
case 7: System.out.println("July"); break;
case 8: System.out.println("August"); break;
case 9: System.out.println("September"); break;
case 10: System.out.println("October"); break;
case 11: System.out.println("November"); break;
case 12: System.out.println("December"); break;
default: System.out.println("Invalid month.");break;

Tuesday, February 27, 2007

String Parsing in Java

In Awk you can easily find the position of a delimiter and then take all characters up to that point by using the index() and substr() functions. But how does that work in Java? Is there an easy way to parse a String and carry out these type of routine actions?

Here you go:

subtring method:

int start = 1;
int end = 4;
String substr = "aString".substring(start, end); // Str
index method:
String string = "madam, i am Adam";
// First occurrence of a c
int index = string.indexOf('a'); // 1
Here is a live example I am working with where I need to parse the String into two variables as my webpage is passing me a parameter that is pipe-delimited:

String str = "YES|12345";

int pipe = str.indexOf('|'); //4

String firstPart = str.substring(1,pipe-1);
String lastPart = str.substring(pipe+1);


The DECODE function in Oracle SQL is used for if-then-else statements. A very handy tool for powerful queries.

The syntax for the DECODE function is as follows:

DECODE(value, if1, then1, if2, then2, if3, then3, etc., else)

Here is a good example of this from the website:

Consider the following example in which the Viewable column can have values of 1, 2, and 3:

SELECT FirstName, Viewable
FROM employee;


John 1
Tim 2
Julie 2
Stacy 1
Rahul 3
Leena 4
Amy 1
Bill 3
Teri 3

Now, we can use Decode to display different things in a report based on the values in Viewable.

SELECT Firstname,
Decode(Viewable, 1,'VISIBLE', 2,'INVISIBLE',
FROM employee;



Decode checks the column values and interprets the provided values in pairs. This is how it works:

Switch viewable:

Case 1:
Result = VISIBLE
Case 2:
Case 3:
Result = UNKNOWN
Result = OTHER
End Case


If Viewable = 1 Then
Result = VISIBLE
Elsif Viewable = 2 Then
Elsif Viewable = 3 Then
Result = UNKNOWN
Result = OTHER
End If

One can use Decode in Updates as well. For example, instead of writing 50 different updates to change state names, we can do it in one Update using Decode as follows:

Update employee
set homestate = decode(homestate,
'AL', 'Alabama',
'AK', 'Alaska',
'AS', 'American Samoa',
'AZ', 'Arizona',
. . . . . . . .
. . . . . . . .
'WA', 'Washington',
'WV', 'West Virginia',
'WI', 'Wisconsin',
'WY', 'Wyoming' , homestate)
where homecountry = 'UNITED STATES'

This will replace state abbreviations with state names. And it leaves the abbreviations alone in case it doesn't match any of the values, such as AL, AK, etc.

Most Oracle SQL users do not realize the power of this function. Hopefully this post will enlighten some on it.

Sunday, February 25, 2007

Scientology Edinburgh

A new website was just put up for the Church of Scientology in Scotland. The Volunteer Ministers are really needed there right now with the recent train wreck.

Tuesday, February 20, 2007

Google Directory - Computers > Data Formats > Markup Languages > HTML > Tutorials

I figured its about time I learned HTML. Here is a great starting point with a plethora of tutorials, examples, etc.

Google Directory - Computers > Data Formats > Markup Languages > HTML > Tutorials

Submit Links Directory

Wednesday, February 14, 2007

Moving Data around HTML pages with JSP

Ok, I admit I never programmed a web site before. But I've recently realized the potential in using an internal website to handle administrative tasks such as tracking jobs, etc. Previously I tried doing such with a Swing based approach as this is what I was familiar with, but after consulting with a fellow programmer I chose to set up a webpage with hibernate on top of a MySql db. So far so good, however, I am looking for a crash course in HTML and JSP to get me going.

An excellent guide for this is on IBMs site under JSP Best practices. I am going through this and hopefully it will answer all my questions...

Event Viewer not showing up

What to do if the alerts in your Event Viewer do not show up? Try the refresh button in the Event Viewer. Usually this means there are too many events logged and your Event Viewer is jammed.

This happened to me after installing some new software that bunged up my machine.
Events were being logged like every minute or so and it shortly was overstuffed. I searched the net and found this suggestion in a TechRepublic forum and it worked. After refreshing I could see the events, delete the log if needed and carry on.

Friday, February 09, 2007

Java isDate Date Checker

Here is an earlier post, showing what what proposed as one solution to the Java isDate() problem:

Below is a method that simulates the Visual Basic isDate() method and is configurable so you can set what you consider are the valid date formats.
I modifed this and added the "Jan|Feb|Mar|Apr..." and "JAN|FEB|MAR|APR|..." to the regex so that the method now validates the formats I needed to check.
I am now implementing this in my application.
See this code in my earlier blog post.

However, since that time, I worked out the below which has worked wonders and I am posting it here for any others who may need such a simple tool as to convert a text string to a Date object or check if a string is a valid Date, etc. Let me know if anyone found this useful:

First load the different type of date formats that you consider valid formats:
public static ArrayList loadFormats() {

ArrayList dateFormats = new ArrayList();
dateFormats.add(new SimpleDateFormat("yyyy-MM-dd"));
dateFormats.add(new SimpleDateFormat("yyyy MM dd"));
dateFormats.add(new SimpleDateFormat("yyyy.MM.dd"));
dateFormats.add(new SimpleDateFormat("yyyy-MMM-dd"));
dateFormats.add(new SimpleDateFormat("yyyy MMM dd"));
dateFormats.add(new SimpleDateFormat("yyyy.MMM.dd"));
dateFormats.add(new SimpleDateFormat("dd-MM-yyyy"));
dateFormats.add(new SimpleDateFormat("dd MM yyyy"));
dateFormats.add(new SimpleDateFormat("dd.MM.yyyy"));
dateFormats.add(new SimpleDateFormat("dd/MM/yyyy"));
dateFormats.add(new SimpleDateFormat("dd-MM-yy"));
dateFormats.add(new SimpleDateFormat("dd MM yy"));
dateFormats.add(new SimpleDateFormat("dd.MM.yy"));
dateFormats.add(new SimpleDateFormat("dd/MM/yy"));
dateFormats.add(new SimpleDateFormat("dd-MMM-yy"));
dateFormats.add(new SimpleDateFormat("dd MMM yy"));
dateFormats.add(new SimpleDateFormat("dd.MMM.yy"));
dateFormats.add(new SimpleDateFormat("dd/MMM/yy"));
dateFormats.add(new SimpleDateFormat("dd-MMM-yyyy"));
dateFormats.add(new SimpleDateFormat("dd MMM yyyy"));
dateFormats.add(new SimpleDateFormat("dd.MMM.yyyy"));
dateFormats.add(new SimpleDateFormat("dd/MMM/yyyy"));
dateFormats.add(new SimpleDateFormat("yyyy-MM-dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy MM dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy.MM.dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy/MM/dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy-MMM-dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy MMM dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy.MMM.dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("yyyy/MMM/dd hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd-MM-yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd MM yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd.MM.yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd/MM/yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd-MMM-yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd MMM yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd.MMM.yyyy hh:mm:ss"));
dateFormats.add(new SimpleDateFormat("dd/MMM/yyyy hh:mm:ss"));

return dateFormats;

Next create your isDate method:
public static boolean isDate(String date) {

// Load formats
ArrayList dateFormats = loadFormats();

boolean validDate = false;
Date myDate = null;

Object[] myDateFormats = dateFormats.toArray();
for (int i = 0; i
SimpleDateFormat myFormat = (SimpleDateFormat) myDateFormats[i];
try {
myDate = myFormat.parse(date);
validDate = true;
catch (Exception e) {
validDate = false;
return validDate;
Here is a converter from String to Date to ensures it is a valid format:
public static Date convertStringToDate(String date) {

// Load formats
ArrayList dateFormats = loadFormats();

boolean validDate = false;
Date myDate = null;

Object[] myDateFormats = dateFormats.toArray();
for (int i = 0; i < myformat =" (SimpleDateFormat)" mydate =" myFormat.parse(date);" validdate =" true;" validdate =" false;">
There you go!

Java isDate() Function Anyone?

From my earlier "pre-category" blog, this post was written when deep into the lack of a Java isDate() function:

New to Java, but learning fast and definitely experiencing the "Paradigm Shift", however I find it odd that there is no isDate() function like there is in Visual Basic.

As mentioned in an earlier post, I ran into a Java Date Conversion issue where the date format that I was expecting and turned out in the live data tests, to be a different format. This was after building over 500 test cases that test each of the business rules programmed into a Drools Rules (now Jboss) processor.

So obviously, I want to have all possible date formats accounted for so the solution is to have a function that loops through each of the possible date formats and check if the string parses into a valid Date object using that particular SimpleDateFormat.

I know this can be done using try-catch blocks, but I want to create a more elegant approach calling my custom isDate() function.

Checking aroung Javalobby and other places to see if anyone has already solved this, or if I need to create from scratch.

Jigloo Eclipse Windows Builder?

Building GUIs for rich client apps in Eclipse is not the easiest thing in the world. I'm looking for windows builders comparable to the Net Beans Matisse GUI builder.

I've downloaded the Jigloo GUI builder which sounds pretty good but that is coming from company who made it.

What about WindowsBuilderPro? Both Jigloo and WindowsBuilder are commercial products that offer free trial editions. I've generally heard that WindowsBuilderPro is the better product.

I'll try both and see for myself...

On the install, Jigloo is very simple and is just like any other Eclipse plugins. Just copy to the features and plugins directories and your done. WindowBuilder on the other hand has you walk through a install wizard which took only a couple minutes to run through. (Make sure you don't have Eclipse running during the install).

WindowsBuilder has several example apps you can try. In 5 minutes of testing this builder I found it very user friendly. I liked the "quick view" feature where without even compiling you can get a feel for how your app will look during runtime. Now on to Jigloo.

After the few minutes I found it easy to create a basic app. Nothing difficult. I liked the two screen (code/app) view that shows the two-way changes so if you edit the GUI the code changes or you edit the code and the GUI changes.

So far they are both comparable products and I'll spend a bit more time using each on an actual application to see which is the better product. ...

After spending several hours using Jigloo and writing over 1,000 lines of Java using GUI parts from Jigloo I've come to the conclusion that the best use of a GUI builder is for the initial creation of your form and widgets. The size, arrangement of widgets and placement of your GUI can be very much sped up using Jigloo.

However, once you understand what a GUI builder is doing and have your basic form created, continued use of any GUI builder is more annoying than anything. I'd recommend using one to get started, get your basic form created and then drop kick it for any later modifications or additions.

Star UML Design Tool

Ok, I downloaded the Star UML Design tool. Looks like an open source UML tool. Sounds promising. I'll give it a review in a bit.

As far as basic modeling tools go, this is the most easy to work with, simple to understand and fastest startup time. I've used this to generate a myriad of Use Case Diagrams, Activity Diagrams and Sequence Diagrams. Not reviewed yet is Star UML's Code Generation capabilities which look pretty straightforward.

One thing a bit confusing to a UML/MDA beginner is the "New Project By Approach" window where you have to select which UML approach you want to use. This basically means how the different types of diagrams will be organized. You have the option of selecting the 4+1 View Model, the Rational Approach, the UML Components Approach or the Default Approach.

I recommend just using the Default Approach unless you are familiar other standard UML approaches.

Eclipse Visual Editor vs. Jigloo

Ok, I've been using the Eclipse Visual Editor for all of 15 minutes and I am sold. Up until now, I've been using Cloudgarden's Jigloo free version, yet after getting constantly annoyed by Jigloo's constant modifications to my code, prevention of using builders and the incessant warnings about how you are using the free version and that this can not be used for commercial use, I had it.

It is much easier to build with the Model View Control (MVC) architecture using the VE tool as opposed to the Jigloo builder.

Here are some comparisons and reviews of these GUI builder tools:

Jigloo Review
Comparison of Java GUI Builders
Comparison with Matisse
IBM Giving Away VE Source Code to Eclipse

Refresh a JTable from a JDBC connection

I have a Swing application that uses a JTable and JScrollPane to display data from an Access database via a JDBC connection.

Even though I had properly coded the SQL INSERT statement and the data was loading into my database, the JTable was not refreshing with the new record. I tried the .revalidate() and the .repaint() methods to no avail.

Finally, after reading Ben Hendry's Mastering JTable I was able to sort this out.

In my JButton ActionListener I added the TableModel.addRow method and inserted the newly inserted records into the JTable as well. Followed by a .revalidate() and the data properly displayed without even a flicker in my application.

Moral of the story - always use Model-View-Control (MVC) architecture and separate your view from your data and controls.

Tom Cruise, Baby Suri, Photos Released

As you can see, Suri is absolutely a beautiful child. Vanity Fair has a major feature on Tom Cruise, Katie Holmes and Suri Cruise. This article dispels the rumors that have been generated by the media over the last several months.

Marilyn Monroe - Death By Psychiatry?

On the CCHR website there is a section devoted to artists. As I am an artist this was struck home to me. Artists and successful people are often found surrounded by "professionals" in the field of mental health.

I've heard many stories about what happened to Marilyn Monroe but this, from CCHR's website makes the most sense. Here is an excerpt:

"In 1960, Monroe saw psychiatrist Ralph Greenson, whose control over her was swift, severing all her close relationships. By 1962, she realized—too late—that she must "disconnect from Greenson." After spending six hours with him, she was found dead of a drug overdose. In the seven years prior to psychiatry’s influence, Monroe had made 23 movies. In the seven years of her psychiatric "care," she only made six films."
Check out the Citizen's Commission on Human Rights for more information on other famous artists whose lives have been destroyed by Psychiatry in the name of help.

Wednesday, February 07, 2007

Probabilistic Matching

Probabilistic matching uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision.

Probabilistic matching enables one to use match scores and percentages on a field by field comparison to determine a match. There are three categories output from probabilistic matching and are set by the user based on the overall probability percentage:

1) Match - that can be automatically merged
2) Candidate Match - requiring manual review
3) Non Match

Artificial Intelligence and Machine Learning utilize probabilistic matching.

Tuesday, February 06, 2007

Service Oriented Architecture (SOA)

A service-oriented architecture is a collection of services that communicate with each other. The services are self-contained and do not depend on the context or state of the other service. They work within a distributed systems architecture.

Record Linkage

Record linkage is the act of detecting and linking duplicate records within a database. It is also called match/merge, deduplication, match/consolidate, etc.

Some record linkage software tools include:
Febrl (Freely Extensible Biomedical Record Linkage) - Open Source
MatchIT - Commercial
Business Objects - Commercial
Group1 - Commercial

Fuzzy Matching Logic

A method of matching where algorithms are used to compare character strings and detect recurring patterns within these strings. Common algorithms are Soundex, Levenshtein Distance and Double Metaphone which find similarities based on patterns and similarities. It is called a "fuzzy match" because the match is not exact, like if you were looking at the two records out of focus, they would look the same, but would be a bit fuzzy.

Enterprise Resource Planning (ERP)

Software that integrates departments and functions across a company into one computer system. ERP runs off a single database, enabling various departments to share information and communicate with each other. It helps a manufacturer or other business manage the important parts of its business, including product planning, parts purchasing, maintaining inventories, interacting with suppliers, providing customer service, and tracking orders.

Customer Data Integration (CDI)

Customer Data Integration (CDI) is the consolidation and management of customer data from all available sources. This customer data typically includes contact details, value data, and other information collected across multiple channels and through various interactions. The purpose of CDI is to keep source data systems and the central database in synch with each other to provide an accurate central view of each customer

Customer Relationship Management (CRM)

The functions and programs a company uses to connect with its customers - typically divided into Operational CRM (call centers, sales force automation, supply chain management) and Analytic CRM (customer analysis, database marketing). It allows an individual approach to marketing to and servicing customers.

Sunday, February 04, 2007

CDI - Glossary of Terms

As I've mentioned before, a primary barrier to studying any subject is the misunderstood word. Per Hubbard's study technology site:

"To clear a word, one looks it up in a good dictionary.

"The first step is to look rapidly over the definitions to find the one which applies to the context in which the word was misunderstood. One reads the definition and uses it in sentences until one has a clear concept of that meaning of the word. This could require ten or more sentences.

"Then one clears each of the other definitions of that word, using each in sentences until one has a conceptual understanding of each definition.

"The next thing to do is to clear the derivation – which is the explanation of where the word came from originally. This will help gain a basic understanding of the word." ...
- L. Ron Hubbard
Therefore I've started putting together simple to understand definitions for the most common terms in the field of Customer Data Integration.
Customer Relationship Management (CRM)

Customer Data Integration (CDI)
Enterprise Resource Planning (ERP)
Fuzzy Matching Logic
Probabilistic Matching Logic
Record Linkage

Service Oriented Architecture (SOA)

Monday, January 29, 2007

Record Linkage Graphical User Interfaces

Manual handling of duplicates in a database can be quite time consuming. It is important to find the right tool to help you speed this process. Or if you are building your own record linkage tool, it is always good to see what is out there, how GUIs are laid out and what the common features are.

The Link King:

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The data linking neophyte will appreciate the easy-to-follow instructions. The Link King's artificial intelligence will assist in the selection of the most appropriate linkage/unduplication protocol.

Linkage Wiz:

is a powerful data matching and record de-duplication software program used by businesses, government agencies, hospitals and other organisations in the USA, Canada, UK, Australia and France. It makes it easy to link records across multiple databases and to identify and remove duplicate records within databases
TAILOR is extensible, and hence any proposed searching method, comparison function, decision model, or measurement tool can be easily plugged into the system. We have proposed three machine learning record linkage models that raise the limitations of the existing record linkage models. Our extensive experimental study, using both synthetic and real data, shows that the machine learning record linkage models outperform the probabilistic record linkage model with respect to the accuracy and the completeness metrics, the probabilistic record linkage model identifies a lesser percentage of possibly matched record pairs, both the clustering and the hybrid record linkage models are very useful, especially in the case of real applications where training sets are not available or are very expensive to obtain, and Jaro's algorithm performs better than the other comparison functions.

The following three screen snapshots are the basic screens of TAILOR graphical user interface. The first screen allows the user to either generate a synthetic experiment using DBGen, perform a real experiment on a database, or repeat a previous experiment knowing its data files. The user then uses the second screen in order to select a searching method and a comparison function and tune their required parameters. Finally, the third screen allows the user to select the decision model he would like to apply and outputs the values of the measures if the experiment is on synthetic data.

Download PDF for screenshots of TAILOR here.

MatchIT -
matchIT incorporates our proprietary matching algorithms to ensure phonetic, miskeyed and abbreviated variations of data are detected. Results can be verified using comprehensive data auditing functions, drilling down to suspect data, identifying data anomalies, and filtering garbage and salacious words.

Fuzzy Dupes 2007

Did you know that your contact database typically contains 3-10% duplicates ?

These duplicate records result in unnecessary costs when sending out printed catalogs, are aggravating to your customers, create problems in the controlling, etc. With classical methods you have no possibility to locate these duplicates in your database.

Sunday, January 28, 2007

Machine Learning and the Hidden Markov Models (HMM)

What is a Hidden Markov Model? (from

A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. A HMM can be considered as the simplest dynamic Bayesian network.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.

Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition and bioinformatics.

 State transitions in a hidden Markov model (example) x — hidden states y — observable outputs a — transition probabilities b — output probabilities
State transitions in a hidden Markov model (example)
x — hidden states
y — observable outputs
a — transition probabilities
b — output probabilities


Machine Learning Links

Record Linkage and List Quality

What is Record Linkage:

Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining have proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach are assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.

Other names that mean the same thing: entity heterogeneity, entity identification, object isomerism, instance identification, merge/purge, entity reconciliation,
list washing, match/consolidate and data cleaning. I like the term "record linkage" and will refer to it as such from this point forward in this blog.

Seems clear that if you want to be thorough in your record linkage efforts you would implement a combination of deterministic and probabilistic matching methodologies. Using a straightforward name and address match such as the Firstlogic (now Business Objects) approach will usually be sufficient if you are a list vendor or mailhouse. But if you are at all serious about identity management you will step into the deep end and implement a probabilistic matching method. I've downloaded the Ferbl open source probablistic matching tool, but have yet to experiment with it.

But whatever method you use or software package you buy, the quality of your record linkage always ends up in how good you have configured your rules. This is not an off-the-shelf solution - it requires work.

I bet most organizations "record linkage problems" could have been avoided if enough forsight and initiative had been put on the original database systems, in establishing that unique key.

Well, I guess hindsight is in fact - 20/20.

One product that definitely contributes to an increased percentage of record linkage is the SSA-NAME3 product. As you'll see from their site, they have developed this name search tool which uses probabilistic matching but factors in how common or uncommon the name is. For example matching two Jose Garcia's in the city of Los Angeles is not the same as matching two Jose Garcia's in Iceland. They are probably the same person in Iceland and most likely not, in LA.

What is a Data Steward?

Data Steward defined:

The person responsible for a data standard. In this role, a Data Steward is charged by his/her Management to develop and maintain the data standard and to counsel Service personnel on the proper use of the data. He/she must: have a thorough knowledge of the subject matter of the standard, provide accurate and current electronic copies of data relevant to the standard, and weigh the pros and cons of comments received during review of the standard. He/she is authorized to defend or modify the standard as necessary in order to ensure its proper use.
Claudio Imhoff's definition and description of duties:
Steward - from Old English for "keeper of the sty", a sty ward.

Data Steward - Person responsible for managing the data in a corporation in terms of integrated, consistent definitions, structures, calculations, derivations, and so on.

Corporations are demanding better and better sources of data. The explosive growth of data warehousing and sophistication of the access tools are proof that data is one of the most critical assets any company possesses. Data, in the form of information, must be delivered to decision-makers quickly, concisely and more importantly, accurately.

The data warehouse is an excellent mechanism for getting information into the hands of decision-makers. However, it is only as good as the data that goes into it. Problems occur when we attempt to acquire and deliver this information. A major effort must be made in defining, integrating and synchronizing the data coming from the myriad operational systems producing data throughout the corporation. Who should be responsible for this important task? The answer for a growing number of companies is a new business function called Data Stewardship.

What is Data Stewardship?

Data Stewardship has, as its main objective, the management of the corporation's data assets in order to improve their reusability, accessibility, and quality. It is the Data Stewards' responsibility to approve business naming standards, develop consistent data definitions, determine data aliases, develop standard calculations and derivations, document the business rules of the corporation, monitor the quality of the data in the data warehouse, define security requirements, and so forth (see Table 1 for a list of the data integration issues determined by Data Stewards).

This data about data, or meta data, developed by Data Stewards can then be used by the corporation's knowledge workers in their everyday analyses to determine what comparisons should be made, which trends are significant, that apples have indeed been compared to apples, etc.

Just as the demand for a data warehouse with good data has grown, the need for a Data Stewardship function has likewise grown. More and more companies are recognizing the critical role this function serves in the overall quest for high quality, available data. Such an integrated, corporate-wide view of the data provides the foundation for the shared data so critical in the data warehouse. ...

Data Stewards are responsible for the following:
* Standard Business Naming Standards
* Standard Entity Definitions
* Standard Attribute Definitions
* Business Rules Specification
* Standard Calculation and Summarization Definitions
* Entity and Attribute Aliases
* Data Quality Analyses
* Sources of Data for the Data Warehouse
* Data Security Specification
* Data Retention Criteria
Crash Course on Data Stewardship:

Data Stewardship programs are implemented to reduce information technology costs and improve the value companies gain from their data assets. Stewardship programs focus on improving data quality, reducing data duplication, formalizing accountability for data, and improving business and IT productivity. An effective Data Stewardship program will rapidly improve the ROI from data warehousing and business intelligence efforts

Data Governance - IT or Business?

"Where does Data Governance fit into the Organization?"

This question appears to have two answers. "In Business" or "In IT". The answer I get most often when I ask the question is ... "In Business". Or "Business should own Data Governance".

I wish it was that simple.

What exactly does it mean for Data Governance to fit "In Business"?

This article seemed to take up the questions posed in my last post. The last paragraph sums it up and answers this question.

The best answer to the question "Where does Data Governance fit into the Organization?" is "It doesn't matter". Data Governance can be successful when managed by a business area or by an IT area.

The decision of who will manage the Data Governance program can be very important to the success of the program. However, it will not necessarily make or break a well-defined Data Governance program's likelihood of success. As long as the business areas and IT areas coordinate their efforts, use a Data Governance Council as a strategic resource, cooperate in strategic data management activities, and act in the best interests of the organization (data-wise), the placement of the management of the Data Governance program is not nearly the most important question that needs to be answered.

Wednesday, January 24, 2007

IT Departments Role in CDI

Introducing Customer Data Integration into an organization can be a bit daunting. A recommendation to anyone doing this is Jill Dyche / Evan Levy's book: Customer Data Integration which definitely builds the needed foundation for anyone's understanding of this subject. However, there are still aspects that need to be understood and certain functions clearly delineated.

What is the role of IT in CDI (Customer Data Integration)?

First a definition of IT:

information technology
n. Abbr. IT
The development, installation, and implementation of computer systems and applications.
"IT." The American Heritage® Dictionary of the English Language, Fourth Edition. Houghton Mifflin Company, 2004. 24 Jan. 2007.>
The development, installation and implementation of the CDI system is the responsibility of IT - Not the day-to-day upkeep of the data. As a Data Steward you must have the needed access to the data in order to uphold your responsibilities.

Tight coordination is needed between the Data Stewards and the IT Dept when the CDI project is in its early stages. Clear definitions should be established for the roles needed, responsibilities assigned and the needed functionality and access levels built into the CDI project.

Getting this established and clearly delineated at the onset is critical to a successful CDI implementation and will increase the mileage and speed of progress in achieving your goals with CDI.

Friday, January 19, 2007

Postalsoft Mailing Software: Compare Solutions

Here is a good comparison of the Firstlogic mailing products. I thought I had the professional suite but in looking at these it seems I am only getting the Business Edition options. I better call my account rep...

Postalsoft Mailing Software: Compare Solutions

Thursday, January 18, 2007

Oracle TO_CHAR function

The oracle TO_CHAR function can display seconds past midnight with the following syntax:

SELECT TO_CHAR(sysdate, 'SSSSS') FROM dual;

Further Oracle SQL/PLSQL topics

Monday, January 15, 2007

Machine Learning and Artificial Intelligence

Machine Learning is a topic that has come up a lot during my CDI (Customer Data Integration) research. Specifically when diving into the domain of Probabilistic Matching. From the Initiate website:

Probabilistic matching

Probabilistic matching uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links than deterministic systems between records that have more complex typographical errors and error patterns.

Typically, probabilistic systems assign a percentage (such as 75 percent) indicating the probability of a match. Because these systems pinpoint variation and nuances to a much finer degree than a deterministic approach, they are better suited for businesses that have complex data systems with multiple databases. Due to the size of these data systems, the potential for duplicates, human error and discrepancies is far greater, making a system designed to establish links between records with complex error patterns much more effective.
Probabilistic matching enables one to use match scores and percentages on a field by field comparison to determine a match. There are three categories output from probabilistic matching and are set by the user based on the overall probability percentage:

1) Match - that can be automatically merged
2) Candidate Match - requiring manual review
3) Non Match

The topic of Machine Learning and artificial intelligence enters in with the manual review process of the candidate matches. The idea is that computer can learn from the users decisions of what was manually determined to be a match or non-match and build these into its future decisions and probability scores.

If an expert system--brilliantly designed, engineered and implemented--cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.
-Oliver G. Selfridge, from The Gardens of Learning.

"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever."
- Oliver G. Selfridge, in AI's Greatest Trends and Controversies

Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness.

(read full article on Machine Learning here)

Companies like Purisma and Siperian offer machine learning techniques built into their software.

A primary goal of machine learning would be to reduce the amount of manual review needed to determine matches and continuously improve the software's ability to accurately detect and consolidate duplicate records.

An open source tool that can be used for probabilistic record matching is Febrl (Freely Extensible Biomedical Record Linkage). Written in Python anyone can download this from and get your feet wet with probabilistic matching.

Tuesday, January 09, 2007

An SQL Introduction and Tutorial

If you need to freshen up your basics on SQL or if you need to train a new employee on a gradient approach to SQL, the following is a suggested training line-up:

First, clear up each of the following terms:


Next, go through this online interactive tutorial. This has the student walk through real examples using an online SQL tool so you can practice what you learn online without having to go to another computer where your database is.

An Interactive SQL Tutorial

In learning any new subject it is extremely important to start with the key words and to have these cleared up first, before carrying on with the material. This is covered in Hubbard's, Study Technology.

Here are good basic definitions for each of these SQL terms:

SQL stands for Structured Query Language and is a computer language that allows you to ask questions (query) or interact with your database in a structured manner.

The SELECT statement indicates that you wish to query and retrieve information from a database. The select_list specifies the type of information (or column names) to retrieve. The keyword ALL or the wildcard character asterisk (*) could be used to signify all columns in the table. Also, the keyword DISTINCT could be used to discard duplicate records and retrieve only the unique records for the specified columns.

The INTO keyword indicates that you are inserting records into another table. For example in Microsoft Access you can create a new table and insert all records from another table by using the following: SELECT * INTO newtablename FROM tablename

The FROM clause is required in any SELECT statement. The FROM clause specifies the specific tables to retrieve data from.

The AS keyword is used in the SELECT or the FROM clause when you want to refer to a table or column as something else. This is very handy if you have long table or column names and you want to use a shorter name to refer to them. This is commonly used when you are joining multiple tables together and do not want to rewrite the full table name each time you are referring to that table. For example:

SELECT a.column, b.column
FROM table_name_a AS a, table_name_b AS b

This is also commonly used to have a nicer display of your results. For example if you have a column called "field17" and you want to display it as "COUNTRY" you can use the following:

FROM table_name

The WHERE clause limits the results to those records (or rows) that meet some particular conditions (optional).

LIKE is another SQL keyword that is used in the WHERE clause. LIKE allows you to search based on a pattern rather than specifying exactly what is desired The syntax for it is as follows: SELECT * FROM table WHERE column LIKE 'abc%'
The wildcard '%' (in Microsoft SQL and Oracle SQL) or '*' (in Access SQL) indicates any character and can be used in any location within the quotes. This will return any records that match 'abc' as the first three letters and then anything else after that such as 'abc123', 'abcdef', etc.

The IN keyword is used in the WHERE clause when you want to select all records that have specific values such as if you wanted all customers that have either blue, green or hazel eyes you would use the following:

FROM customers
WHERE eye_color IN ('blue', 'green', 'hazel')

You can not use wildcards with the IN keyword.

If you want to specify a range of values you can use the BETWEEN keyword. This will give you anything between value a and b values. For example, if you wanted any zip codes that were between '90210' and '91420' you would use the following:

FROM address_list
WHERE zip_code BETWEEN '90210' AND '91420'

The AND keyword is used when you want to add additional criteria to your WHERE clause or as part of the BETWEEN clause.

The OR keyword is used when you want either one criteria or another, but not both. Be very careful when using the OR keyword as if you have AND and OR in the same WHERE clause your result will probably not be what you expect, unless you use parentheses (). For example, if you want all people who with blue or hazel eyes that live in California, you would use the following:

FROM address_list
WHERE (eye_color = 'blue' OR eye_color = 'hazel') AND state = 'CA'

If you don't use the parentheses your query will return people in California, people with blue eyes or people with hazel eyes. So remember your math class where you learned to group parts of the equation to avoid confusion (a * b) + y ...

The GROUP BY clause specifies if you are grouping (or aggregating) any of the columns in your SELECT statement. For example if you want to display a count of all addresses by city you would use the GROUP BY clause to group the results by City. This following example will give you a breakdown by city and sort it by most to least:

FROM table

The HAVING clause specifies the specific conditions to group by (optional). For example you may want to group your results by City but only display those Cities that have more than 10 matches. You would do this by saying "HAVING COUNT(*) > 10". HAVING always comes after the GROUP BY clause.

The ORDER BY clause specifies whether to output the query result in ascending or descending order. This is often used to sort your results in alphabetical or numerical sequence.

The COUNT keyword is used in the SELECT or HAVING clauses. This allows you to give a total count of something. This is also referred to as an "aggregate". This means you are grouping things together and displaying aggregated data on the group instead of listing each individual record. Other aggregate keywords are SUM, MIN, MAX, etc. COUNT gives you the count of records, SUM gives you the SUM of the values in a column, MIN gives you the minimum value, MAX gives you the maximum value. Here is an example of the COUNT usage:

FROM table_name
WHERE last_name = 'SMITH'

This will give you a count of all records with the last name of 'SMITH'. Notice the (*) after the keyword. This means you want to give a count of all records meeting the criteria in your WHERE clause. But you may also see COUNT(1) which also means give a total number of the records. The COUNT keyword is very useful in giving reports or breakdowns of various columns. For example if you want a breakdown of all records you have for each eye_color (if you have this column in your table) you could use the following:

SELECT eye_color, COUNT(1)
FROM table_name
GROUP BY eye_color

Notice the GROUP BY and ORDER BY keywords used as well. This will show you the count by eye_color and will sort it by the the highest number to the lowest.
Here is another simple introduction to SQL.