A developer's guide to Record Linkage, List Hygiene, Database Marketing and Customer Data Integration. Tips and tricks from Java, SQL, Awk and other programming languages that are useful in building an accurate customer database.
Sunday, December 31, 2006
Awk Tab Delimiter (FS="\t")
Tab Delimited: FS="\t"
Comma Delimited: FS="," (Beware of CSV files. See earlier post on this.)
Pipe Delimited: FS="|"
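As a minimal sketch, here is how the tab setting gets used in practice to print the second column of a tab-delimited file (the file name data.tsv is just a placeholder):

awk 'BEGIN { FS = "\t" } { print $2 }' data.tsv

You can also set the separator from the command line with the -F option, e.g. awk -F'\t' '{ print $2 }' data.tsv.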
Friday, December 29, 2006
CDI - Build or Buy?
However, with all that water under the bridge, it would be nice to be able to simply "plug-in" the missing parts to make my CDI solution complete.
I recommend buying a solution outright. Even though the cost may seem overwhelming at first, it will pay off if your organization is doing any volume of mailing, and it will probably pay for itself in postage savings alone.
Here are the key questions that should be asked of a CDI vendor if you are looking to buy:
* Since systems rarely share identifiers, does the solution score and match data taking advantage of all attributes that aid in the matching process, using likelihood-based statistical theory for the highest levels of accuracy? State-of-the-art systems use "probabilistic algorithms" to compare attributes one by one and produce a weighted score for each pair of records. The higher the score, the more likely the records represent the same person. This method improves matching accuracy by using weights that are specific to an organization's data. (A minimal scoring sketch follows this list.)
* Does the vendor provide analysis based on your own real-world data? Premier providers conduct frequency-based file analysis to provide weighting and thresholds specific to a customer's data. By using your own data, a better match is possible than by using generic criteria and arbitrary weights for each attribute. The data can be studied and tuned to recognize that a match on John Smith in the Bronx, NY, contains a more trusted matching value than a match on John Smith in Minnesota.
* Does the provider offer a way to capture and maintain a complete history of changes to attribute values? Such a method of "complete versioning" improves accuracy and has the ability to make the correct decisions on linking data even as it constantly changes due to marriage, divorce, new address, new phone number, etc.
* Does the solution enable you to establish the right accuracy and identification levels for your application and budget? Managing data quality is a business decision, so consider whether you need automated identification of results to ensure faster customer service, for example, or a solution that allows manual review of data to capture fraudulent customer activity, such as duplicate insurance claims.
* Can the solution overcome multiple identifications and data discrepancies across sources? Common obstacles to customer identity include transpositions, misspellings, nicknames, aliases, address inconsistencies and identity misrepresentations, all of which can occur when customer data flows into the company through multiple touchpoints. A strong CDI product should be able to keep data up to date and synchronized across the enterprise.
* Can you use the same matching technology in order to maintain all your relationships: for example, individuals, households and companies?
(read rest of article here)
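To make the probabilistic-scoring idea concrete, here is a minimal sketch in awk. Everything in it is made up for illustration: the pipe-delimited pair layout (last|first|zip|phone for each of the two records on a line), the per-attribute weights, and the threshold would in practice all come from frequency analysis of your own data, not from these numbers.

# pairscore.awk - score candidate record pairs (hypothetical layout:
# last|first|zip|phone|last|first|zip|phone, one pair per line)
BEGIN {
    FS = "|"
    # illustrative agreement weights; real weights come from file analysis
    w[1] = 4.0   # last name
    w[2] = 2.0   # first name
    w[3] = 3.0   # ZIP code
    w[4] = 5.0   # phone
    threshold = 8.0
}
{
    score = 0
    for (i = 1; i <= 4; i++)
        if ($i != "" && $i == $(i + 4))   # agreement on a non-blank attribute
            score += w[i]
    print score "\t" (score >= threshold ? "MATCH" : "REVIEW") "\t" $0
}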
How to create that Golden Master
The VP of Siperian describes the need to assign reliability factors to the different sources of data, since some sources may have more accurate data:
In order to deliver a “golden” or master record for each customer and its various affiliations, a system must dynamically assess reliability across all data sources – based on user-defined parameters and intrinsic data properties – and ensure that only the most reliable content survives at the cell level of the master record. For instance, if the call center begins collecting email addresses when confirming orders, this data attribute may be more reliable than the email addresses submitted by customers at the website. The ability to rapidly adjust the system to survive the call center email address over the website email address is a critical architectural component of any CDI system. Moreover, such cell-level survivorship for data reliability must be built into the core product architecture and should not be sacrificed as the customer hub scales to millions of customers. Ultimately, how well the end-users accept a customer data hub depends on sustaining a high level of data reliability, even as the hub grows in volume or as new data sources are added.
In looking into different CDI solutions, you must examine how the data hub is structured, whether it has configurable rules for loading and merging, and whether it has a robust Data Steward interface. Realize that data stewardship is the way all 'exceptions' get handled. If you do not separate the confident updates from the questionable ones, you will have a poor foundation that will eventually lead to a polluted database.
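Here is a minimal sketch of what cell-level survivorship can look like in SQL. The table and column names are hypothetical, and the reliability ranking would be user-defined per source: for each customer, keep the email from the most reliable source, breaking ties by recency.

SELECT customer_id, email
FROM (
    SELECT customer_id,
           email,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY source_reliability DESC, updated_dt DESC
           ) AS rn
    FROM customer_emails   -- hypothetical staging table: one row per source per customer
)
WHERE rn = 1;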
A business cannot implement an operational customer hub in the absence of data stewardship ― as soon as data begins to flow through the supported business processes, exceptions and errors begin to flow as well. Therefore, any customer hub acting as a data integration platform must offer business capabilities to monitor and handle such exceptions either by business analysts and end-users or by anyone designated as a data steward.
I've worked on custom CDI projects before CDI was called CDI and can attest that the Data Stewardship interface will make or break your CDI project. This should not be underestimated.
...
Often managing such exceptions requires full user interfaces for complex data stewardship tasks such as investigating the history and lineage of certain master records in question. This may be the only way to ensure that the user acceptance and data hub reliability remains high. In other circumstances, an enterprise may choose to build a specific user interface against the programming interfaces of the master hub in order to suit its needs. In either case, an adaptive solution must deliver rules-based configurability with best-in-class data stewardship consoles as well as programming interfaces to handle all data exception and reliability needs.
The last point is data export back to the original sources. This is where you exchange your centralized view with the sources and provide them with your expanded view to assist them in better understanding your customers. If you have done a good job, they will love you; if you haven't, you will be public enemy number one. They will insist their data is "better than yours", and "why did you change my address?", "you merged two customers into the same one!" and other rantings will be sent your way if you have not done a good job.
You should have a well-thought-out strategy, including which data elements you will send back to the sources, and you should allow them to decide whether or not they want your data.
Often, certain data attributes (such as privacy preferences) need central control and exception handling whereas other attributes are best left under local management. In addition, security and access to data attributes in the hub will vary by individual roles within each organization and by organization at large. In fact, to support the broad range of business requirements across business lines, there may be multiple data governance regimes required for different data attributes, all within a single enterprise.
An adaptive approach must be based on a distributed architecture whereby multiple hubs can be deployed to integrate different data sources and support different processes, yet be able to share data across one another based on any number of hub-and-spoke or peer-to-peer governance regimes. This offers a line of business yet another dimension of flexibility to share some but not all data – each based on its own data reliability and governance policies. With full rules-based configurability and data stewardship interfaces, a broad range of data governance regimes can be supported.
In summary, data must be the central focus of any CDI project, and data reliability, stewardship and the governance regime must never be an afterthought of a comprehensive CDI solution. Enterprises must closely review the details of a vendor’s offering in this area to reduce their risk of project failure and the total cost of CDI implementation.
Friday, December 22, 2006
Customer Data Integration (CDI) Can't Decipher It
"Oracle leads in customer data integration. Oracle has assembled the broadest set of components, from data quality, to matching, to customer key management, to process integration, to ongoing monitoring and management."—Forrester Research
Customer knowledge is the lifeblood of the enterprise. And customer knowledge begins with a unified, reconciled source of high-quality customer data—provided only by the Oracle Customer Data Hub.
Internet seminar: Get A Single View of Your Customers
Internet seminar: Practical Guide to Customer Data Politics
Oracle Customer Data Hub is a fully integrated customer data management solution that centralizes, de-duplicates, and enriches your customer data, continuously synchronizing with all your data sources, to give you a single view of your customers. As new data comes in, reporting accuracy grows, analytics become more valuable, employee productivity increases, and day-to-day customer relationships improve.
Gartner's findings on CDI from May 2006 may already be outdated in this fast-evolving market.
Thursday, December 21, 2006
World's Oldest Computer - Antikythera Mechanism

An ancient computer found off the coast of Antikythera, Greece, is estimated to have been built over 2,000 years ago. This makes it the world's oldest known computing device.
Above is a reconstruction of what this device is believed to have looked like. There are over two thousand Greek characters inscribed on this machine, which helped archaeologists decipher what it did and how it worked.
From a recent article in All Headlines News:
"...a device known as the Antikythera (an-tee-KITH-air-uh) mechanism, a device consisting of thirty bronze gears with approximately two-hundred-twenty-five teeth, was discovered off the Greek Island of Antikythera.
The Antikythera mechanism, which was made in Greece, is the earliest known computer, though research has shown that the mechanism was not unique for its time. Other instruments similar to the Antikythera have been mentioned in historical documents written in the first century B.C. by Cicero and Posidonius, the teacher of Cicero.
Scientists believe the Antikythera mechanism may have been used in the field of astronomy, showing the motion of the planets and computing eclipses of the sun and moon."
Here is a photo of what one of the pieces of this computer actually looks like:

Other Antikythera links:
Reconstruction of Antikythera Mechanism
Antikythera Animation of gears
Pictures of mechanism
An Ancient Greek Computer
Oldest Known Computer
Monday, December 18, 2006
Oracle SQL Function - INSTR()
This is described on the following page, which lists all the Oracle SQL functions and is a very useful link:
INSTR(str, chars [, s [, n]])
Finds the n'th occurrence of 'chars' in 'str', starting at position 's'.
Both s and n default to 1.
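A quick illustration (the string literals are just sample values):

SELECT INSTR('CORPORATE FLOOR', 'OR', 3, 2) AS pos
FROM dual;
-- Returns 14: searching from position 3, the second
-- occurrence of 'OR' starts at character 14.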
Oracle SQL Functions
To see a full listing of all Oracle SQL Functions you can run the following query:
SELECT distinct object_name
FROM all_arguments
WHERE package_name = 'STANDARD';
(I found this while looking for the equivalent of the index() function in Oracle SQL.)
Wednesday, December 06, 2006
Awk Executables for Windows
AT&T Original Awk
Other Awk Downloads
What is AWK?
Now that I am writing about AWK, I thought I should start off with a descriptive definition of this language (see the full description here):
awk is a simple and elegant pattern scanning and processing language. It was created in the late 1970s. The name was composed from the initial letters of its three original authors: Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. It is commonly used as a command-line filter in pipes to reformat the output of other commands. It's the precursor and the main inspiration of Perl. Although it originated on Unix, it is available and widely used in the Windows environment too. It takes two inputs: a data file and a command file. The command file can be absent, and the necessary commands can be passed as arguments. As Ronald P. Loui aptly noted, awk is a very underappreciated language:
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a "scripting language." It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
The main advantage of AWK is that, unlike Perl and other "scripting monsters," it is very slim, without the feature creep so characteristic of Perl, and thus it can be used very efficiently with pipes. It also has a rather simple, clean syntax and, like the much heavier TCL, can be used with C for "dual-language" implementations.
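As a small example of awk as a pipe filter (assuming a Unix-like shell where the fifth column of ls -l is the file size, as on most systems), this one-liner totals the bytes in a directory listing:

ls -l | awk '{ total += $5 } END { print total }'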
How to parse a CSV file in AWK
So when I am dealing with a .csv file, it is going to have quotes around each field and a comma between each field.
So then you might say, why not set up the field separator like this: FS = "\",\"", which makes the delimiter the three-character sequence ",".
Great, but the first field $1 will be left with a quote at the beginning (and the last field with one at the end), as awk was not programmed to understand anything differently.
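A quick illustration of the leftover-quote problem (the sample input line is made up):

echo '"John","Smith","Bronx"' | awk 'BEGIN { FS = "\",\"" } { print $1 }'

This prints "John with the leading quote still attached (and $3 would come out as Bronx" with a trailing quote).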
So I am looking for a simple way to parse .csv files using awk.
Here is all I've found so far:
Handling CSV data and a complicated example.
Google Group on AWK CSV parsing
Another Thread related to AWK CSV parsing
From all of this the most workable solution so far is:
function setcsv(str, sep) {
    # Protect escaped quotes ("" or \") so they don't end a quoted field.
    gsub(/[\\"]"/, "\035", str)
    # Hide any separators that appear inside quoted fields.
    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
    }
    if (index(str, "\"")) {
        # An unmatched quote means the field continues on the next line.
        if ((getline) <= 0)
            return 0
        return setcsv(str "\n" $0, sep)
    }
    else {
        gsub("\035", "\"", str)   # restore the protected quotes
        gsub(sep, "\034", str)    # real separators become \034 (our FS)
        gsub("\036", sep, str)    # separators inside fields become literal again
        $0 = str                  # re-split the record on FS = "\034"
        return 1
    }
}
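A minimal usage sketch, assuming the function above is saved in a file such as parsecsv.awk (the file names and the field printed are hypothetical):

# run as: awk -f parsecsv.awk test.csv
BEGIN { FS = "\034" }   # setcsv() turns real separators into \034
{
    if (setcsv($0, ","))
        print "second field: " $2
}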
Introduction to Awk
An Awk Primer
Getting Started with Awk
The GNU Awk User's Guide
Awk is your friend
I'll get more into the FS variable and try to overcome some of the annoyances I've had with it in my next posts.
Tuesday, December 05, 2006
Scientology Disaster Response Team Asking for Volunteers
Here is a link to the Scientology Disaster Response website, which gives directions on how to help, whether through donations or by volunteering yourself.
Friday, December 01, 2006
SQL Plus Delimited Output File
Something that has always been a pain is when you need a fast report generated from your latest SQL script. You probably just resort to executing the query in TOAD or another query tool and then saving it off to a file. That is about five mouse clicks too many, and you just wish you could click a button to run your script and dump the results to a nicely formatted text file. Using SQL Plus and a batch script, you can set up an ad hoc query tool for yourself that dumps the results to a delimited output file. This is actually quite simple:
You need three files, a) the SQL query you want to run, b) the SQLPlus script that will run your query and c) a batch program to run your SQLPlus script.
Here is how I did it:
a) queryOut.sql - This contains a query set up for pipe-delimited output. You can change the delimiter to whatever you wish. Here is a sample query -

SELECT field1 || '|' ||
       field2 || '|' ||
       field3 || '|' ||
       field4
FROM {table}
WHERE {criteria};

b) queryToFile.sql - This contains the SQL Plus script, including formatting options, so you can dump out a file with your results and no other fluff. Here are the settings I used -

set termout off
set feedback off
set verify off
set echo off
set pagesize 0
set linesize 200
set trimspool on
column dt new_Value mydate noprint
select to_char(sysdate, 'YYYYMMDD') dt from dual;
spool rpt&mydate..txt
@queryOut.sql;
spool off
exit;

c) runQuery.bat - This is the simple batch program that calls SQLPLUS with your username and password and the script to run. Here is what this would look like -

SQLPLUS {username}/{password}@{database} @queryToFile.sql

I set up the above in a directory for these ad hoc queries, then edit my queryOut.sql file and click on runQuery.bat, which I added to a toolbar. Try it out and let me know if you found this useful or if you have any better ideas...