A developer's guide to Record Linkage, List Hygiene, Database Marketing and Customer Data Integration. Tips and tricks from Java, SQL, Awk and other programming languages that are useful in building an accurate customer database.
Sunday, December 31, 2006
Awk Tab Delimiter (FS="\t")
Tab Delimited: FS="\t"
Comma Delimited: FS="," (Beware of CSV files. See earlier post on this.)
Pipe Delimited: FS="|"
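As a minimal sketch, here is how the tab setting gets used in practice to print the second column of a tab-delimited file (the file name data.tsv is just a placeholder):

awk 'BEGIN { FS = "\t" } { print $2 }' data.tsv

You can also set the separator from the command line with the -F option, e.g. awk -F'\t' '{ print $2 }' data.tsv.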
Friday, December 29, 2006
CDI - Build or Buy?
However, with all that water under the bridge, it would be nice to be able to simply "plug-in" the missing parts to make my CDI solution complete.
I recommend buying a solution outright. Even though the cost may seem overwhelming at first, it will pay off if your organization is doing any volume of mailing, and it will probably pay for itself in postage savings alone.
Here are the key questions that should be asked of a CDI vendor if you are looking to buy:
* Since systems rarely share identifiers, does the solution score and match data taking advantage of all attributes that aid in the matching process, using likelihood-based statistical theory for the highest levels of accuracy? State-of-the-art systems use "probabilistic algorithms" to compare attributes one by one and produce a weighted score for each pair of records. The higher the score, the more likely the records represent the same person. This method improves matching accuracy by using weights that are specific to an organization's data. (A minimal scoring sketch follows this list.)
* Does the vendor provide analysis based on your own real-world data? Premier providers conduct frequency-based file analysis to provide weighting and thresholds specific to a customer's data. By using your own data, a better match is possible than by using generic criteria and arbitrary weights for each attribute. The data can be studied and tuned to recognize that a match on John Smith in the Bronx, NY, contains a more trusted matching value than a match on John Smith in Minnesota.
* Does the provider offer a way to capture and maintain a complete history of changes to attribute values? Such a method of "complete versioning" improves accuracy and has the ability to make the correct decisions on linking data even as it constantly changes due to marriage, divorce, new address, new phone number, etc.
* Does the solution enable you to establish the right accuracy and identification levels for your application and budget? Managing data quality is a business decision, so consider whether you need automated identification of results to ensure faster customer service, for example, or a solution that allows manual review of data to capture fraudulent customer activity, such as duplicate insurance claims.
* Can the solution overcome multiple identifications and data discrepancies across sources? Common obstacles to customer identity include transpositions, misspellings, nicknames, aliases, address inconsistencies and identity misrepresentations, all of which can occur when customer data flows into the company through multiple touchpoints. A strong CDI product should be able to keep data up to date and synchronized across the enterprise.
* Can you use the same matching technology in order to maintain all your relationships: for example, individuals, households and companies?
(read rest of article here)
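To make the probabilistic-scoring idea concrete, here is a minimal sketch in awk. Everything in it is made up for illustration: the pipe-delimited pair layout (last|first|zip|phone for each of the two records on a line), the per-attribute weights, and the threshold would in practice all come from frequency analysis of your own data, not from these numbers.

# pairscore.awk - score candidate record pairs (hypothetical layout:
# last|first|zip|phone|last|first|zip|phone, one pair per line)
BEGIN {
    FS = "|"
    # illustrative agreement weights; real weights come from file analysis
    w[1] = 4.0   # last name
    w[2] = 2.0   # first name
    w[3] = 3.0   # ZIP code
    w[4] = 5.0   # phone
    threshold = 8.0
}
{
    score = 0
    for (i = 1; i <= 4; i++)
        if ($i != "" && $i == $(i + 4))   # agreement on a non-blank attribute
            score += w[i]
    print score "\t" (score >= threshold ? "MATCH" : "REVIEW") "\t" $0
}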
How to create that Golden Master
The VP of Siperian describes the need to assign reliability factors to the different sources of data, since some sources may have more accurate data:
In order to deliver a “golden” or master record for each customer and its various affiliations, a system must dynamically assess reliability across all data sources – based on user-defined parameters and intrinsic data properties – and ensure that only the most reliable content survives at the cell level of the master record. For instance, if the call center begins collecting email addresses when confirming orders, this data attribute may be more reliable than the email addresses submitted by customers at the website. The ability to rapidly adjust the system to survive the call center email address over the website email address is a critical architectural component of any CDI system. Moreover, such cell-level survivorship for data reliability must be built into the core product architecture and should not be sacrificed as the customer hub scales to millions of customers. Ultimately, how well the end-users accept a customer data hub depends on sustaining a high level of data reliability, even as the hub grows in volume or as new data sources are added.
In looking into different CDI solutions, you must examine how the data hub is structured, whether it has configurable rules for loading and merging, and whether it has a robust Data Steward interface. Realize that data stewardship is the way all 'exceptions' get handled. If you do not separate the confident updates from the questionable ones, you will have a poor foundation that will eventually lead to a polluted database.
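Here is a minimal sketch of what cell-level survivorship can look like in SQL. The table and column names are hypothetical, and the reliability ranking would be user-defined per source: for each customer, keep the email from the most reliable source, breaking ties by recency.

SELECT customer_id, email
FROM (
    SELECT customer_id,
           email,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY source_reliability DESC, updated_dt DESC
           ) AS rn
    FROM customer_emails   -- hypothetical staging table: one row per source per customer
)
WHERE rn = 1;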
A business cannot implement an operational customer hub in the absence of data stewardship ― as soon as data begins to flow through the supported business processes, exceptions and errors begin to flow as well. Therefore, any customer hub acting as a data integration platform must offer business capabilities to monitor and handle such exceptions either by business analysts and end-users or by anyone designated as a data steward.
I've worked on custom CDI projects before CDI was called CDI and can attest that the Data Stewardship interface will make or break your CDI project. This should not be underestimated.
...
Often managing such exceptions requires full user interfaces for complex data stewardship tasks such as investigating the history and lineage of certain master records in question. This may be the only way to ensure that the user acceptance and data hub reliability remains high. In other circumstances, an enterprise may choose to build a specific user interface against the programming interfaces of the master hub in order to suit its needs. In either case, an adaptive solution must deliver rules-based configurability with best-in-class data stewardship consoles as well as programming interfaces to handle all data exception and reliability needs.
The last point is data export back to the original sources. This is where you exchange your centralized view with the sources and provide them with your expanded view to assist them in better understanding your customers. If you have done a good job, they will love you; if you haven't, you will be public enemy number one. They will insist their data is "better than yours", and "why did you change my address?", "you merged two customers into the same one!" and other rantings will be sent your way if you have not done a good job.
You should have a well-thought-out strategy, including which data elements you will send back to the sources, and you should allow them to decide whether or not they want your data.
Often, certain data attributes (such as privacy preferences) need central control and exception handling whereas other attributes are best left under local management. In addition, security and access to data attributes in the hub will vary by individual roles within each organization and by organization at large. In fact, to support the broad range of business requirements across business lines, there may be multiple data governance regimes required for different data attributes, all within a single enterprise.
An adaptive approach must be based on a distributed architecture whereby multiple hubs can be deployed to integrate different data sources and support different processes, yet be able to share data across one another based on any number of hub-and-spoke or peer-to-peer governance regimes. This offers a line of business yet another dimension of flexibility to share some but not all data – each based on its own data reliability and governance policies. With full rules-based configurability and data stewardship interfaces, a broad range of data governance regimes can be supported.
In summary, data must be the central focus of any CDI project, and data reliability, stewardship and the governance regime must never be an afterthought of a comprehensive CDI solution. Enterprises must closely review the details of a vendor’s offering in this area to reduce their risk of project failure and the total cost of CDI implementation.
Friday, December 22, 2006
Customer Data Integration (CDI) Can't Decipher It
"Oracle leads in customer data integration. Oracle has assembled the broadest set of components, from data quality, to matching, to customer key management, to process integration, to ongoing monitoring and management."—Forrester Research
Customer knowledge is the lifeblood of the enterprise. And customer knowledge begins with a unified, reconciled source of high-quality customer data—provided only by the Oracle Customer Data Hub.
Internet seminar: Get A Single View of Your Customers
Internet seminar: Practical Guide to Customer Data Politics
Oracle Customer Data Hub is a fully integrated customer data management solution that centralizes, de-duplicates, and enriches your customer data, continuously synchronizing with all your data sources, to give you a single view of your customers. As new data comes in, reporting accuracy grows, analytics become more valuable, employee productivity increases, and day-to-day customer relationships improve.
Gartner's findings on CDI from May 2006 may already be outdated in this fast-evolving market.
Thursday, December 21, 2006
World's Oldest Computer - Antikythera Mechanism

An ancient computer found off the coast of Antikythera, Greece, is estimated to have been built over 2,000 years ago. This makes it the world's oldest known computing device.
Above is a reconstruction of what this device is believed to have looked like. There are over two thousand Greek characters inscribed on this machine, which helped archaeologists decipher what it did and how it worked.
From a recent article in All Headlines News:
"...a device known as the Antikythera (an-tee-KITH-air-uh) mechanism, a device consisting of thirty bronze gears with approximately two-hundred-twenty-five teeth, was discovered off the Greek Island of Antikythera.
The Antikythera mechanism, which was made in Greece, is the earliest known computer, though research has shown that the mechanism was not unique for its time. Other instruments similar to the Antikythera have been mentioned in historical documents written in the first century B.C. by Cicero and Posidonius, the teacher of Cicero.
Scientists believe the Antikythera mechanism may have been used in the field of astronomy, showing the motion of the planets and computing eclipses of the sun and moon."
Here is a photo of what one of the pieces of this computer actually looks like:

Other Antikythera links:
Reconstruction of Antikythera Mechanism
Antikythera Animation of gears
Pictures of mechanism
An Ancient Greek Computer
Oldest Known Computer
Monday, December 18, 2006
Oracle SQL Function - INSTR()
This is described on the following page, which lists all the Oracle SQL functions and is a very useful link:
INSTR(str, chars [, s [, n]])
Finds the n'th occurrence of 'chars' in 'str', starting at position 's'.
Both s and n default to 1.
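A quick illustration (the string literals are just sample values):

SELECT INSTR('CORPORATE FLOOR', 'OR', 3, 2) AS pos
FROM dual;
-- Returns 14: searching from position 3, the second
-- occurrence of 'OR' starts at character 14.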
Oracle SQL Functions
To see a full listing of all Oracle SQL Functions you can run the following query:
SELECT distinct object_name
FROM all_arguments
WHERE package_name = 'STANDARD';
(I found this while looking for the equivalent of the index() function in Oracle SQL.)
Wednesday, December 06, 2006
Awk Executables for Windows
AT&T Original Awk
Other Awk Downloads
What is AWK?
Now that I am writing about AWK, I thought I should start off with a descriptive definition of this language (see the full description here):
awk is a simple and elegant pattern scanning and processing language. It was created in the late 1970s. The name was composed from the initial letters of its three original authors: Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. It is commonly used as a command-line filter in pipes to reformat the output of other commands. It's the precursor and the main inspiration of Perl. Although it originated on Unix, it is available and widely used in the Windows environment too. It takes two inputs: a data file and a command file. The command file can be absent, and the necessary commands can be passed as arguments. As Ronald P. Loui aptly noted, awk is a very underappreciated language:
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a "scripting language." It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
The main advantage of AWK is that, unlike Perl and other "scripting monsters," it is very slim, without the feature creep so characteristic of Perl, and thus it can be used very efficiently with pipes. It also has a rather simple, clean syntax and, like the much heavier TCL, can be used with C for "dual-language" implementations.
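As a small example of awk as a pipe filter (assuming a Unix-like shell where the fifth column of ls -l is the file size, as on most systems), this one-liner totals the bytes in a directory listing:

ls -l | awk '{ total += $5 } END { print total }'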
How to parse a CSV file in AWK
So when I am dealing with a .csv file, it is going to have quotes around each field and a comma between each field.
So then you might say, why not set up the field separator like this: FS = "\",\"", which makes the delimiter the three-character sequence ",".
Great, but the first field $1 will be left with a quote at the beginning (and the last field with one at the end), as awk was not programmed to understand anything differently.
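A quick illustration of the leftover-quote problem (the sample input line is made up):

echo '"John","Smith","Bronx"' | awk 'BEGIN { FS = "\",\"" } { print $1 }'

This prints "John with the leading quote still attached (and $3 would come out as Bronx" with a trailing quote).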
So I am looking for a simple way to parse .csv files using awk.
Here is all I've found so far:
Handling CSV data and a complicated example.
Google Group on AWK CSV parsing
Another Thread related to AWK CSV parsing
From all of this the most workable solution so far is:
function setcsv(str, sep) {
    # Protect escaped quotes ("" or \") so they don't end a quoted field.
    gsub(/[\\"]"/, "\035", str)
    # Hide any separators that appear inside quoted fields.
    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
    }
    if (index(str, "\"")) {
        # An unmatched quote means the field continues on the next line.
        if ((getline) <= 0)
            return 0
        return setcsv(str "\n" $0, sep)
    }
    else {
        gsub("\035", "\"", str)   # restore the protected quotes
        gsub(sep, "\034", str)    # real separators become \034 (our FS)
        gsub("\036", sep, str)    # separators inside fields become literal again
        $0 = str                  # re-split the record on FS = "\034"
        return 1
    }
}
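A minimal usage sketch, assuming the function above is saved in a file such as parsecsv.awk (the file names and the field printed are hypothetical):

# run as: awk -f parsecsv.awk test.csv
BEGIN { FS = "\034" }   # setcsv() turns real separators into \034
{
    if (setcsv($0, ","))
        print "second field: " $2
}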
Introduction to Awk
An Awk Primer
Getting Started with Awk
The GNU Awk User's Guide
Awk is your friend
I'll get more into the FS variable and try to overcome some of the annoyances I've had with it in my next posts.
Tuesday, December 05, 2006
Scientology Disaster Response Team Asking for Volunteers
Here is a link to the Scientology Disaster Response website, which gives directions on how to help, whether through donations or by volunteering yourself.
Friday, December 01, 2006
SQL Plus Delimited Output File
Something that has always been a pain is when you need a fast report generated from your latest SQL script. You probably just resort to executing the query in TOAD or another query tool and then saving it off to a file. That is about five mouse clicks too many, and you just wish you could click a button to run your script and dump the results to a nicely formatted text file. Using SQL Plus and a batch script, you can set up an ad hoc query tool for yourself that dumps the results to a delimited output file. This is actually quite simple:
You need three files, a) the SQL query you want to run, b) the SQLPlus script that will run your query and c) a batch program to run your SQLPlus script.
Here is how I did it:
a) queryOut.sql - This contains a query set up for pipe-delimited output. You can change the delimiter to whatever you wish. Here is a sample query -

SELECT field1 || '|' ||
       field2 || '|' ||
       field3 || '|' ||
       field4
FROM {table}
WHERE {criteria};

b) queryToFile.sql - This contains the SQL Plus script, including formatting options, so you can dump out a file with your results and no other fluff. Here are the settings I used -

set termout off
set feedback off
set verify off
set echo off
set pagesize 0
set linesize 200
set trimspool on
column dt new_Value mydate noprint
select to_char(sysdate, 'YYYYMMDD') dt from dual;
spool rpt&mydate..txt
@queryOut.sql;
spool off
exit;

c) runQuery.bat - This is the simple batch program that calls SQLPLUS with your username and password and the script to run. Here is what this would look like -

SQLPLUS {username}/{password}@{database} @queryToFile.sql

I set up the above in a directory for these ad hoc queries, then edit my queryOut.sql file and click on runQuery.bat, which I added to a toolbar. Try it out and let me know if you found this useful or if you have any better ideas...