Awk Tab Delimiter (FS="\t")
Quick guide to Awk Field Separators (FS):
Tab Delimited: FS="\t"
Comma Delimited: FS="," (Beware of CSV files. See earlier post on this.)
Pipe Delimited: FS="|"
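For example, to print the second column of a tab-delimited file (the file name and column number are just placeholders):

awk 'BEGIN { FS = "\t" } { print $2 }' data.tsv

The same thing can be done with the -F command-line option:

awk -F'\t' '{ print $2 }' data.tsv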
Organizations with a central HQ and regional offices must have some form of CDI in order to accurately understand their customer base. Otherwise you are left with disparate sets of unrelated data, unlinked and unconsolidated, making analysis and effective marketing nearly impossible. So once CDI as a topic is understood, the next question becomes - Build or Buy? After travelling down the "build" path for two years, I wish I had researched and implemented a commercial product, if only to avoid the headaches...
However, with all that water under the bridge, it would be nice to be able to simply "plug in" the missing parts to make my CDI solution complete.
I recommend buying a solution outright. Even though the cost may be overwhelming at first, it will pay off if your organization is doing any volume of mailing - it will probably pay for itself in postage saved alone.
Here are the key questions that should be asked of a CDI vendor if you are looking to buy:
* Since systems rarely share identifiers, does the solution score and match data taking advantage of all attributes that aid in the matching process using likelihood statistical theory for the highest levels of accuracy? State-of-the-art systems use "probabilistic algorithms" to compare attributes one by one and produce a weighted score for each pair of records. The higher the score, the more likely the records represent the same person. This method improves matching accuracy by using weights that are specific to an organization's data.
* Does the vendor provide analysis based on your own real-world data? Premier providers conduct frequency-based file analysis to provide weighting and thresholds specific to a customer's data. By using your own data, a better match is possible than by using generic criteria and arbitrary weights for each attribute. The data can be studied and tuned to recognize that a match on John Smith in the Bronx, NY, contains a more trusted matching value than a match on John Smith in Minnesota.
* Does the provider offer a way to capture and maintain a complete history of changes to attribute values? Such a method of "complete versioning" improves accuracy and has the ability to make the correct decisions on linking data even as it constantly changes due to marriage, divorce, new address, new phone number, etc.
* Does the solution enable you to establish the right accuracy and identification levels for your application and budget? Managing data quality is a business decision, so consider whether you need automated identification of results to ensure faster customer service, for example, or a solution that allows manual review of data to capture fraudulent customer activity, such as duplicate insurance claims.
* Can the solution overcome multiple identifications and data discrepancies across sources? Common obstacles to customer identity include transpositions, misspellings, nicknames, aliases, address inconsistencies and identity misrepresentations, all of which can occur when customer data flows into the company through multiple touchpoints. A strong CDI product should be able to keep data up to date and synchronized across the enterprise.
* Can you use the same matching technology in order to maintain all your relationships: for example, individuals, households and companies?
(read rest of article here)
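To make the first point above concrete, here is a toy Awk sketch of weighted attribute scoring. The attributes, weights and threshold are invented for illustration only - a real CDI product derives them from your own data:

BEGIN { FS = "\t"; THRESHOLD = 10 }
{
    # input: one candidate pair per line, tab-delimited:
    # last1, last2, zip1, zip2, phone1, phone2
    score = 0
    if ($1 == $2) score += 5   # last name agrees (assumed weight)
    if ($3 == $4) score += 4   # ZIP code agrees (assumed weight)
    if ($5 == $6) score += 7   # phone agrees (assumed weight)
    print $0 "\t" score "\t" (score >= THRESHOLD ? "match" : "no match")
}

The higher the total score, the more likely the two records represent the same person.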
The concept of a centralized or 360-degree view of a customer is often strived for, but the "name, want and get" is not always figured out. Most companies will name what they are looking for, some fall short in getting the board to want the solution, and many lose it on figuring out how to get it done.
A VP at Siperian describes the need to assign reliability factors to the different sources of data, since some sources may have more accurate data than others:
In order to deliver a “golden” or master record for each customer and its various affiliations, a system must dynamically assess reliability across all data sources – based on user-defined parameters and intrinsic data properties – and ensure that only the most reliable content survives at the cell level of the master record. For instance, if the call center begins collecting email addresses when confirming orders, this data attribute may be more reliable than the email addresses submitted by customers at the website. The ability to rapidly adjust the system to survive the call center email address over the website email address is a critical architectural component of any CDI system. Moreover, such cell-level survivorship for data reliability must be built into the core product architecture and should not be sacrificed as the customer hub scales to millions of customers. Ultimately, how well the end-users accept a customer data hub depends on sustaining a high level of data reliability, even as the hub grows in volume or as new data sources are added.
In looking into different CDI solutions, you must look at how the data hub is structured, whether it has configurable rules for loading and merging, and whether it has a robust data stewardship console. Realize that data stewardship is the way all 'exceptions' get handled. If you do not separate out the confident updates from the questionable ones, you will have a poor foundation that will eventually lead to a polluted database.
A business cannot implement an operational customer hub in the absence of data stewardship ― as soon as data begins to flow through the supported business processes, exceptions and errors begin to flow as well. Therefore, any customer hub acting as a data integration platform must offer business capabilities to monitor and handle such exceptions, either by business analysts and end-users or by anyone designated as a data steward.
I've worked on custom CDI projects before CDI was called CDI and can attest that the data stewardship interface will make or break your CDI project. This should not be underestimated.
...
Often managing such exceptions requires full user interfaces for complex data stewardship tasks such as investigating the history and lineage of certain master records in question. This may be the only way to ensure that the user acceptance and data hub reliability remains high. In other circumstances, an enterprise may choose to build a specific user interface against the programming interfaces of the master hub in order to suit its needs. In either case, an adaptive solution must deliver rules-based configurability with best-in-class data stewardship consoles as well as programming interfaces to handle all data exception and reliability needs.
Often, certain data attributes (such as privacy preferences) need central control and exception handling whereas other attributes are best left under local management. In addition, security and access to data attributes in the hub will vary by individual roles within each organization and by organization at large. In fact, to support the broad range of business requirements across business lines, there may be multiple data governance regimes required for different data attributes, all within a single enterprise.
An adaptive approach must be based on a distributed architecture whereby multiple hubs can be deployed to integrate different data sources and support different processes, yet be able to share data across one another based on any number of hub-and-spoke or peer-to-peer governance regimes. This offers a line of business yet another dimension of flexibility to share some but not all data – each based on its own data reliability and governance policies. With full rules-based configurability and data stewardship interfaces, a broad range of data governance regimes can be supported.
In summary, data must be the central focus of any CDI project, and data reliability, stewardship and governance regime must never be an after-thought of a comprehensive CDI solution. Enterprises must closely review the details of a vendor’s offering in this area to reduce their risk of project failure and the total cost of CDI implementation.
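A rough SQL sketch of the cell-level survivorship idea: rank each source by a reliability factor and keep only the most trusted email per customer. The table and column names here are invented for illustration, not Siperian's:

SELECT customer_id, email
FROM (
    SELECT customer_id, email,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY source_reliability DESC, updated_dt DESC) rn
    FROM customer_emails
)
WHERE rn = 1;

In practice the reliability factor itself would be user-defined per source (call center vs. website) and adjustable without code changes.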
Posted by Blackfoot at 2:38 PM
Labels: CDI, Data Governance, Data Stewards
I'm drowning in the sea of Customer Data Integration solutions: Data Librarians, Data Stewards, Registry hubs, Persistent hubs and Hybrids. Above is a good review of the basics of the problem domain. Key players are the Oracle Customer Data Hub, Siperian, Purisma and possibly the recent Sun acquisition - SeeBeyond.
"Oracle leads in customer data integration. Oracle has assembled the broadest set of components, from data quality, to matching, to customer key management, to process integration, to ongoing monitoring and management."—Forrester Research
Customer knowledge is the lifeblood of the enterprise. And customer knowledge begins with a unified, reconciled source of high-quality customer data—provided only by the Oracle Customer Data Hub.
Oracle Customer Data Hub is a fully integrated customer data management solution that centralizes, de-duplicates, and enriches your customer data, continuously synchronizing with all your data sources, to give you a single view of your customers. As new data comes in, reporting accuracy grows, analytics become more valuable, employee productivity increases, and day-to-day customer relationships improve.
An ancient computer found off the coast of Antikythera, Greece, is estimated to have been built over 2,000 years ago. This makes it the world's oldest known computing device.
Above is a reconstruction of what this device is believed to have looked like. There are over two thousand Greek characters inscribed on this machine, which helped archaeologists decipher what it did and how it worked.
Here is a photo of what one of the pieces of this computer actually looks like.
From a recent article in All Headlines News:
"...a device known as the Antikythera (an-tee-KITH-air-uh) mechanism, a device consisting of thirty bronze gears with approximately two-hundred-twenty-five teeth, was discovered off the Greek Island of Antikythera.
The Antikythera mechanism, which was made in Greece, is the earliest known computer, though research has shown that the mechanism was not unique for its time. Other instruments similar to the Antikythera have been mentioned in historical documents written in first-century B.C. by Cicero and Posidonius, the teacher of Cicero.
Scientists believe the Antikythera mechanism may have been used in the field of astronomy, showing the motion of the planets and computing eclipses of the sun and moon."
I found it. The use of INSTR() simulates the AWK function index(): it searches the first argument for the second argument and returns the position where it is first found. This was described in the following page, which lists all of the Oracle SQL functions and is a very useful link. Here is the syntax:
INSTR(str, chars[, s[, n]])
Find the 'n'th occurrence of 'chars' in 'str'
Starting at position 's'
n and s default to 1
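For example, the following (using throwaway sample data) finds the second occurrence of 'OR' starting from position 3:

SELECT INSTR('CORPORATE FLOOR', 'OR', 3, 2) AS pos FROM dual;

This returns 14, the position where the second 'OR' begins.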
You can also list the built-in SQL functions yourself with a query against the STANDARD package:
SELECT distinct object_name
FROM all_arguments
WHERE package_name = 'STANDARD';
From my blog you can probably tell I have an infatuation with Awk. Well, today I was trying to work out a problem in Oracle SQL where awk's index() function would have worked perfectly. I'm checking around to find out if there is any built-in function that searches an existing column for a string and returns the starting point if it exists. Oracle SQL does have the substr() function, which I've found quite useful...
Here are some free downloads of AWK that you can install on a Windows machine:
AT&T Original Awk
Other Awk Downloads
Now that I am writing about AWK, I thought I should start off with a descriptive definition of this language (see the full description here):
Awk is a simple and elegant pattern scanning and processing language. It was created in the late 1970s, and its name comes from the initials of its three original authors: Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. It is commonly used as a command-line filter in pipes to reformat the output of other commands. It is the precursor and the main inspiration of Perl. Although it originated on Unix, it is available and widely used in the Windows environment too. It takes two inputs: a data file and a command file. The command file can be absent, and the necessary commands can be passed as arguments. As Ronald P. Loui aptly noted, awk is a very underappreciated language:
Most people are surprised when I tell them what language we use in our undergraduate AI programming class. That's understandable. We use GAWK. GAWK, Gnu's version of Aho, Weinberger, and Kernighan's old pattern scanning language, isn't even viewed as a programming language by most people. Like PERL and TCL, most prefer to view it as a "scripting language." It has no objects; it is not functional; it does no built-in logic programming. Their surprise turns to puzzlement when I confide that (a) while the students are allowed to use any language they want; (b) with a single exception, the best work consistently results from those working in GAWK. (footnote: The exception was a PASCAL programmer who is now an NSF graduate fellow getting a Ph.D. in mathematics at Harvard.) Programmers in C, C++, and LISP haven't even been close (we have not seen work in PROLOG or JAVA).
The main advantage of AWK is that, unlike Perl and other "scripting monsters", it is very slim, without the feature creep so characteristic of Perl, and thus it can be used very efficiently with pipes. It also has a rather simple, clean syntax and, like the much heavier TCL, can be used with C for "dual-language" implementations.
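As a quick illustration of awk as a command-line filter in a pipe (the command and column numbers are just an example), this prints the owner, size and name of each file from ls -l output:

ls -l | awk 'NR > 1 { print $3, $5, $9 }'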
function setcsv(str, sep) {
    # Protect escaped quotes ("" or \") by mapping them to \035
    gsub(/[\\"]"/, "\035", str)
    # For each quoted field: strip the surrounding quotes and hide any
    # separators inside it as \036 so they are not treated as field breaks
    while (match(str, /"[^"]*"/)) {
        middle = substr(str, RSTART+1, RLENGTH-2)
        gsub(sep, "\036", middle)
        str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
    }
    # A leftover quote means a quoted field contains an embedded newline:
    # pull in the next physical line and reparse the combined record
    if (index(str, "\"")) {
        if ((getline) <= 0)
            return 0
        return setcsv(str "\n" $0, sep)   # return added so the result propagates
    }
    else {
        gsub("\035", "\"", str)   # restore escaped quotes as literal quotes
        gsub(sep, "\034", str)    # real separators become \034
        gsub("\036", sep, str)    # embedded separators go back to literal text
        $0 = str                  # reassigning $0 re-splits the fields on FS
        return 1
    }
}
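Here is a sketch of how setcsv might be driven from a main loop. It assumes FS is set to the control character \034 that the function substitutes for the real separator - the printed fields and the error message are mine, not from the original:

BEGIN { FS = "\034" }        # split on the character setcsv leaves between fields
{
    if (setcsv($0, ","))     # rewrite $0 with embedded commas/quotes protected
        print $1, $3         # fields can now be used normally
    else
        print "unterminated quoted field at record " NR
}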
Posted by Blackfoot at 12:38 AM
Labels: Scientology
Ad hoc queries anyone?
Something that has always been a pain is when you need a fast report generated from your latest SQL script. You probably just resort to executing the query in TOAD or another query tool, and then saving it off to a file. That is about 5 mouse clicks too many and you just wish you could click a button to run your script and dump out the results to a nicely formatted text file. Using SQL Plus and a batch script, you can set up an ad hoc query tool for yourself that dumps out the results to a delimited output. This is actually quite simple:
You need three files, a) the SQL query you want to run, b) the SQLPlus script that will run your query and c) a batch program to run your SQLPlus script.
Here is how I did it:
a) queryOut.sql - This contains a query set up for pipe-delimited output. You can change the delimiter to whatever you wish. Here is a sample query -
SELECT field1 || '|' ||
       field2 || '|' ||
       field3 || '|' ||
       field4
FROM {table}
WHERE {criteria};

b) queryToFile.sql - This contains the SQL Plus script, including formatting options, so you can dump out a file with your results and no other fluff. Here are the settings I used -

set termout off
set feedback off
set verify off
set echo off
set pagesize 0
set linesize 200
set trimspool on
column dt new_Value mydate noprint
select to_char(sysdate, 'YYYYMMDD') dt from dual;
spool rpt&mydate..txt
@queryOut.sql;
spool off
exit;

c) runQuery.bat - This is the simple batch program that calls SQLPLUS with your username and password and the script to run. Here is what this would look like -

SQLPLUS {username}/{password}@{database} @queryToFile.sql

I set up the above in a directory for these ad hoc queries, then edit my queryOut.sql file and click on runQuery.bat, which I added to a toolbar. Try it out and let me know if you found this useful or if you have any better ideas...