Wednesday, December 06, 2006

How to parse a CSV file in AWK

Okay, so I am not new to Awk - been using it for 3 years. One delimeter type I tend to stay away from is .csv files. I deal with a lot of address data so if you try to do the obvious thing and set the FS = "," you are going to run into a lot of incorrectly parsed fields as address data sometimes includes commas in the street address.

So when I am dealing with a .csv file it is going to have quotes around each field and then a comma between each field.

So then you might say, why not set up the Field Separator like this: FS = "/",/"" which translates into the delimiter looking like ",".

Great, but the first field $1 will be left with a quote in the beginning as awk was not programmed to understand anything differently.

So I am looking for a simple way to parse .csv files using awk.

Here is all I've found so far:

Handling CSV data and a complicated example.
Google Group on AWK CSV parsing
Another Thread related to AWK CSV parsing

From all of this the most workable solution so far is:

function setcsv(str, sep) {
gsub(/[\\"]"/, "\035", str)
while (match(str, /"[^"]*"/)) {
middle = substr(str, RSTART+1, RLENGTH-2)
gsub(sep,"\036", middle)
str = substr(str, 1, RSTART-1) middle substr(str, RSTART+RLENGTH)
}
if (index(str, "\"")) {
if ((getline) <= 0)
return 0
setcsv(str "\n" $0, sep)
}
else {
gsub("\035", "\"", str)
gsub(sep, "\034", str)
gsub("\036", sep, str)
$0 = str
return 1
}

}