bug-gawk

[bug-gawk] Parsing standard CVS data by gawk


From: Jarno Suni
Subject: [bug-gawk] Parsing standard CVS data by gawk
Date: Wed, 8 Jul 2015 00:17:17 +0300

The current manual says:
"NOTE: Some programs export CSV data that contains embedded newlines
between the double quotes. gawk provides no way to deal with this. Even
though a formal specification for CSV data exists, there isn’t much
more to be done; the FPAT mechanism provides an elegant solution for
the majority of cases, and the gawk developers are satisfied with that."
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

I think this is a bit misleading, since standard CSV data can be parsed
by gawk. The following script reads all the CSV data into a two-dimensional
array, which the END section of the gawk program then uses to display
the fields together with their array indexes:

dos2unix | gawk '
function strip_quoted_field(s)
{
        s = substr(s, 2, length(s) - 2)
        gsub(/""/, "\"", s)
        return s
}
BEGIN{
        RS = "" # paragraph mode: the whole file is one record as long as it contains no blank lines
        FS = "" # I guess this setting reduces internal splitting work
        record = 0;
}
{
        nof = patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
        field = 0;
        for (i = 1; i <= nof; i++) {
                field++         
                if (substr(a[i], 1, 1) == "\"") 
                  f[record][field] = strip_quoted_field(a[i])
                  else f[record][field] = a[i]
                if (seps[i] != ",") { field=0; record++ }
                delete a[i]     
        }
}
END{
        field=length(f[0])
        for (i = 0; i < record; i++) 
                for (j = 1; j <= field; j++)
                        printf "%d %d :%s\r\n", i, j, f[i][j]  # data must not be used as the format string
}'
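
To see what strip_quoted_field does in isolation, here is a small sketch
that applies it to each input line. The wrapper name unquote_demo is only
an illustration; the function body is copied from the script above. It
uses plain awk features only, so gawk runs it unchanged:

```shell
# unquote_demo is a hypothetical wrapper name used only for this illustration.
unquote_demo() {
    awk '
    function strip_quoted_field(s)
    {
            s = substr(s, 2, length(s) - 2)   # drop the surrounding quotes
            gsub(/""/, "\"", s)               # "" inside a quoted field means one "
            return s
    }
    { print strip_quoted_field($0) }
    '
}

# A quoted field containing two escaped quotes.
printf '%s\n' '"say ""hi"""' | unquote_demo
```

This prints: say "hi"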

The dos2unix utility is used to convert standard DOS-style line breaks
(CRLF, i.e. "\r\n") and a possible UTF-16 encoding (with byte order mark)
to "\n" and UTF-8 (without byte order mark), respectively. In my
experience the script also works with plain Unix-style UTF-8 input on Linux.

For some use cases it is not necessary to have all the data in memory at
one time, and the script above might not be the optimal implementation for
such a case. I also wrote an implementation that reads the input line by line.
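
The line-by-line implementation is not included in this message. As a
rough sketch of how such an approach could look (not that implementation
itself), the following collects physical lines until the number of double
quotes is even, then splits the completed record. The names csv_fields,
unquote and parse are made up for this sketch. It assumes well-formed
input with Unix line endings (for example after dos2unix) and uses plain
awk features only, so gawk runs it unchanged:

```shell
# csv_fields is a hypothetical name; this is an illustration only.
csv_fields() {
    awk '
    # Strip the surrounding quotes and turn each "" into a single ".
    function unquote(s) {
        s = substr(s, 2, length(s) - 2)
        gsub(/""/, "\"", s)
        return s
    }
    # Split one complete CSV record into out[1..n] and return n.
    # Assumes a quoted field is followed by a comma or the end of the record.
    function parse(rec, out,    n, s) {
        split("", out)                      # clear the result array
        n = 0
        s = rec
        while (1) {
            if (match(s, /^"([^"]|"")*"/))  # quoted field, may contain newlines
                out[++n] = unquote(substr(s, 1, RLENGTH))
            else {
                match(s, /^[^,]*/)          # unquoted field, possibly empty
                out[++n] = substr(s, 1, RLENGTH)
            }
            s = substr(s, RLENGTH + 1)
            if (s == "") break
            s = substr(s, 2)                # skip the comma
        }
        return n
    }
    {
        # Join this line to an unfinished record, if there is one.
        buf = pending ? (buf "\n" $0) : $0
        tmp = buf
        # An odd number of quotes means a quoted field is still open.
        pending = (gsub(/"/, "&", tmp) % 2 == 1)
        if (pending) next
        nf = parse(buf, fld)
        for (j = 1; j <= nf; j++)
            printf "%d %d :%s\n", rec, j, fld[j]
        rec++                               # records are numbered from 0
    }
    '
}

# Three records; the second has a line break inside a quoted field.
printf '%s\n' 'id,comment' '1,"first line' 'second line"' '2,"say ""hi"""' | csv_fields
```

Only one record is held in memory at a time. A record whose closing quote
never arrives is silently dropped in this sketch; a real implementation
would probably want to report it.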

Regards,

-- 
Jarno Ilari Suni - http://www.iki.fi/8/


