Re: Quotes being stripped by "--csv"
From: Ed Morton
Subject: Re: Quotes being stripped by "--csv"
Date: Sun, 26 Nov 2023 07:24:22 -0600
User-agent: Mozilla Thunderbird
Ben - I do appreciate your feedback and perspective, thanks. See below
for my response inline.
On 11/25/2023 9:08 PM, Ben Hoyt wrote:
Hi Ed,
It's likely this discussion is moot, given that Arnold said he's not
planning to change Gawk further. However, a few additional thoughts.
> My post is not about input mode vs output mode, it's entirely about input mode -
> a way to leave the quotes alone or strip them when populating fields, that is all.
> Output is left entirely up to the user in either case.
Yes, I recognized that's what you were suggesting. I just don't think
that's a very helpful way of operating on CSV fields, because with the
quotes left in you can't really operate on the data -- for example,
you can't treat fields as numbers or take their sum (the leading quote would
get in the way), and you can't even really treat them as strings
without stripping the quotes (for example, to concatenate a first name
field to last name). In short, the quoted field value would only be
usable if you're going to pass it straight through to the output.
That's correct, but not all CSV-processing applications require
modifying fields and not all applications that do modify fields are
allowed to produce output with different quotes than the input had even
if they have to strip those quotes temporarily while modifying the fields.
I get CSVs from multiple sources and need to compare/manipulate them and
return them to those sources or send to other destinations that would
otherwise receive the original exported CSV. Some of those CSVs are
exported from Excel or other Windows tools, some are exported from
various applications that run on various web sites, some are created by
various Unix tools that have evolved over the years. I see various
quoting styles/rules applied across those CSVs - quote only when needed,
quote all fields, quote all strings but do not quote numbers, quote only
specific columns, quote the data rows but not the header row, etc., etc.
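A minimal sketch of three of the quoting styles listed above, written in Python rather than gawk because its `csv` module exposes them as named constants. The sample row is hypothetical. It shows the same data rendered under each style, and that a decoding reader returns identical field values for all three, so the original style cannot be recovered after decoding.

```python
import csv
import io

# Hypothetical sample row: a string containing a comma, a number, a plain string
row = ["Smith, John", 42, "plain"]

def render(quoting):
    """Write the sample row as one CSV line using the given quoting style."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")

minimal = render(csv.QUOTE_MINIMAL)        # quote only when needed
quote_all = render(csv.QUOTE_ALL)          # quote all fields
nonnumeric = render(csv.QUOTE_NONNUMERIC)  # quote strings but not numbers

# A default reader decodes every style to the same unquoted string fields
decoded = [next(csv.reader([line])) for line in (minimal, quote_all, nonnumeric)]

for line in (minimal, quote_all, nonnumeric):
    print(line)
```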
I just counted them and I have 333 gawk scripts using FPAT to manipulate
CSVs plus several other CSV-processing scripts that don't use FPAT (most
written pre-FPAT). For some I just need to map input fields (from
potentially multiple CSVs) to output fields, e.g.:
ARGIND == 1 { file1[FNR] = $3 OFS $7; next }
ARGIND == 2 { file2[FNR] = $12; next }
$3 ~ /whatever/ { print $1, $9, file1[FNR], file2[FNR] }
For others I have to modify some field(s) and today using FPAT I do
something like:
if ( quoted=gsub(/^"|"$/,"",$8) ) { gsub(/""/,"\"",$8) }
$8 = whatever
if ( quoted || ($8 ~ /[\n",]/) ) { gsub(/"/,"\"\"",$8); $8 = "\"" $8 "\"" }
So, in each case, the field is simply quoted or not in the output based
on whether it was quoted or not in the input (or would be invalid CSV if
it wasn't quoted).
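The strip, modify, re-quote pattern described above can also be sketched outside gawk. The Python below is an illustration under stated assumptions, not the scripts referred to in this message: `split_keep_quotes` and `update_field` are hypothetical helper names, records are assumed to fit on one line, and a field counts as quoted only when it both starts and ends with a double quote.

```python
def split_keep_quotes(line):
    """Split one CSV record on commas, leaving any surrounding quotes in place."""
    fields, start, in_quotes = [], 0, False
    for i, ch in enumerate(line):
        if ch == '"':
            # An escaped "" inside a quoted field toggles twice, so state is kept
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            fields.append(line[start:i])
            start = i + 1
    fields.append(line[start:])
    return fields

def update_field(raw, transform):
    """Strip quotes, apply transform, then re-quote if the input was quoted
    or the new value would be invalid CSV without quotes."""
    quoted = len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"'
    value = raw[1:-1].replace('""', '"') if quoted else raw
    value = transform(value)
    if quoted or any(c in value for c in '\n",'):
        value = '"' + value.replace('"', '""') + '"'
    return value

line = '"a, b",7,"say ""hi"""'
fields = split_keep_quotes(line)
fields[1] = update_field(fields[1], lambda v: str(int(v) + 1))  # unquoted stays unquoted
fields[2] = update_field(fields[2], str.upper)                  # quoted stays quoted
result = ",".join(fields)
print(result)
```

The untouched first field passes through byte for byte, which is the property the quote-preserving approach relies on.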
In that way I just don't care what the quoting rules are for whatever
source I got the CSV from and will send the CSV to, I simply output
whatever quoting style was input and so I KNOW it'll work at the
destination without assuming they can handle all possible, or any
specific, CSV quoting styles.
I obviously know how to do whatever I personally need to do to get the
functionality I need, whether that's rolling my own record-reading
function with `FPAT` to read multi-line fields, as I've done in the past,
or rolling my own field-splitting function with `--csv` to retain quotes
around fields. (I now realise I probably won't actually do the latter, as
I'd then have to remember to call that function again any time I update $0,
so it's more impactful to the rest of the script than using FPAT and
calling a record-reading function exactly once in one location.) But
people have been writing tools to parse various subsets of CSVs with
various subsets of allowed/required quoting for 50+ years, and CSVs are
used in many varied applications with no one common standard they all
follow, despite the existence of RFC 4180. So I expect I'm not alone in
having a need for CSV parsing that simply doesn't strip quotes.
Given that, I suggested a mode like `--csv` but one that'd leave quotes
alone so we could do whatever we need to do in that regard. The
providers would rather not implement it, which is obviously entirely
their decision and is fine; I don't think we need to discuss it any further.
Ed.
Similarly, the "csv" module in Python and the "encoding/csv" package
in Go (and I presume it's similar in other languages) give you the
un-encoded field value so that you can perform operations on it.
> It is 1 of the 2 possible correct behaviors, and it's the one that I expect will be most
> useful most of the time.
I suppose it's not helpful to argue over what is "correct" or not, and
I take your point that what you propose is a possible behaviour.
However, I've tried to show above that the field values wouldn't be
very useful without un-encoding the data -- except to pass it directly
to the output. So I definitely disagree with the second part of your
statement. Based on my own usage, I'm very often summing a field or
similar, which wouldn't work with your approach (without further
dequoting/decoding).
To generalize, I think most data processing tends to work this way:
decode input, operate on decoded data, encode output.
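A minimal sketch of that decode, operate, encode pipeline, using the Python `csv` module mentioned above. The sample data and column names are hypothetical. Note that the writer chooses its own quoting, so the quotes around the input's `"1200"` do not survive the round trip.

```python
import csv
import io

# Hypothetical input: one quoted name containing a comma, one quoted number
data = 'name,amount\n"Smith, John","1200"\nJones,300\n'

# Decode: the reader strips surrounding quotes and un-doubles embedded ones
rows = list(csv.DictReader(io.StringIO(data)))

# Operate: values are plain strings, so numeric conversion works directly
total = sum(int(r["amount"]) for r in rows)

# Encode: the writer applies its own (minimal) quoting rules on output
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["name", "amount"])
for r in rows:
    writer.writerow([r["name"], r["amount"]])
writer.writerow(["TOTAL", total])
encoded = out.getvalue()
print(encoded, end="")
```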
In any case, I do think Kernighan's choice to have --csv decode the
input so that you can operate on decoded data is the more helpful
choice, and consistent with what other languages do.
-Ben