bug-gawk

Re: Quotes being stripped by "--csv"


From: Ed Morton
Subject: Re: Quotes being stripped by "--csv"
Date: Thu, 23 Nov 2023 09:00:26 -0600
User-agent: Mozilla Thunderbird

Thanks Arnold, I just replied to your GitHub issue explaining why I think Brian misunderstood my suggestion and use case. Here is that reply for anyone not following the GitHub issue:


On 11/23/2023 8:02 AM, Arnold Robbins wrote:

I asked Brian Kernighan for his thoughts about this. Here is his response, quoted by permission.

Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST)
From: Brian Kernighan <bwk@....>
Subject: Re: CSV output

Hi --

I'm pretty skeptical about preserving the it-was-quoted status of input fields, since it seems likely to add distributed complexity and offers a chance to create more dark corners, for little profit. Does the "quoted" status persist when the value of a field is changed?

With my proposal a quoted field would be treated no differently from any other field; there's no status associated with it. When the input is read, quotes are stripped or not, that's all. From then on whether the field is `foo` or `"foo"` is completely irrelevant, the same as if such fields were read without `--csv`.

How about when it is copied? When it's used in a string concatenation? When $0 is reconstituted from $1 ... $NF?

Again, none of that matters. Fields that start and end with a quote are just another field, treated the same way as every other field. If the input was:

    this,"foo,bar",that

and $0 is reconstituted from $1 ... $NF then it simply becomes:

    this,"foo,bar",that
|
If a field like `"foo,bar"` is concatenated with a string `whatever` then the result is `"foo,bar"whatever`. If the user wants to do anything else with that, e.g. for printing as part of a valid CSV, they are free to do so.

While one could legitimately argue either way on such things, each would require more code and explanation. It's reminiscent of other intricate states, starting with string vs number vs both, then OFMT vs CONVFMT, and of course the internal states that try to cope with memory management.

No, this is very simple and requires almost no explanation and no extra code to do anything; it's just business as usual.

On the flip side, this rule is easy to state and explain:

    quotes are removed on input
    quotes can be added when needed with a 2-line function

My suggestion is also easy to state and explain:

    when using --csv quotes are removed on input
    when using --csvq quotes are not removed on input

Yes, with the current approach quotes can be added when needed, but they can't easily be added in a way that reproduces which fields did or didn't have quotes in the input, so I can't just do:

    echo 'foo,"bar",etc' | awk --csv -v OFS=',' '{print $1, $2}'

or similar and get output of:

    foo,"bar"

The best I can come up with to make that work, taking advantage of `--csv` to handle newlines within quoted fields correctly, would be:

    echo 'foo,"bar",etc' |
    awk --csv -v OFS=',' '{
        tail = $0
        $0 = ""
        nf = 0
        while ( (tail != "") && match(tail,/([^,]*)|("([^"]*|"")*")/) ) {
            $(++nf) = substr(tail,RSTART,RLENGTH)
            tail = substr(tail,RSTART+RLENGTH+1)
        }

        print $1, $2
    }'
    foo,"bar"

People have been writing tools to work on whatever subset of CSV they use in their applications for 50 years. Whatever the awk output is passed to for further processing simply may not be able to handle all fields being quoted, or whatever other algorithm is implemented to make the output valid CSV, and it's reasonable for users to expect a simple `print $1, $2` to output the fields quoted as they were in the input.

Color me opposed, I think.

Brian

Closing this issue. Thanks everyone.

That's too bad as I think from Brian's response above that my suggestion was misunderstood as being far more complicated than it actually is.

    Ed.


On 11/23/2023 8:56 AM, arnold@skeeve.com wrote:
Hi Ed.

Ed Morton <mortoneccc@comcast.net> wrote:

If there was a way to have `--csv` optionally NOT strip double quotes
when reading the fields then that'd solve the problem, e.g. `--csv=q` or
`--csvq` or similar to indicate quotes in and around fields should be
retained. ....

Thanks for the note and for opening the issue on GitHub.
I asked Brian Kernighan in private mail about this.
Here is his response, quoted by permission.

Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST)
From: Brian Kernighan <bwk@...>
Subject: Re: CSV output

Hi --

I'm pretty skeptical about preserving the it-was-quoted status of
input fields, since it seems likely to add distributed complexity
and offers a chance to create more dark corners, for little
profit.

Does the "quoted" status persist when the value of a field is
changed?  How about when it is copied?  When it's used in a string
concatenation?  When $0 is reconstituted from $1 ... $NF?  While
one could legitimately argue either way on such things, each would
require more code and explanation.

It's reminiscent of other intricate states, starting with string
vs number vs both, then OFMT vs CONVFMT, and of course the
internal states that try to cope with memory management.

On the flip side, this rule is easy to state and explain:

    quotes are removed on input
    quotes can be added when needed with a 2-line function

Color me opposed, I think.

Brian

I have to say I agree with him. I note that the gawk manual
has functions similar to those in the second edition of the awk
book, and they're included in the gawk distribution.

Thanks,

Arnold
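
For readers following the thread without the manual to hand, here is a minimal sketch of the kind of "quote when needed" function Brian describes. It is not the code from the gawk manual or the awk book. The names `tsv_to_csv` and `csv_field` are hypothetical, and the input is assumed to be tab-separated so that the example runs in any POSIX awk rather than requiring `--csv`. Under `gawk --csv` the same function could be applied to fields after their quotes have been stripped. Note that it quotes by need, not by whether the input field was quoted, which is exactly the distinction Ed raises.

```shell
# Hypothetical helper: read tab-separated lines on stdin and write CSV on
# stdout, quoting a field only when it contains a comma, a double quote,
# or a newline.
tsv_to_csv() {
    awk -F'\t' -v OFS=',' '
    function csv_field(s) {
        if (s ~ /[",\n]/) {          # field needs protection
            gsub(/"/, "\"\"", s)     # double any embedded quotes
            s = "\"" s "\""          # wrap the whole field in quotes
        }
        return s
    }
    {
        for (i = 1; i <= NF; i++)
            $i = csv_field($i)       # assigning a field rebuilds $0 with OFS
        print
    }'
}

printf 'this\tfoo,bar\tsay "hi"\n' | tsv_to_csv
# prints: this,"foo,bar","say ""hi"""
```

A field such as `this` passes through unquoted, so output produced this way stays readable by tools that cannot handle every field being quoted.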


