Re: csv with multiline records
From: Ole Tange
Subject: Re: csv with multiline records
Date: Fri, 16 Dec 2016 03:57:10 +0100
On Thu, Dec 15, 2016 at 3:35 AM, Ryan Brothers <ryan.brothers@gmail.com> wrote:
> On Wed, Dec 14, 2016 at 2:36 PM, Ole Tange <ole@tange.dk> wrote:
> > But if you can somehow replace the record separator, then you can use
> > --recend.
> >
> > Given your input this might work:
> >
> > parallel --pipe --recend '"\n'
> >
> > assuming a good part of the records have a last column with newlines.
>
> Thank you for your help. I can't assume the last column will always
> have newlines, but your suggestion with --recend gave me an idea to do
> something like:
>
> cat file.csv | php reformat.php | parallel --pipe --recend '@@@'
> --remove-rec-sep wc
Unless @@@ can occur in your real data, that should work just fine.
I often use \0 (NUL) as the separator, because it is very hard for even
a malicious user to enter. \0 will only fail if the data is binary.
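For illustration, here is a minimal sketch of what such a reformatting
step could look like. It is written in Python, not PHP, and is not the
reformat.php from this thread (whose source is not shown); the function
name reformat and its sep parameter are made up for this example. It
parses the CSV properly, so newlines inside quoted fields are kept, and
writes the separator after every record:

```python
import csv
import io


def reformat(src, dst, sep="@@@"):
    """Copy CSV records from src to dst, writing sep after each record.

    Hypothetical helper: sep may be "@@@" as in the thread, or "\0"
    if the data is known not to contain NUL bytes.
    """
    count = 0
    for row in csv.reader(src):
        # Re-serialise the parsed row; fields with embedded newlines
        # are quoted again by the csv writer.
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(row)
        dst.write(buf.getvalue())
        dst.write(sep)
        count += 1
    return count


# Small in-memory demonstration with the example rows from this thread.
demo_in = io.StringIO('row1,"1\n2\n3"\nrow2,4\n')
demo_out = io.StringIO()
records = reformat(demo_in, demo_out)
```

To use it as a filter in a pipeline, the same function could be called
with sys.stdin and sys.stdout instead of the in-memory buffers.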
> reformat.php is a PHP script that reads the csv and writes it out to
> stdout with @@@ in-between each record.
>
> That seems to work great except I don't believe I can use --pipepart
> with this method because the csv with @@@ is generated on the fly.
True. So it is either:
cat file.csv | php reformat.php | parallel --pipe --recend '@@@' --remove-rec-sep wc
Or:
cat file.csv | php reformat.php >tmpfile
parallel --pipepart --recend '@@@' --remove-rec-sep wc :::: tmpfile
My bet is that the first is faster, as you avoid saving to disk.
> I would have to save the reformatted csv file to disk. Do you have any
> thoughts to get around that? If not, generating a new csv file in
> this format would also be ok for my use case.
How does reformat.php determine where the @@@ should be placed?
Can you use a combination of --recend/--recstart with --regexp to do
the same?
If your records all start the same way, like:
> row1,"1
> 2
> 3"
> row2,4
Then this might work (allowing up to row999999):
--recend '\n' --recstart 'row\d{1,6},' --regexp
It *will* of course break if the quoted string contains
"\nrow123,this is not a new row".
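To make the failure mode concrete, here is a small Python sketch that
imitates the boundary rule (a newline followed by something matching
row\d{1,6},). It uses Python's re module as a stand-in for the regexp
matching; split_records is a made-up helper, and it only shows where
record boundaries may be found, not how parallel groups records into
blocks:

```python
import re

# A boundary sits right after a newline and right before "row<digits>,".
BOUNDARY = re.compile(r"(?<=\n)(?=row\d{1,6},)")


def split_records(text):
    """Split text at every position that looks like a record boundary."""
    return [rec for rec in BOUNDARY.split(text) if rec]


# The example input from this thread: two records, one with newlines.
good = 'row1,"1\n2\n3"\nrow2,4\n'

# A quoted field that happens to contain a line starting with "row123,".
bad = 'row1,"a\nrow123,this is not a new row"\nrow2,4\n'

good_records = split_records(good)  # two records, as intended
bad_records = split_records(bad)    # three pieces: the first record is cut in half
```

The second input really holds two CSV records, but the pattern cannot
tell a quoted line from a real record start, so it is split into three.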
/Ole