Re: big file and quoting


From: Ole Tange
Subject: Re: big file and quoting
Date: Fri, 9 Sep 2011 21:18:47 +0200

On Thu, Sep 8, 2011 at 5:34 PM, LU Zen <zen.lu@roslin.ed.ac.uk> wrote:
> I’m trying to process a big messy csv file with the --pipe option but I keep
> getting errors such as unexpected EOF while looking for matching `"' or awk:
> cmd. line:1:  ^ unexpected newline or end of string. I suspect I’m not using
> the quoting correctly. 2 of the commands I’ve tried:

That is correct. Remember that the shell sees ' in pairs: whenever it
meets another ', the quoting stops.

> $ cat big.csv | parallel --pipe --files 'awk -v FS="\",\"" '{print $1, $3,
> $4, $5, $9, $14}' | grep -v "#" | sed -e '1d' -e 's/\"//g' -e
> 's/\/\/\//\t/g' | cut -f1-6,11 | sed -e 's/\/\//\t/g' -e 's/ /\t/g'' |
> parallel -Xj1 sort -k1 {} ';' rm {} > big_modified_parallel.csv

I have here grouped your line in quoted and unquoted sections:

quoted: 'awk -v FS="\",\"" '
not quoted: {print $1, $3, $4, $5, $9, $14}
quoted: ' | grep -v "#" | sed -e '
not quoted: 1d
quoted: ' -e '
not quoted: s/\"//g
quoted: ' -e '
not quoted: s/\/\/\//\t/g
quoted: ' | cut -f1-6,11 | sed -e '
not quoted: s/\/\//\t/g
quoted: ' -e '
not quoted: s/ /\t/g
quoted: ''

Followed by:

  | parallel -Xj1 sort -k1 {} ';' rm {} > big_modified_parallel.csv

which seems reasonable.
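
To check such a grouping yourself, you can let the shell count the
arguments it actually hands over. This is only a sketch: count_args is
a helper made up for illustration, and the awk program is shortened.

```shell
# Hypothetical helper: print how many arguments the shell passed on.
count_args() { echo "$#"; }

set --   # clear $1 and $3 so the unquoted parts expand to nothing

# Outer ' wrapped around a command that itself contains ':
broken_count=$(count_args 'awk '{print $1, $3}' | wc -l')

# Inner ' replaced by " and \$, so the outer ' pair stays intact:
intact_count=$(count_args 'awk "{print \$1, \$3}" | wc -l')

echo "broken: $broken_count argument(s), intact: $intact_count argument(s)"
```

The first form is split into several words because the inner ' ends
the quoting early and the outer shell expands $1 and $3. The second
form stays a single argument.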

It seems you took a composed command that worked for one file and
just added ' around it. But the composed command itself contains ',
which is what messed up the quoting.

You _can_ give it on the command line. You just need to not use ' as
your quoting character. That means using \ for all special shell
characters (including space, $, " and '). So basically your command
will contain 30% \ and be completely unreadable. This is a good
brainteaser, but no fun if you need to get stuff done.

The solution is here if you give up:

png ovt.pfi | cnenyyry --svyrf --cvcr  njx\ -i\ SF=\"\\\",\\\"\"\
\'\{cevag\ \$1,\ \$3,\ \$4,\ \$5,\ \$9,\ \$14\}\'\ \|\ terc\ -i\
\"\#\"\ \|\ frq\ -r\ \'1q\'\ -r\ \'f/\\\"//t\'\ -r\
\'f/\\/\\/\\//\\g/t\'\ \|\ phg\ -s1-6,11\ \|\ frq\ -r\
\'f/\\/\\//\\g/t\'\ -r\ \'f/\ /\\g/t\'   | cnenyyry -Kw1 fbeg -x1 {}
';' ez {} > ovt_zbqvsvrq_cnenyyry.pfi

(It almost looks the same before rot13 :-)

Or you can use the newest git version of GNU Parallel to help you do
the quoting of the composed command:

cat <<'_EOF' | parallel --shellquote
awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' | grep -v "#" |
sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' | cut -f1-6,11 | sed -e
's/\/\//\t/g' -e 's/ /\t/g'
_EOF
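
The quoted delimiter '_EOF' is what keeps the text safe here: the
shell does no expansion inside such a here-document, so ', " and $
arrive untouched. A minimal sketch of the same idea without GNU
Parallel, assuming a shortened pipeline and two made-up sample lines:

```shell
# Read a command verbatim; the quoted delimiter '_EOF' disables all
# expansion inside the here-document, so ', " and $ need no escaping.
cmd=$(cat <<'_EOF'
awk -v FS="\",\"" '{print $1, $3}' | sed -e 's/"//g'
_EOF
)

# Run the stored command on two sample lines (sample data, not big.csv).
result=$(printf '%s\n' '"a","b","c"' '"d","e","f"' | sh -c "$cmd")
echo "$result"
```

Passing "$cmd" as a single argument to parallel --pipe should behave
like the hand-quoted string, but that part is an assumption and is not
exercised in the sketch.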

Alternatively you can make sure your composed command does not contain
' by using " and \ instead in which case you _can_ use ' as the
quoting character. The solution to that is:

png ovt.pfi | cnenyyry --svyrf --cvcr 'njx -i SF="\",\"" "{cevag \$1,
\$3, \$4, \$5, \$9, \$14}" | terc -i "#" | frq -r 1q -r f/\"//t -r
"f/\/\/\//\g/t" | phg -s1-6,11 | frq -r "f/\/\//\g/t" -r "f/ /\g/t"' |
cnenyyry -Kw1 fbeg -x1 {} ';' ez {} > ovt_zbqvsvrq_cnenyyry.pfi

My advice, though, is to make a small script. It has the added
advantage that you can comment your code if others need to maintain it
later.

#!/bin/bash

awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' |
  grep -v "#" |
  sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' |
  cut -f1-6,11 |
  sed -e 's/\/\//\t/g' -e 's/ /\t/g'

Then you can do:

  cat big.csv | parallel --pipe --files my_script.sh |
    parallel -Xj1 sort -k1 {} ';' rm {} > big_modified_parallel.csv
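
Before running this on the big file, it may be worth checking the
script on a few lines by hand. A sketch, assuming a cut-down stand-in
for my_script.sh (only two of the stages) and invented sample input:

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for my_script.sh with two of the original stages.
cat > "$tmp/my_script.sh" <<'_EOF'
#!/bin/sh
grep -v "#" |
  sed -e '1d'
_EOF
chmod +x "$tmp/my_script.sh"   # the script must be executable to be run directly

# Feed a few sample lines through the script on stdin.
sample_out=$(printf '%s\n' 'header' '# comment' 'x y' 'z w' | "$tmp/my_script.sh")
echo "$sample_out"

rm -r "$tmp"
```

The script reads stdin and writes stdout, which is the same interface
parallel --pipe uses when it feeds each block to the command.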

While you are at it: Why not do the sort in the script? Then you only
need a merge sort (sort -m) at the end. It will most likely be faster:

#!/bin/bash

awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' |
  grep -v "#" |
  sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' |
  cut -f1-6,11 |
  sed -e 's/\/\//\t/g' -e 's/ /\t/g' |
  sort -k1


$ cat big.csv | parallel --pipe --files my_script.sh |
    parallel -Xj1 sort -mk1 {} ';' rm {} > big_modified_parallel.csv
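
sort -m only merges; it expects every input file to be sorted already,
which is why the sort moves into the script. A small sketch with two
made-up, pre-sorted files (LC_ALL=C is set just to make the order
predictable):

```shell
tmp=$(mktemp -d)

# Two files that are each already sorted, like the per-block outputs.
printf '%s\n' apple cherry > "$tmp/part1"
printf '%s\n' banana date  > "$tmp/part2"

# Merge them into one sorted stream without re-sorting either file.
merged=$(LC_ALL=C sort -m -k1 "$tmp/part1" "$tmp/part2")
echo "$merged"

rm -r "$tmp"
```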


/Ole


