Re: big file and quoting
From: Ole Tange
Subject: Re: big file and quoting
Date: Fri, 9 Sep 2011 21:18:47 +0200
On Thu, Sep 8, 2011 at 5:34 PM, LU Zen <zen.lu@roslin.ed.ac.uk> wrote:
> I’m trying to process a big messy csv file with the --pipe option but I keep
> getting errors such as unexpected EOF while looking for matching `"' or awk:
> cmd. line:1: ^ unexpected newline or end of string. I suspect I’m not using
> the quoting correctly. 2 of the commands I’ve tried:
That is correct. Remember that the shell sees ' in pairs: whenever it
meets another ', the quoting stops.
> $ cat big.csv | parallel --pipe --files 'awk -v FS="\",\"" '{print $1, $3,
> $4, $5, $9, $14}' | grep -v "#" | sed -e '1d' -e 's/\"//g' -e
> 's/\/\/\//\t/g' | cut -f1-6,11 | sed -e 's/\/\//\t/g' -e 's/ /\t/g'' |
> parallel -Xj1 sort -k1 {} ';' rm {} > big_modified_parallel.csv
I have here grouped your line in quoted and unquoted sections:
quoted: 'awk -v FS="\",\"" '
not quoted: {print $1, $3, $4, $5, $9, $14}
quoted: ' | grep -v "#" | sed -e '
not quoted: 1d
quoted: ' -e '
not quoted: s/\"//g
quoted: ' -e '
not quoted: s/\/\/\//\t/g
quoted: ' | cut -f1-6,11 | sed -e '
not quoted: s/\/\//\t/g
quoted: ' -e '
not quoted: s/ /\t/g
quoted: ''
Followed by:
| parallel -Xj1 sort -k1 {} ';' rm {} > big_modified_parallel.csv
which seems reasonable.
It seems you took a composed command that worked for one file and
just added ' around it. But the composed command itself contained ',
and that messed up the quoting.
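The pairing can be seen on a much smaller command. The following is only a sketch: count_args is a hypothetical helper made up for this illustration and is not part of GNU Parallel.

```shell
#!/bin/bash
# count_args (hypothetical helper) prints how many arguments it received,
# followed by each argument in brackets.
count_args() {
  printf '%d' "$#"
  printf ' [%s]' "$@"
  printf '\n'
}

# One pair of single quotes: everything inside is a single argument,
# and $1 / $3 are not expanded by the shell.
one=$(count_args 'print $1, $3')

# "Nested" attempt: the second ' closes the first, so the space between
# b and c is outside any quotes and the shell splits the word there.
nested=$(count_args 'a 'b c' d')

echo "$one"     # 1 [print $1, $3]
echo "$nested"  # 2 [a b] [c d]
```

The second call shows the same effect as in your command line: the inner ' does not nest, it ends the quoting.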
You _can_ give it on the command line. You just need to not use ' as
your quoting character. That means using \ for all special shell
characters (including space, $, " and '). So basically your command
will contain 30% \ and be completely unreadable. This is a good
brainteaser, but no fun if you need to get stuff done.
The solution is here if you give up:
png ovt.pfi | cnenyyry --svyrf --cvcr njx\ -i\ SF=\"\\\",\\\"\"\
\'\{cevag\ \$1,\ \$3,\ \$4,\ \$5,\ \$9,\ \$14\}\'\ \|\ terc\ -i\
\"\#\"\ \|\ frq\ -r\ \'1q\'\ -r\ \'f/\\\"//t\'\ -r\
\'f/\\/\\/\\//\\g/t\'\ \|\ phg\ -s1-6,11\ \|\ frq\ -r\
\'f/\\/\\//\\g/t\'\ -r\ \'f/\ /\\g/t\' | cnenyyry -Kw1 fbeg -x1 {}
';' ez {} > ovt_zbqvsvrq_cnenyyry.pfi
(It almost looks the same before rot13 :-)
Or you can use the newest git version of GNU Parallel to help you do
the quoting of the composed command:
cat <<'_EOF' | parallel --shellquote
awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' | grep -v "#" |
sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' | cut -f1-6,11 | sed -e
's/\/\//\t/g' -e 's/ /\t/g'
_EOF
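If the git version is not at hand, bash itself can probably do a similar job. The sketch below uses printf with %q, a bash extension, on a shortened version of the command. Its output is not claimed to be identical to what --shellquote prints. The variable names cmd, quoted and roundtrip are made up for the illustration.

```shell
#!/bin/bash
# Read the composed command verbatim; the quoted 'EOF' stops the shell
# from expanding $1, backslashes or quotes inside the here-document.
cmd=$(cat <<'EOF'
awk -v FS="\",\"" '{print $1, $3}' | grep -v "#" | sed -e 's/\"//g'
EOF
)

# %q (bash extension) emits the string escaped so that the shell
# reads it back as one single word.
quoted=$(printf '%q' "$cmd")
echo "$quoted"

# Round trip: evaluating the quoted form gives the original text back.
eval "roundtrip=$quoted"
[ "$roundtrip" = "$cmd" ] && echo "round trip OK"
```

The quoted string can then be pasted as the single command argument after --pipe.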
Alternatively you can make sure your composed command does not contain
' by using " and \ instead in which case you _can_ use ' as the
quoting character. The solution to that is:
png ovt.pfi | cnenyyry --svyrf --cvcr 'njx -i SF="\",\"" "{cevag \$1,
\$3, \$4, \$5, \$9, \$14}" | terc -i "#" | frq -r 1q -r f/\"//t -r
"f/\/\/\//\g/t" | phg -s1-6,11 | frq -r "f/\/\//\g/t" -r "f/ /\g/t"' |
cnenyyry -Kw1 fbeg -x1 {} ';' ez {} > ovt_zbqvsvrq_cnenyyry.pfi
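The idea of keeping ' out of the composed command can be tried on a toy input first. This is a minimal sketch that uses bash -c as a stand-in for the command string given to parallel; the sample line is invented.

```shell
#!/bin/bash
# The outer '...' is the only use of single quotes. Inside it the awk
# program is wrapped in "..." and its $ is protected with \ so that the
# inner shell passes {print $2} on to awk unchanged.
out=$(printf '"a","b","c"\n' | bash -c 'awk -v FS="\",\"" "{print \$2}"')

echo "$out"   # b
```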
My advice, though, is to make a small script. It has the added
advantage that you can comment your code if others need to maintain it
later.
#!/bin/bash
awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' |
grep -v "#" |
sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' |
cut -f1-6,11 |
sed -e 's/\/\//\t/g' -e 's/ /\t/g'
Then you can do:
cat big.csv | parallel --pipe --files my_script.sh | parallel -Xj1
sort -k1 {} ';' rm {} > big_modified_parallel.csv
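Before running it on the big file, the script can be tried on a line or two. The sketch below assumes a cut-down, two-stage version of the script and an invented sample line. It also assumes that the script must be executable and called by a path the shell can find, which is why it uses chmod +x and a full path.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Cut-down stand-in for my_script.sh. Inside the file the quotes need
# no extra escaping, which is the point of using a script.
cat > "$workdir/my_script.sh" <<'EOF'
#!/bin/bash
# Keep fields 1 and 3 of a "..","..",".." line, then strip leftover quotes.
awk -v FS="\",\"" '{print $1, $3}' |
  sed -e 's/"//g'
EOF
chmod +x "$workdir/my_script.sh"

# Feed one sample line through the script on stdin, as --pipe would.
result=$(printf '"x","y","z"\n' | "$workdir/my_script.sh")
echo "$result"   # x z

rm -r "$workdir"
```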
While you are at it: Why not do the sort in the script? Then you can
merge sort (sort -m) in the end. It will most likely be faster:
#!/bin/bash
awk -v FS="\",\"" '{print $1, $3, $4, $5, $9, $14}' |
grep -v "#" |
sed -e '1d' -e 's/\"//g' -e 's/\/\/\//\t/g' |
cut -f1-6,11 |
sed -e 's/\/\//\t/g' -e 's/ /\t/g' |
sort -k1
cat big.csv | parallel --pipe --files my_script.sh | parallel -Xj1
sort -mk1 {} ';' rm {} > big_modified_parallel.csv
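What sort -m does can be checked on two small files that are already sorted. This is a sketch with invented data; chunk1 and chunk2 stand in for the files written by --files.

```shell
#!/bin/bash
tmpdir=$(mktemp -d)

# Two chunks, each already sorted on its own (as the script would leave them).
printf 'a\nc\ne\n' > "$tmpdir/chunk1"
printf 'b\nd\nf\n' > "$tmpdir/chunk2"

# -m only merges: it assumes every input is sorted and interleaves them,
# so it does not have to sort the whole data set again.
merged=$(sort -m -k1 "$tmpdir/chunk1" "$tmpdir/chunk2")
echo "$merged"

rm -r "$tmpdir"
```

The output is a single sorted stream: a, b, c, d, e, f, one per line.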
/Ole