From: Andrew J. Schorr
Subject: Re: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 09:17:45 -0500
User-agent: Mutt/1.5.21 (2010-09-15)
Hi,
On Thu, Feb 11, 2021 at 10:53:19AM +0100, Ivan Molineris wrote:
> Moreover, one of the biggest drawbacks of gawk in our field is the fact
> that, indicating the columns of the input by numbers often produces hard to
> read scripts.
> For this reason in the wrapper I commonly use it is possible to refer to
> columns not only by number, but also by name.
>
> For example, if a file is composed like this:
>
> chromosome start end
> chr1 241 53521
> chr1 363 43623
> chr2 5243 234562
>
> gawk '{l=$3-$2}'
> can be also written as
> gawk '{l=$end-$start}'
>
> I know that this syntax is not back-compatible, maybe can be improved.
>
> Do you know if someone has reasoned about a feature like this one in the
> past?
Regarding this point: I often have files like this with a
header title row. I typically do something like this:
  gawk '
  NR == 1 {
      for (i = 1; i <= NF; i++)
          m[$i] = i
      # optional: check that all required columns are present
      next
  }
  {
      # to take your example
      l = $m["end"]-$m["start"]
  }'
To me, this is more elegant than hardcoding the column numbers:
  gawk -vstart=2 -vend=3 'NR > 1 {l = $end-$start}'
Regards,
Andy