coreutils

Re: Command-line program to convert 'human' sizes?


From: Pádraig Brady
Subject: Re: Command-line program to convert 'human' sizes?
Date: Fri, 07 Dec 2012 18:09:27 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1

On 12/07/2012 03:07 PM, Assaf Gordon wrote:
Thank you for your feedback.
I'm working on fixing those issues.


Some comments/questions:

Pádraig Brady wrote, On 12/06/2012 06:59 PM:
I noticed this command will core dump:
$ /bin/ls -l | src/numfmt --to-unit=1 --field=5
<snip>
so I'm thinking `numfmt` should support --header too.

I'll add --header.


The following should essentially be a noop with this data,
but notice how the original spacing wasn't taken
into account, and thus the alignment is broken:

$ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to-unit=1 --field=5
-rw-rw-r--.  1 padraig padraig 93787 Aug 23  2011 ABOUT-NLS
-rw-rw-r--.  1 padraig padraig 49630 Dec  6 22:32 aclocal.m4
-rw-rw-r--.  1 padraig padraig 3669 Dec  6 22:29 AUTHORS

I'm a bit wary of adding automatic, heuristic padding: it could lead to
some weird output,
and (when combined with header) it will not produce proper output (because
the header will be skipped, but the lines would be re-padded?).

Wouldn't it be better to either force the user to specify '--padding', or switch from 
'white-space' to an explicit delimiter, and then let "expand" handle the 
expanding correctly?

e.g.
===
$ cat white-space-data.txt | \
     sed 's/  */\t/g' | \
     numfmt --field=5 --delimiter=$'\t' --to=SI | \
     expand > output
===

That doesn't right-align numbers, unfortunately.

Skipping the header means skipping the parsing,
but not necessarily skipping the repadding.
Repadding would handle most common cases easily.
I.e., when converting to one of the autoscaled formats,
these have a fixed width, so you could repad according to that.
In the uncommon case of going from autoscaled human format,
you don't know what any row may increase the column width to,
so you'd lose alignment in that case unless --padding was set.
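
A minimal sketch of the repadding idea, in Python for brevity (numfmt itself
is C). The function name and the `convert` callback are hypothetical, not part
of the patch under review. It right-aligns the converted value in the space
that the original value and its leading blanks occupied, and only widens the
column when the new value does not fit:

```python
import re

def replace_field_keep_width(line, field, convert):
    """Replace the 1-based whitespace-delimited `field` of `line` with
    convert(old_value), right-aligned in the room the old value (plus its
    leading blanks) used to occupy."""
    # Each match is (leading blanks, field text).
    matches = list(re.finditer(r'(\s*)(\S+)', line))
    if field > len(matches):
        return line  # field absent: leave the line untouched
    m = matches[field - 1]
    new = convert(m.group(2))
    width = len(m.group(0))
    # Keep one separating blank if the field had leading blanks originally.
    min_lead = 1 if m.group(1) else 0
    padded = new.rjust(max(width, len(new) + min_lead))
    return line[:m.start()] + padded + line[m.end():]
```

When the converted value is shorter than or equal to the original, the line
length is unchanged and later columns stay put. When it is longer, the column
grows and alignment with other rows is lost, which is the case where an
explicit --padding would still be needed.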

A bit more convoluted, but more reliable?


With this the alignment is broken as before,
but I also notice the differing width output of each number.

$ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=SI --field=5
-rw-rw-r--.  1 padraig padraig 94k Aug 23  2011 ABOUT-NLS
-rw-rw-r--.  1 padraig padraig 50k Dec  6 22:32 aclocal.m4
-rw-rw-r--.  1 padraig padraig 3.7k Dec  6 22:29 AUTHORS


Again this is the automatic padding issue -
For example "94K" vs "3.7K" - should we always pad SI/IEC output to 5 characters (e.g. 
" 94K") even if the user didn't specify padding?
This would conflict with non-whitespace delimiters... e.g.:

Hello:94000:world

Would be converted to:

Hello:<space>94K:world

Which is not intuitive at all.

Or perhaps the whole 'auto' padding should be enabled IFF delimiter is not 
specified (and defaults to white-space) ?

Right, that makes sense.
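
As a rough Python sketch of that rule (all names here are hypothetical, and
the width of 5 is only the figure floated above, not a settled value):
explicit padding always wins, automatic padding applies only when no
delimiter was given, and delimited fields are left unpadded.

```python
def format_field(text, delimiter=None, padding=None, auto_width=5):
    """Return `text` padded according to the rule discussed above.

    delimiter=None stands for the default whitespace splitting.
    """
    if padding is not None:
        # The user asked for a specific width: always honour it.
        return text.rjust(padding)
    if delimiter is None:
        # Whitespace-delimited input: auto-pad to a fixed width.
        return text.rjust(auto_width)
    # Explicit delimiter: never introduce blanks the user did not ask for.
    return text
```

With this, "Hello:94000:world" split on ':' would come back as
"Hello:94K:world", while whitespace-delimited columns get a fixed width.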

Notice in the above I've used capital K for SI.
I think human() from gnulib may be using k for 1000 and K for 1024.
That's non standard and ambiguous and I see no need to do that.

So for IEC we'd have:

$ /bin/ls -l | tail -n+2 | head -n3 | src/numfmt --to=IEC --field=5
-rw-rw-r--.  1 padraig padraig  3.6Ki Dec  6 22:29 AUTHORS


I tried to use 'human_readable()' as-is, but I guess this is not sufficient.
I'll duplicate the code, and modify it to avoid this issue (lower/upper case K, and the 
"i" suffix)

Cool, we can look at merging it back after.
Please modify with a view to explicitly selecting
the new behaviour so as to ease remerging with gnulib.

Another thing I thought of there, was it would be
good to be able to parse number formats that it can generate:

Sounds like two separate (but related) issues:

$ echo '1,234' | src/numfmt --from=auto
src/numfmt: invalid suffix in input '1,234': ',234'

1. Is there already a gnulib function that can accept locale-grouped values? Can the
"xstrtoXXX" functions handle that?

I was thinking you would just strip out
localeconv()->thousands_sep before parsing.
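
The same idea sketched in Python, whose locale.localeconv() exposes the
'thousands_sep' entry of the C structure. The function name and the optional
override parameter are hypothetical, added only so the sketch can be tried
without switching locales:

```python
import locale

def strip_grouping(text, thousands_sep=None):
    """Remove the locale's thousands separator from `text` before parsing."""
    if thousands_sep is None:
        # Reflects whatever locale the process has selected via setlocale().
        thousands_sep = locale.localeconv()['thousands_sep']
    if not thousands_sep:
        # The "C" locale has an empty separator: nothing to strip.
        return text
    return text.replace(thousands_sep, '')
```

Note this strips the separator wherever it appears and does not check that
the digits are grouped correctly.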

$ echo '3.7K' | src/numfmt --from=auto
src/numfmt: invalid suffix in input '3.7K': '.7K'

2. Would you recommend switching the internal representation to doubles (from the
current uintmax_t),
  or just adding special code to detect the decimal point (which, as Bernhard
mentioned, is also locale dependent)?

Yes, I think parsing to doubles would be most general.
There is also the consideration of arbitrary-precision arithmetic,
but again that can be considered later.
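
A small Python sketch of parsing to a double with an optional SI suffix. The
table and function name are hypothetical, and Python's float() always expects
'.' as the decimal point, so the locale question is left out here:

```python
# Hypothetical suffix table: SI multipliers only, for illustration.
SI_MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'G': 1e9, 'T': 1e12}

def parse_human(text):
    """Parse strings such as '3.7K' or '11505426432' into a float.

    Raises ValueError when the text is not a number with an optional
    trailing suffix from SI_MULTIPLIERS.
    """
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in SI_MULTIPLIERS:
        return float(text[:-1]) * SI_MULTIPLIERS[suffix]
    return float(text)
```

One trade-off to keep in mind: a double represents integers exactly only up
to 2^53, which is less than the range of a 64-bit uintmax_t, so very large
inputs would lose precision unless handled separately.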

While I said before it would be better to error rather than warn
on parse error, on consideration it's probably best to write a
warning to stderr on parse error, and leave the original number in place.
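
Sketched in Python, with a hypothetical wrapper name and warning text (the
real message format is not decided in this thread), that behaviour could look
like:

```python
import sys

def convert_or_keep(text, convert, err=sys.stderr):
    """Return convert(text); on a parse error, warn and return text as-is."""
    try:
        return convert(text)
    except ValueError as exc:
        # Report the problem but keep the original field in the output.
        print(f"warning: cannot convert '{text}': {exc}", file=err)
        return text
```

This keeps the rest of the line intact, so one unparsable field does not stop
the remaining input from being processed.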

I'll change the code accordingly.


Regarding Bernhard's comments (from a different email):

Bernhard Voelker wrote, On 12/07/2012 03:25 AM:
On 12/07/2012 12:59 AM, Pádraig Brady wrote:

Therefore this is my first test:
   $ echo 11505426432 | src/numfmt
   11505426432
Hmm, shouldn't it be converting that to a human-readable
number then? ;-)

From Pádraig's original specification ( http://lists.gnu.org/archive/html/coreutils/2012-02/msg00085.html ) I assumed
that the default of both "--from" and "--to" is not to scale, so one needs to explicitly use
"--to" or "--from".

But those defaults can be changed, if you prefer.

Looking at scale_from_args: I'd favor lower-case arguments,
i.e. "si" and "iec" instead of "SI" and "IEC".
WDYT?

I'll change those.


Regarding the help text and documentation:
I copied many of the texts from previous emails (the "Reformat numbers like 
11505426432 to the more human-readable 11G" comes verbatim from one of Jim 
Meyering's emails) - all of them would require better phrasing later.


thanks,
Pádraig.


