bug-gawk

Re: found *yet* another performance issue on Gawk -M - comma formatting


From: arnold
Subject: Re: found *yet* another performance issue on Gawk -M - comma formatting
Date: Tue, 01 Feb 2022 02:20:18 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Hello.

Thank you for the report. Please use the bug-gawk@gnu.org address for
any future reports.

And yes, this report was considerably clearer than your previous one.

I created a file with a number consisting of over 215 million random
digits. I ran this program:

$ cat x.awk
{
        printf("%'f\n", $1)
}

on the file, after compiling gawk for profiling. Here are the
interesting results:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 37.50      0.72     0.72        2   360.01   360.01  mpg_strtoui
 22.40      1.15     0.43        1   430.01   430.01  def_parse_field
 20.83      1.55     0.40       31    12.90    12.90  rs1scan
 19.27      1.92     0.37        1   370.01   370.01  mpg_maybe_float
 ....
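
A scaled-down version of this measurement can be set up roughly as
follows. This is only a sketch: make_digits and digits.txt are
hypothetical names, the digit count is far smaller than the one used
above, and it assumes a system with /dev/urandom. The -M option needs a
gawk built with MPFR/GMP support, and the ' flag only produces
separators in a locale that defines a thousands separator.

```shell
# Hypothetical helper: write N decimal digits plus a newline to a file.
make_digits() {
    n=$1
    out=$2
    {
        # The first digit is fixed at 1 so the number has no leading zero.
        printf '1'
        # Keep only the bytes 0-9 from the random stream, N-1 of them.
        LC_ALL=C tr -dc '0-9' < /dev/urandom | head -c "$((n - 1))"
        printf '\n'
    } > "$out"
}

# Example use (left as comments so the block has no side effects):
#   make_digits 1000000 digits.txt
#   time gawk -M -f x.awk digits.txt > /dev/null
```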

Almost all the time is spent in simply scanning the input for
various purposes before converting it to a GMP value. Actually
formatting the value doesn't take much time.

It is thus not surprising that treating the input as a string
and doing gsub on it is faster.
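
For reference, the string-based approach from the report quoted below
can be wrapped up as in the following sketch. The commas function name
is hypothetical, and plain awk is assumed here in place of gawk -b; the
four substitutions are the ones from the report, with reworded comments.

```shell
# Hypothetical wrapper: group the digits of a non-negative integer string.
commas() {
    printf '%s\n' "$1" | awk '{
        # Put a comma before the trailing run of digits whose length
        # is a multiple of 3 (the run ends at a period or end of string).
        sub(/([0-9][0-9][0-9])+([.]|$)/, ",&")
        # Append a comma to every block of 3 non-separator characters.
        gsub(/[^,.][^,.][^,.]/, "&,")
        # Remove the comma that ends up directly before a period.
        sub(/[,]+[.]+/, ".")
        # Strip any non-digits left at the head and tail.
        gsub(/^[^0-9]+|[^0-9]+$/, "")
        print
    }'
}

commas 1234567      # prints 1,234,567
```

Tracing the substitutions by hand suggests two limits: the final cleanup
also strips a leading minus sign, and a fractional part of three or more
digits gets commas as well. That matches the %'.f use case in the
report, which prints no fractional digits, but it is not a general
replacement for the ' flag.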

If you are doing serious work with numbers with millions of digits,
gawk is not the right tool to be using. You would be better off with
Python or R or some other tool that is specialized for that kind
of work.

Arnold

"Jason C. Kwan" <jasonckwan@yahoo.com> wrote:

> Not sure if those gnu folks care what I report anymore,  but I just
> found out earlier that  just using the
>
>  %'.f , or %\047.f
>
> formatting string in printf( ), with gawk -Mbe , and a 275-million digit
> input, took 2 minutes 8.40 seconds
> 
> The same input, using just a basic gsub( )-based approach in gawk -b,
> yielded the same correct answer in just 27.24 seconds
> 
> if you’re interested in investigating, the 4 lines I came up with to
> replicate the comma formatting in standard gawk are as follows :
>
>
>    sub(/([0-9][0-9][0-9])+([.]|$)/, ",&")  # allocate initial mark, multiple of 3
>   gsub(/[^,.][^,.][^,.]/, "&,")            # batch-process all 3-digit combos
>    sub(/[,]+[.]+/, ".")                    # cleaning-up comma+period instance
>   gsub(/^[^0-9]+|[^0-9]+$/, "")            # fail-safe cleanup at head and tail
>
> is this readable enough ?


