bug-gawk

Re: Forcibly-unbuffered redirect-to-pipe yields terrible perf


From: arnold
Subject: Re: Forcibly-unbuffered redirect-to-pipe yields terrible perf
Date: Thu, 09 Feb 2023 12:06:22 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Hi.

A command line option isn't the only way to go. I have some ideas. I
may send you a patch to try.

FYI, the code does not have explicit tests for "is this a pipe" when it
does flushing; it just tends to flush everything.

I suspect that you see better file I/O performance simply because files
are more efficient than pipes: no context switching, and no limitations
on the size of what can be written at once.

Thanks for the (very cool) description of how you use gawk.  That's really
neat.

Arnold

<alexandre.ferrieux@orange.com> wrote:

> Hi Arnold,
>
> Indeed, knowing the true timeline changes the perspective :)
>
> This move was clearly not a necessity (as in "dictated by a standard like
> POSIX": Mawk doesn't do that). But 33 years down the road, who knows how
> many planes and nuclear power plants rely on it? So it seems the only valid
> choice today is to add a command-line option.
>
> Re scientific computing: I have been toying around for 30 years with various
> scripting languages in fields like AI, voice processing, and network
> performance. I've always preferred expressivity and readability over raw
> performance: this makes sense when you know you can rewrite in C whatever
> subcomponent is the current bottleneck. At the beginning, Prolog+C, then
> Sh+C, then Tcl+C. But after _many_ years in Tcl+C, trying hard to optimize
> the Tcl interpreter itself, I had to admit it: Awk just flies high above. A
> bit less expressive, plenty faster. I ended up rewriting several of my
> C-by-necessity components as Awk few-liners. Maintainability improved
> tenfold, at equal performance.
>
> Today, one example is a 512-GB, 56-CPU server processing network metrology
> information. The RAM is used for Awk hash tables, and the many CPUs allow
> high modularity: many small bricks of processing, multiply connected via
> pipes. 95% of them are written in Awk. We *could* rewrite them in C and win
> a 2-3x perf and/or RAM factor, but we will not, as we need the flexibility
> (we're reinventing the algorithms daily) and readability (an atomic
> processing step in 50 lines of Awk is readable by anyone, even when you
> forgot to document its API).
>
> One last, notable extra: I reuse some of these generic bricks in wildly
> different contexts, like tiny embedded stuff. Yes, they have Gawk most of the
> time :)
>
> Bottom line: Awk is a pillar of exploratory work, now more than ever!
>
>
>
> On 2/9/23 08:59, arnold@skeeve.com wrote:
> > Hi.
> > 
> > Thank you for your note.
> > 
> >> 3697ec5c  Arnold D. Robbins  Thu Jul 15 23:12:49 2010 +0300  Moved to gawk 2.11.
> > 
> > Although this is dated 2010, you'll note the comment that mentions
> > gawk 2.11.  It was in 2010 that I built the Git repo based on older
> > versions. 2.11 dates from approximately 1989, so the change is around 33
> > years old!
> > 
> > Unsurprisingly, I don't remember the details from that long ago.
> > 
> > I suspect that it was to ensure correct semantics when doing
> > things like
> > 
> >     print "some stuff that goes to stdout"
> >     print ... | "some command that sends to stdout"
> >     print "more stuff that goes to stdout"
> > 
> > In such a case, the interleaved output has to come out in the
> > correct order.
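
Editorial sketch, not part of the original message: one way to get a
deterministic order in a case like the one above, independent of how a given
awk buffers pipes, is to flush stdout and close the pipe explicitly. The
function name interleave_demo and the use of "cat" as the command are
illustrative assumptions; this does not show how gawk implements its own
flushing.

```shell
# Hypothetical helper: interleave direct output and piped output in a fixed order.
interleave_demo() {
    awk 'BEGIN {
        print "first"           # written to the stdout of awk itself
        fflush()                # push it out before the child writes anything
        print "middle" | "cat"  # written through a pipe to a child process
        close("cat")            # flush the pipe and wait for cat to exit
        print "last"            # written to the stdout of awk itself again
    }'
}

interleave_demo
```

With the explicit fflush() and close(), the order of the three lines should not
depend on whether the implementation flushes pipes after every print.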
> > 
> > I will investigate possible changes that would enable buffered
> > output to pipes while not breaking any semantics.
> > 
> >> Before this commit, the programmer had the choice; they could call fflush() or
> > 
> > Actually, this is incorrect; fflush() wasn't added to gawk until 3.0,
> > well after the above change.
> > 
> > By the way, you mention that you are using gawk for scientific
> > computing. I'm curious, can you give more detail?
> > 
> > Thanks,
> > 
> > Arnold
> > 
> > <alexandre.ferrieux@orange.com> wrote:
> > 
> >> Hi,
> >>
> >> When writing into a pipe redirection:
> >>
> >>    gawk '{print | "cat > /tmp/foo"}'
> >>
> >> ... gawk *always* handles the pipe as unbuffered. This can be witnessed
> >> with an external "tail -f /tmp/foo".
> >>
> >> This makes gawk completely unusable for any heavy-duty multipipe output,
> >> as CPU time is dominated by single-line write() syscalls.
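
Editorial sketch, not part of the original message: assuming each print to a
pipe is flushed on its own (roughly one write() per print, as described
above), a script can reduce the number of writes by accumulating records and
printing them in batches. The function name batch_to_pipe, the default batch
size of 1000, and "cat" as the downstream command are all illustrative
assumptions; no performance figures are claimed here.

```shell
# Hypothetical workaround: send records to a pipe in batches, not line by line.
batch_to_pipe() {
    awk -v batch="${1:-1000}" '
        {
            buf = buf $0 "\n"            # accumulate the record in memory
            if (++n >= batch) {          # batch is full: one print for many lines
                printf "%s", buf | "cat"
                buf = ""; n = 0
            }
        }
        END {
            if (n > 0)                   # send whatever is left over
                printf "%s", buf | "cat"
            close("cat")                 # flush the pipe and wait for the child
        }'
}

printf 'a\nb\nc\n' | batch_to_pipe 2
```

The trade-off is latency: a downstream reader sees nothing until a batch is
complete, which is exactly the "interactive" versus "efficient" choice
discussed in this thread.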
> >>
> >> By contrast, heavy-duty multifile output *is* supported:
> >>
> >>    gawk '{print  > "/tmp/foo"}'
> >>
> >> ... is fully buffered. What is the logic behind this difference ?
> >> Note: it can be traced to this commit:
> >>
> >> 3697ec5c  Arnold D. Robbins  Thu Jul 15 23:12:49 2010 +0300  Moved to gawk 2.11.
> >>
> >> .. with the following comment:
> >>
> >> >Improved handling of output bufferring:  now all print[f]s redirected to
> >> >a tty or pipe are flushed immediately and non-redirected output to a tty
> >> >is flushed before the next input record is read.
> >>
> >> Before this commit, the programmer had the choice; they could call
> >> fflush() or not, so that both "interactive" and "efficient" use cases were
> >> supported. Afterwards, the choice has disappeared: any write to a pipe is
> >> deemed "interactive", incurring a syscall, and terrible performance.
> >>
> >> Can someone explain why this is an improvement ?
> >>
> >> PS: I do realize this has been the case for 13 years. But maybe it wasn't
> >> spotted before, precisely because Awk was too slow for such heavy-duty
> >> tasks back in the day. Now things are different: Awk is a serious candidate
> >> for scientific computing, and such details are just starting to be a problem.
> >>
> >> _________________________________________________________________________________________________________________________
> >>
> >> Ce message et ses pieces jointes peuvent contenir des informations 
> >> confidentielles ou privilegiees et ne doivent donc
> >> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez 
> >> recu ce message par erreur, veuillez le signaler
> >> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
> >> electroniques etant susceptibles d'alteration,
> >> Orange decline toute responsabilite si ce message a ete altere, deforme ou 
> >> falsifie. Merci.
> >>
> >> This message and its attachments may contain confidential or privileged 
> >> information that may be protected by law;
> >> they should not be distributed, used or copied without authorisation.
> >> If you have received this email in error, please notify the sender and 
> >> delete this message and its attachments.
> >> As emails may be altered, Orange is not liable for messages that have been 
> >> modified, changed or falsified.
> >> Thank you.
> >>


