Re: Forcibly-unbuffered redirect-to-pipe yields terrible perf
From: arnold
Subject: Re: Forcibly-unbuffered redirect-to-pipe yields terrible perf
Date: Thu, 09 Feb 2023 12:06:22 -0700
User-agent: Heirloom mailx 12.5 7/5/10
Hi.
A command line option isn't the only way to go. I have some ideas. I
may send you a patch to try.
FYI, the code does not explicitly test "is this a pipe" when it flushes;
it just tends to flush everything.
I suspect that you see better file I/O performance simply because files
are more efficient than pipes: no context switching, and no limitations
on the size of what can be written at once.
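
For anyone who wants to reproduce the comparison, here is a minimal sketch
rather than a benchmark. The AWK and N variables, the gen/via_pipe/via_file
function names, and the temporary file names are all made up for
illustration; it also assumes the temporary directory path contains no
spaces or shell metacharacters.

```shell
#!/bin/sh
# AWK and N are illustrative knobs, not anything from the original report.
AWK=${AWK:-awk}
N=${N:-20000}
tmpdir=$(mktemp -d)

# Emit N numbered lines without depending on seq(1).
gen() {
    "$AWK" -v n="$N" 'BEGIN { for (i = 1; i <= n; i++) print "line " i }'
}

# Output through a pipe redirection (the slow case in the report).
via_pipe() {
    gen | "$AWK" -v out="$tmpdir/pipe.out" '
        BEGIN { cmd = "cat > " out }
        { print | cmd }
        END { close(cmd) }          # wait for cat to finish writing
    '
}

# Output through a plain file redirection (the fast case in the report).
via_file() {
    gen | "$AWK" -v out="$tmpdir/file.out" '{ print > out }'
}

via_pipe
via_file

# Both redirections should produce byte-identical files.
if cmp -s "$tmpdir/pipe.out" "$tmpdir/file.out"; then
    echo "outputs identical"
fi
```

To compare the cost of the two paths, run each function under the shell's
time keyword (in bash: "time via_pipe", then "time via_file"), or count
write() calls with a syscall tracer if one is available. Set AWK=gawk to
test gawk specifically, and remove "$tmpdir" when done.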
Thanks for the (very cool) description of how you use gawk. That's really
neat.
Arnold
<alexandre.ferrieux@orange.com> wrote:
> Hi Arnold,
>
> Indeed knowing the true timeline changes the perspective :)
>
> As much as this move was not a necessity (as in "dictated by a standard
> like POSIX": Mawk doesn't do that), 33 years down the road, who knows how
> many planes and nuclear power plants rely on it? So it seems the only
> valid choice today is to add a command-line option.
>
> Re scientific computing: I have been toying around for 30 years with various
> scripting languages in fields like AI, voice processing, and network
> performance. I've always preferred expressivity and readability over raw
> performance: this makes sense when you know you can rewrite in C whatever
> subcomponent is the current bottleneck. At the beginning, Prolog+C, then
> Sh+C, then Tcl+C. But after _many_ years in Tcl+C, trying hard to optimize
> the Tcl interpreter itself, I had to admit it: Awk just flies high above. A
> bit less expressive, plenty faster. I ended up rewriting several of my
> C-by-necessity components as Awk few-liners. Maintainability improved
> tenfold, at equal performance.
>
> Today, one example is a 512-GB, 56-CPU server processing network metrology
> information. The RAM is used for Awk hash tables, and the many CPUs allow
> high modularity: many small bricks of processing, multiply connected via
> pipes. 95% of them are written in Awk. We *could* rewrite them in C and win
> a 2-3x perf and/or RAM factor, but we will not, as we need the flexibility
> (we're reinventing the algorithms daily) and readability (an atomic
> processing step in 50 lines of Awk is readable by anyone, even when you
> forgot to document its API).
>
> One last, noticeable extra: I reuse some of these generic bricks in wildly
> different contexts, like tiny embedded stuff. Yes, they have Gawk most of the
> time :)
>
> Bottom line: Awk is a pillar of exploratory work, now more than ever!
>
>
>
> On 2/9/23 08:59, arnold@skeeve.com wrote:
> > Hi.
> >
> > Thank you for your note.
> >
> >> 3697ec5c Arnold D. Robbins Thu Jul 15 23:12:49 2010 +0300 Moved to gawk 2.11.
> >
> > Although this is dated 2010, you'll note the comment that mentions
> > gawk 2.11. It was in 2010 that I built the Git repo based on older
> > versions. 2.11 dates from approximately 1989, so the change is around 33
> > years old!
> >
> > Unsurprisingly, I don't remember the details from that long ago.
> >
> > I suspect that it was to ensure correct semantics when doing
> > things like
> >
> > print "some stuff that goes to stdout"
> > print ... | "some command that sends to stdout"
> > print "more stuff that goes to stdout"
> >
> > In such a case, the interleaved output has to come out in the
> > correct order.
> >
> > I will investigate possible changes that would enable buffered
> > output to pipes while not breaking any semantics.
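
As an illustration of the ordering concern above, here is a small sketch
of how a script can keep interleaved output in source order on its own,
using an explicit fflush() before the pipe is opened and close() after it.
The AWK variable, the "ordered" variable, and the message strings are
illustrative only, and "cat" merely stands in for any command that writes
to stdout; this does not describe how gawk handles it internally.

```shell
#!/bin/sh
# AWK is an illustrative knob; any awk that supports fflush() should do.
AWK=${AWK:-awk}

ordered=$("$AWK" 'BEGIN {
    print "first: straight to stdout"
    fflush()                      # push our own buffered line out first
    print "second: through a pipe" | "cat"
    close("cat")                  # wait until cat has written and exited
    print "third: straight to stdout"
}')

printf '%s\n' "$ordered"
```

With the explicit fflush() and close() calls, the three lines should come
out in source order even when stdout is not a terminal. Dropping either
call makes the result depend on the implementation's flushing policy.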
> >
> >> Before this commit, the programmer had the choice; they could call fflush() or
> >
> > Actually, this is incorrect; fflush() wasn't added to gawk until 3.0,
> > well after the above change.
> >
> > By the way, you mention that you are using gawk for scientific
> > computing. I'm curious, can you give more detail?
> >
> > Thanks,
> >
> > Arnold
> >
> > <alexandre.ferrieux@orange.com> wrote:
> >
> >> Hi,
> >>
> >> When writing into a pipe redirection:
> >>
> >> gawk '{print | "cat > /tmp/foo"}'
> >>
> >> ... gawk *always* handles the pipe as unbuffered. This can be witnessed
> >> with an external "tail -f /tmp/foo".
> >>
> >> This makes gawk completely unusable for any heavy-duty multipipe output,
> >> as CPU time is dominated by single-line write() syscalls.
> >>
> >> By contrast, heavy-duty multifile output *is* supported:
> >>
> >> gawk '{print > "/tmp/foo"}'
> >>
> >> ... is fully buffered. What is the logic behind this difference?
> >> Note: it can be traced to this commit:
> >>
> >> 3697ec5c Arnold D. Robbins Thu Jul 15 23:12:49 2010 +0300 Moved to gawk 2.11.
> >>
> >> ... with the following comment:
> >>
> >> >Improved handling of output bufferring: now all print[f]s redirected to
> >> >a tty or pipe are flushed immediately and non-redirected output to a tty
> >> >is flushed before the next input record is read.
> >>
> >> Before this commit, the programmer had the choice; they could call
> >> fflush() or not, so that both "interactive" and "efficient" use cases
> >> were supported. Afterwards, the choice has disappeared: any write to a
> >> pipe is deemed "interactive", incurring a syscall, and terrible
> >> performance.
> >>
> >> Can someone explain why this is an improvement?
> >>
> >> PS: I do realize this has been the case for 13 years. But maybe it wasn't
> >> spotted before, precisely because Awk was too slow for such heavy-duty
> >> tasks, back in the day. Now things are different: Awk is a serious
> >> candidate for scientific computing, and such details are just starting to
> >> be a problem.
> >>
> >> _________________________________________________________________________________________________________________________
> >>
> >> Ce message et ses pieces jointes peuvent contenir des informations
> >> confidentielles ou privilegiees et ne doivent donc
> >> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez
> >> recu ce message par erreur, veuillez le signaler
> >> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages
> >> electroniques etant susceptibles d'alteration,
> >> Orange decline toute responsabilite si ce message a ete altere, deforme ou
> >> falsifie. Merci.
> >>
> >> This message and its attachments may contain confidential or privileged
> >> information that may be protected by law;
> >> they should not be distributed, used or copied without authorisation.
> >> If you have received this email in error, please notify the sender and
> >> delete this message and its attachments.
> >> As emails may be altered, Orange is not liable for messages that have been
> >> modified, changed or falsified.
> >> Thank you.
> >>
>