
Re: [bug-gawk] An gawk problem.


From: Andrew J. Schorr
Subject: Re: [bug-gawk] An gawk problem.
Date: Sat, 19 Oct 2013 09:22:37 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

Hi Guangzong,

As per my first response to you, have you established whether the problem is
with gawk or sort?  The gawk code you provided should be O(N), unless there's a
problem with the hash implementation of the associative array, but the sort is
O(N log N).  My guess is that the sorting is the problem.  What you really want
for order statistics is a heap.  With a heap that holds only the top K entries
(K = 1000 here), the selection step drops to O(N log K).
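
To make the heap idea concrete, here is a minimal sketch in plain awk.  It
keeps a min-heap of at most K (count, name) pairs, so the smallest of the
current top K sits at the root and is replaced whenever a larger count turns
up.  The function name top_k_ips and its two arguments are made up for this
illustration, and the order among equal counts is unspecified.  Note that the
count array still holds every distinct key, so this only replaces the sort; it
does not by itself reduce the memory used by the counting pass.

```shell
# top_k_ips FILE K  (hypothetical helper, not part of gawk)
# Prints the K most frequent first fields of FILE, most frequent first.
top_k_ips() {
    awk -v K="$2" '
    # The heap lives in two parallel arrays: hc[i] = count, hn[i] = name.
    function swap(i, j,    t) {
        t = hc[i]; hc[i] = hc[j]; hc[j] = t
        t = hn[i]; hn[i] = hn[j]; hn[j] = t
    }
    # Move slot i up until its parent is no larger (min-heap property).
    function sift_up(i,    p) {
        while (i > 1) {
            p = int(i / 2)
            if (hc[p] <= hc[i]) break
            swap(i, p); i = p
        }
    }
    # Move slot i down until both children are no smaller.
    function sift_down(i,    l, r, m) {
        while (1) {
            l = 2 * i; r = l + 1; m = i
            if (l <= n && hc[l] < hc[m]) m = l
            if (r <= n && hc[r] < hc[m]) m = r
            if (m == i) break
            swap(i, m); i = m
        }
    }
    { count[$1]++ }
    END {
        for (name in count) {
            c = count[name]
            if (n < K) {
                # Heap not full yet: append and restore order.
                n++; hc[n] = c; hn[n] = name; sift_up(n)
            } else if (K > 0 && c > hc[1]) {
                # Larger than the smallest kept entry: replace the root.
                hc[1] = c; hn[1] = name; sift_down(1)
            }
        }
        # Pop the minimum repeatedly, filling out[] from the back,
        # so that out[1] ends up holding the most frequent name.
        total = n
        while (n > 0) {
            out[n] = hn[1]
            swap(1, n); n--; sift_down(1)
        }
        for (i = 1; i <= total; i++) print out[i]
    }' "$1"
}
```

Something like "top_k_ips ip_address.txt 1000" would then stand in for the
whole "sort -nr | head -n1000 | awk" tail of the pipeline.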

If you run the first gawk command in the pipeline alone, how long does it take?
If you then run the sort separately, how long does that take?
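
One possible way to take those two measurements is sketched below.  It writes
the counts to a temporary file, then sorts that file, and reports the
wall-clock seconds spent in each step on stderr.  The function name
time_stages is made up for this example, and it assumes that mktemp and
"date +%s" are available, as they are on a typical Linux system.

```shell
# time_stages FILE  (hypothetical helper)
# Runs the counting pass and the sort as separate steps, prints the top
# entries on stdout and the elapsed seconds for each step on stderr.
time_stages() {
    counts=$(mktemp)

    start=$(date +%s)
    # Step 1: the counting pass from the original pipeline, on its own.
    awk '{ NAME[$1]++ } END { for (name in NAME) print NAME[name], name }' \
        "$1" > "$counts"
    mid=$(date +%s)

    # Step 2: sort the saved counts and keep the top 1000 names.
    sort -nr "$counts" | head -n 1000 | awk '{ print $2 }'
    end=$(date +%s)

    echo "awk count: $((mid - start))s, sort and head: $((end - mid))s" >&2
    rm -f "$counts"
}
```

Running "time_stages ip_address.txt > top1000.txt" should show which of the
two steps dominates.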

Regards,
Andy

On Sat, Oct 19, 2013 at 12:16:33PM +0800, 蔡光宗 wrote:
>  I have updated my gawk to "GNU Awk 4.0.1" and tried again (splitting the
> previous command chain into separate steps, and using top/htop to see what
> is happening). But the situation is still the same.
> 
> I never thought this would happen; I just want to use gawk to solve my
> problem.
> 
> If you don't believe me, or think I may have made a mistake in some step,
> you can try it on your computer.
> 
> The format of the file (ip_address.txt) is as follows:
> 192.168.10.1
> 192.168.10.2
> 192.168.10.3
> ...
> 192.168.10.255
> ...
> ...
> ...
> 253.253.253.253
> 
> When the file is small (about 200MB), everything is OK. But when the file
> becomes large, the problem occurs.
> 
> Regards,
> Guangzong
> 
> 
> 2013/10/18 蔡光宗 <address@hidden>
> 
> > Thank you for your reply. I did not imagine that you would reply so
> > quickly, so I am late with this response.
> > I am very sorry for my late reply.
> >
> > The following are the details:
> >
> > 1. The version of gawk :
> > $ gawk --version
> >  GNU Awk 3.1.8
> >
> > 2. The operating system I am using:
> > Ubuntu 12.04
> >  ($ uname -a
> > Linux cruzer-online 3.2.0-51-generic-pae #77-Ubuntu SMP Wed Jul 24
> > 20:40:32 UTC 2013 i686 i686 i386 GNU/Linux
> >  )
> > 3. Where I got gawk:
> > It came with the system when I installed it on my computer.
> >
> > Sorry again for my late email.
> >
> >
> >
> > 2013/10/18 Andrew J. Schorr <address@hidden>
> >
> >> Hi,
> >>
> >>
> >> > > Date: Fri, 18 Oct 2013 17:41:54 +0800
> >> > > Subject: An gawk problem.
> >> > > From: 蔡光宗 <address@hidden>
> >> > > To: address@hidden
> >> > >
> >> > > Hi, Dear Arnold:
> >> > > Thank you for your work on gawk, and thanks for this useful tool so
> >> > > that I can do some things easily.
> >> > > But recently I have had a problem with gawk. I am not sure that this
> >> > > can be called a bug, because I can't determine whether it is related
> >> > > to my machine's performance. The following is my problem:
> >> > >
> >> > > awk '{NAME[$1]++}; END {for (name in NAME) print NAME[name], name}'
> >> > > ip_address.txt | sort -nr | head -n1000 | awk '{print $2}'
> >> > >
> >> > > I want to use the command above to get the top 1000 visitors' IP
> >> > > addresses. When ip_address.txt is small, everything is OK. But when
> >> > > the size of ip_address.txt reaches 1 GByte, the command above
> >> > > produces no results and is still running after 3 days. (Meanwhile,
> >> > > the memory of my computer has been swallowed up.)
> >>
> >> Is it gawk or sort that is taking so long?  If on Linux, you can use "top"
> >> or "ps" to try to figure that out.  Or change your command to tell you
> >> when the initial awk command has completed:
> >>
> >> awk '{NAME[$1]++}; END {for (name in NAME) print NAME[name], name;
> >> print "Debug: awk first pass finished" > "/dev/stderr"}' ip_address.txt |
> >> sort -nr | head -n1000 | awk '{print $2}'
> >>
> >> Regards,
> >> Andy
> >>
> >


