[debbugs-tracker] bug#21763: closed (poor performance since grep 2.19 wh

emacs-bug-tracker

From:	GNU bug Tracking System
Subject:	[debbugs-tracker] bug#21763: closed (poor performance since grep 2.19 when comparing files with grep)
Date:	Wed, 21 Dec 2016 06:46:02 +0000

Your message dated Tue, 20 Dec 2016 21:17:01 -0800
with message-id <address@hidden>
and subject line Re: bug#22357: grep -f not only huge memory usage, but also 
huge time cost
has caused the debbugs.gnu.org bug report #21763,
regarding poor performance since grep 2.19 when comparing files with grep
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
21763: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21763
GNU Bug Tracking System
Contact address@hidden with problems

--- Begin Message ---
Subject:	poor performance since grep 2.19 when comparing files with grep
Date:	Mon, 26 Oct 2015 12:54:01 +0000

Apologies in advance if this is more of a "discuss" question, but it looks like 
a particular use-case shows a marked change in performance between recent 
versions of grep.

A colleague mentioned a performance issue with grep to me, and it's puzzling me 
a bit.
It turns out that he was using "grep -Fvif" to find lines in one file that are 
not present in another.

Up until grep 2.18 this seems to work with linear performance, and it takes less 
than 50ms to compare files up to about 20,000 lines.
With grep 2.19 and later, even relatively small files are quite slow: runtime 
(and memory use) grows much faster than linearly (e.g. 300ms to compare 200 
lines, 1.5s to compare 400 lines, 5s to compare 600 lines).

I've shown my colleague how to use sort and diff (and "comm", which I think is 
vastly underrated), but it made me wonder if this is a reasonable thing to 
expect grep to be able to do, and whether such a performance drop should be 
seen as a bug.

The way he was using it, he had two (unsorted) data sets (about 6000 rows in 
each), with most lines being common, and he was just using:
    grep -Fvif FILE1 FILE2
In his case, the older version of grep took well under a second to run, but 
after he upgraded his machine it ran for 20 minutes before running out of swap 
and segfaulting.
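
For reference, a minimal sketch of the sort-and-comm approach mentioned above. 
It is not an exact replacement for "grep -Fvif": it assumes that whole-line, 
case-sensitive comparison is acceptable, and it prints its result sorted and 
de-duplicated rather than in FILE2's original order. The function name 
lines_not_in is hypothetical.

```shell
# lines_not_in FILE1 FILE2
# Print the lines of FILE2 that do not occur (as whole lines) in FILE1.
# Assumption: exact, case-sensitive, whole-line matching is wanted, unlike
# grep -F -i, which matches substrings case-insensitively.
lines_not_in() (
    file1=$1
    file2=$2
    sorted1=$(mktemp) || exit 1
    sorted2=$(mktemp) || { rm -f "$sorted1"; exit 1; }

    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$file1" > "$sorted1"
    LC_ALL=C sort -u "$file2" > "$sorted2"

    # -1 drops lines only in FILE1, -3 drops lines common to both,
    # leaving the lines that appear only in FILE2.
    LC_ALL=C comm -13 "$sorted1" "$sorted2"
    status=$?

    rm -f "$sorted1" "$sorted2"
    exit "$status"
)
```

Sorting costs O(n log n), so this should stay fast for files of a few 
thousand lines regardless of the grep version installed.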

To compare performance across versions, I've found that the following works 
(vary N to try different sized data files):
    N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F
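
The same test can be wrapped in a function so that several sizes can be tried 
in one go. This is only a sketch: the name bench_grep is hypothetical, it uses 
mktemp instead of /tmp/zz.$$, and it assumes a seq that supports -f (as GNU seq 
does).

```shell
# bench_grep N
# Build an N-line data file and match it against itself with grep -Fvif.
# Every line matches itself, so -v should select nothing and the printed
# count should be 0; the interesting part is how long that takes.
bench_grep() (
    n=$1
    data=$(mktemp) || exit 1
    seq -f '%g bottles of beer on the wall' 1 "$n" > "$data"
    grep -Fvif "$data" "$data" | wc -l
    rm -f "$data"
)
```

Under bash, something like 'for n in 200 400 600; do time bench_grep "$n"; done' 
would then show how the runtime grows with N.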

Steve.

--- End Message ---

--- Begin Message ---
Subject:	Re: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date:	Tue, 20 Dec 2016 21:17:01 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1

I installed the attached patches into grep master. These fix the performance 
regressions noted at the start of Bug#22357. I see that the related performance 
problems noted in Bug#21763 seem to be fixed too, I expect because of Norihiro 
Tanaka's recent changes, so I'll boldly close both bug reports.

To some extent the attached patches restore the old behavior for grep -F, when 
grep is given two or more patterns. The patch doesn't change the underlying 
algorithms; it merely uses a different heuristic to decide whether to use the -F 
matcher. Although I wouldn't be surprised if the attached patches hurt 
performance in some cases, I didn't uncover any such cases in my performance 
testing, which I admit mostly consisted of running the examples in the 
abovementioned bug reports.

I'll leave Bug#22239 open, as I get the following performance figures 
(user+system CPU time) for the Bug#22239 benchmark, where list.txt is created by 
"aspell dump master | head -n 100000 >list.txt", and the grep commands all use 
the operands "-F -f list.txt /etc/passwd" in the en_US.utf8 locale on Fedora 24 
x86-64.

  no -i       -i       grep version
   0.25      0.33      2.16
   0.26     10.95      2.21
   0.11      2.90*     current master (including attached patches)

In the C locale, the current grep master is always significantly faster than 
grep 2.16 or 2.21 on the benchmark, so the only significant problem is the 
number marked "*". I ran the benchmarks on an AMD Phenom II X4 910e.

0001-grep-simplify-line-counting-in-patterns.patch
Description: Text Data

0002-grep-simplify-matcher-configuration.patch
Description: Text Data

0003-grep-fix-performance-with-multiple-patterns.patch
Description: Text Data

--- End Message ---
