bug-grep
bug#22357: grep -f not only huge memory usage, but also huge time cost


From: Trevor Cordes
Subject: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date: Fri, 9 Dec 2016 01:24:19 -0600
User-agent: Mutt/1.7.1 (2016-10-04)

I bisected this bug to commits:
662b19f2d0edc0bf07f1a2c1421080252df4a37c
468d5217ed6ec1679512dec208c7f30fb8612957
(I can't narrow it down further because the latter doesn't compile for me)

This bug has hit me hard.  I have a script that wants to do:

grep -w -f /usr/share/dict/words /tmp/greptest
(good older version: completes in 2 seconds with minimal memory)
(any version after the above commits: 10 or more minutes at 100% CPU with 
1.2GB RAM usage; I never waited for it to finish)
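
For anyone who wants to reproduce this without tying up a terminal, here is 
a minimal sketch that runs the same kind of command under a wall-clock 
limit. The helper name time_grep_f is hypothetical, and it assumes the 
timeout utility from GNU coreutils is installed. An exit status of 124 means 
the limit was hit; 0 and 1 are grep's normal match / no-match statuses.

```shell
#!/bin/sh
# time_grep_f PATTERNS INPUT [LIMIT_SECONDS]
# Hypothetical helper: runs `grep -w -f PATTERNS INPUT` under `timeout`
# and reports grep's exit status plus the elapsed wall-clock seconds.
time_grep_f() {
    patterns=$1        # file with one pattern per line
    input=$2           # file to search
    limit=${3:-60}     # give up after this many seconds (default 60)

    start=$(date +%s)
    timeout "$limit" grep -w -f "$patterns" "$input" >/dev/null
    status=$?          # 0 = match, 1 = no match, 124 = timed out
    end=$(date +%s)

    echo "exit=$status elapsed=$((end - start))s"
}
```

For the case in this report that would be something like 
time_grep_f /usr/share/dict/words /tmp/greptest 600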

Even if /tmp/greptest is empty or has only 1 word in it, the command does 
not finish for me within 10 minutes at 100% CPU (I didn't wait longer).  It 
takes 1.2GB of RAM.

If I use the grep version before those commits my sample command above 
runs in 2s!  2s!!!  And it doesn't use up 1.25GB.  The words file is only 
5MB!

It's clear that the commit (which is very simple/tiny) is switching grep 
into a different mode than it used to use, and this mode performs horribly 
with large -f pattern files.  I tested the latest HEAD as of yesterday and 
the bug persists.

This bug, and what are almost certainly duplicates (#21763, #22239), can 
probably be fixed just by backing out or fixing the above commits.

I tried everything suggested in those bugs and I want to note that in my 
case I have always had all my locale env vars set to C:
$ locale
LANG=C
LC_CTYPE="C"
...
LC_ALL=C

The locale doesn't change my results.

My results also do not change if I use -i or not (unlike #22239).
My results also do not change if I use -F or not.

I also want to mention that the commit clearly shows something is faulty 
with the detection, and it is probably causing contains_encoding_error() to 
be true even though the file in my test, /usr/share/dict/words (like the 
inputs of all the (seq x) tests in this bug report), has no multibyte 
chars, only ASCII!  How can a pure ASCII file possibly contain an encoding 
error?  So something in this whole commit's logic is just plain wrong.
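
To back up the "only ASCII" claim on your own machine, here is a small 
sketch that counts the bytes outside the 7-bit ASCII range in a file. The 
function name count_non_ascii is hypothetical. It only inspects the file's 
bytes; it does not tell you what grep's contains_encoding_error() actually 
returns for the patterns.

```shell
#!/bin/sh
# count_non_ascii FILE
# Hypothetical helper: prints how many bytes in FILE are >= 0x80.
# tr deletes every byte in the range 0x00-0x7F (octal 000-177), so only
# non-ASCII bytes are left for wc to count.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 from count_non_ascii /usr/share/dict/words would mean the 
pattern file is pure ASCII.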

I would strongly suggest that everyone would be happier if, when grep 
thinks it has a file in an encoding it can't deal with, it just errored out 
and aborted, rather than switching to a mode that turns a run that has 
taken 2s (for the past 10+ years) into hours and RAM exhaustion!  Then the 
user can simply clean up the input files to make them compliant.  That 
sounds reasonable to me, as no program is required to deal with garbage 
input.

Or we need a switch that can disable this bogus mode switch. 
--i-know-what-i-am-doing-stay-in-2s-mode-not-2h-mode

I am able to work on testing patches rapidly if people want to throw ideas 
my way.  Until this gets fixed I'll just have to maintain my own binary 
rpm that reverses that commit.

Thanks!




