emacs-bug-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[debbugs-tracker] bug#19358: closed ([PATCH 1/3] grep: use Aho-Corasick


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#19358: closed ([PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words)
Date: Thu, 02 Jun 2016 22:46:01 +0000

Your message dated Thu, 2 Jun 2016 15:45:27 -0700
with message-id <address@hidden>
and subject line Re: grep: use Aho-Corasick algorithm to search multiple fixed 
words
has caused the debbugs.gnu.org bug report #19358,
regarding [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed 
words
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
19358: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19358
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message --- Subject: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words Date: Sat, 13 Dec 2014 00:58:02 +0900
Hi,

Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
it is very slow for a worst case e.g. as following.  It is O(m*n).

  - input: yes `printf %040d` | head -10000000
  - word1: x0000000000000000000
  - word2: x

This change uses Aho-Corasick algorithm instead of Commentz-Walter
algorithm to search multiple fixed words.  It uses a function to build
tries which has been already defined for Commentz-Walter algorithm in
kwset.c and which has been already high quality.

I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change.

First without this change (best-of-5 trials):

    find /usr/share/doc/ -type f |
    LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words 
>/dev/null
        real 11.37      user 11.03      sys 0.24

Next with this change (best-of-5 trials):

    find /usr/share/doc/ -type f |
    LC_ALL=C  time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words 
>/dev/null
        real 1.49       user 1.31       sys 0.15

I also wrote two additional patches.

Second, If search multiple fixed words, grep immediately returns without
longest match if not needed.  Without this change, grep tries longest
match for multiple words even if not needed.

Third, use memchr2 for two patterns of a character.

Thanks,
Norihiro

Attachment: 0001-grep-use-Aho-Corasick-algorithm-to-search-multiple-f.patch
Description: Text document

Attachment: 0002-grep-immediately-return-without-longest-match-to-sea.patch
Description: Text document

Attachment: 0003-use-memchr2-for-two-patterns-of-a-character.patch
Description: Text document


--- End Message ---
--- Begin Message --- Subject: Re: grep: use Aho-Corasick algorithm to search multiple fixed words Date: Thu, 2 Jun 2016 15:45:27 -0700 User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 Sorry that patch took so long to review. I installed it, along with the attached followup patches which are mostly just minor style things (plus fixing the attribution for a patch that I forgot to specify --author for).

I didn't get as much performance improvement on my platform, so I toned down the NEWS item a bit. Still, wow. It is a 2.5x performance improvement for that test case, and it's asymptotically better. Thanks.

Attachment: 0001-grep-minor-cleanups-for-F-Aho-Corasick.patch
Description: Source code patch

Attachment: 0002-grep-simplify-F-Aho-Corasick-a-bit.patch
Description: Source code patch

Attachment: 0003-maint-correct-attribution.patch
Description: Source code patch


--- End Message ---

reply via email to

[Prev in Thread] Current Thread [Next in Thread]