|
From: | GNU bug Tracking System |
Subject: | [debbugs-tracker] bug#19358: closed ([PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words) |
Date: | Thu, 02 Jun 2016 22:46:01 +0000 |
Your message dated Thu, 2 Jun 2016 15:45:27 -0700 with message-id <address@hidden> and subject line Re: grep: use Aho-Corasick algorithm to search multiple fixed words has caused the debbugs.gnu.org bug report #19358, regarding [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words to be marked as done. (If you believe you have received this mail in error, please contact address@hidden) -- 19358: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19358 GNU Bug Tracking System Contact address@hidden with problems
--- Begin Message ---Subject: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words Date: Sat, 13 Dec 2014 00:58:02 +0900 Hi, Searching multiple fixed words, grep uses Commentz-Walter algorithm, but it is very slow for a worst case e.g. as following. It is O(m*n). - input: yes `printf %040d` | head -10000000 - word1: x0000000000000000000 - word2: x This change uses Aho-Corasick algorithm instead of Commentz-Walter algorithm to search multiple fixed words. It uses a function to build tries which has been already defined for Commentz-Walter algorithm in kwset.c and which has been already high quality. I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5 by this change. First without this change (best-of-5 trials): find /usr/share/doc/ -type f | LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null real 11.37 user 11.03 sys 0.24 Next with this change (best-of-5 trials): find /usr/share/doc/ -type f | LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null real 1.49 user 1.31 sys 0.15 I also wrote two additional patches. Second, If search multiple fixed words, grep immediately returns without longest match if not needed. Without this change, grep tries longest match for multiple words even if not needed. Third, use memchr2 for two patterns of a character. Thanks, Norihiro0001-grep-use-Aho-Corasick-algorithm-to-search-multiple-f.patch
Description: Text document0002-grep-immediately-return-without-longest-match-to-sea.patch
Description: Text document0003-use-memchr2-for-two-patterns-of-a-character.patch
Description: Text document
--- End Message ---
--- Begin Message ---Subject: Re: grep: use Aho-Corasick algorithm to search multiple fixed words Date: Thu, 2 Jun 2016 15:45:27 -0700 Sorry that patch took so long to review. I installed it, along with the attached followup patches which are mostly just minor style things (plus fixing the attribution for a patch that I forgot to specify --author for). User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 I didn't get as much performance improvement on my platform, so I toned down the NEWS item a bit. Still, wow. It is a 2.5x performance improvement for that test case, and it's asymptotically better. Thanks.0001-grep-minor-cleanups-for-F-Aho-Corasick.patch
Description: Source code patch0002-grep-simplify-F-Aho-Corasick-a-bit.patch
Description: Source code patch0003-maint-correct-attribution.patch
Description: Source code patch
--- End Message ---
[Prev in Thread] | Current Thread | [Next in Thread] |