bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#16966: [PATCH] grep: optimization with the superset of DFA


From: Norihiro Tanaka
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Sat, 08 Mar 2014 14:41:42 +0900

Package: grep
Tags: patch

DFA may be build the superset of itself, which is the same as the itself
expect ANYCHAR, MBCSET and BACKREF are replaced CSET set full bits
followed by STAR, and mb_cur_max is equal to 1, by the patch.

For example, if given the pattern `a\(b\)c\1', the tokens of original
DFA and the superset is below.

  original: a b CAT c CAT BACKREF CAT

  superset: a b CAT c CAT CSET STAR CAT
            (Full bits are set to CSET.)

If a string doesn't matches the superset of DFA for a pattern, the
string will also never match original DFA.  By the way, matching with
the superset is very fast because it never has ANYCHAR, MBCSET and
BACKREF, which are very expensive, and its mb_cur_max is always equal
to 1.  Therefore, perfomance for matching with DFA may be able to be
dramatically improved without overhead with the superset.

I prepare following string to measure the performance.

    yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k

I run three tests with this patch (best-of-5 trials):

    env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
        real 1.77       user 1.23       sys 0.47
    env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
        real 1.86       user 1.35       sys 0.45
    env LC_ALL=C time -p src/grep '\(j\)\1d' k
        real 1.92       user 1.40       sys 0.48

Back out that commit (temporarily), recompile and rerun the experiment:

    env LC_ALL=ja_JP.eucJP time -p src/grep -i foobar k
        real 27.21      user 21.15      sys 5.36
    env LC_ALL=en_US.UTF-8 time -p src/grep -i 'j[a-c]d' k
        real 96.35      user 429.80     sys 57.37
        78.57user 15.16system 1:36.35elapsed 97%CPU (0avgtext+0avgdata 
3296maxresident)k
    env LC_ALL=C time -p src/grep '\(j\)\1d' k
        real 502.32     user 429.80     sys 57.37

Norihiro

Attachment: patch.txt
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]