[Regexp] performance improvements for gnu.regexp 1.1.4

gnu-regexp-users

From:	dirk bergstrom
Subject:	[Regexp] performance improvements for gnu.regexp 1.1.4
Date:	Tue, 27 May 2003 19:23:33 -0700

we need java-based gnu-style regular expressions for a project (which
involves regexes in oracle queries).  calling out to C is too slow, so java
seemed to be the best path.  since we need the same behavior as the standard
C gnu regex package, we chose gnu.regexp.

however, benchmarking (http://tusker.org/regex/regex_benchmark.html) shows
gnu.regexp to be *much* slower than other java regex packages -- almost
thirty times slower than java.util.regex.

so i sat down with a profiler (http://ejp.sourceforge.net/), and found a few
things to improve upon:

*) calls to REMatch.clone() took a lot of time.  replacing them with a
copy-constructor made a *huge* difference.  i suspect that there's a lot of
overhead in clone() that's not needed in this case.

*) i replaced the positions Vector in RETokenRepeated with a typed array
(plus some code to reallocate it as needed, just as Vector does).  this saved
a fair amount of time, though not as much as the clone stuff.

*) several for loops evaluated foo.size() (or read foo.length) in the
comparison clause, like so:
    for (int i=0; i<foo.size(); i++)
i changed them to look like this:
    for (int i=0, j=foo.size(); i<j; i++)
this saves a call to size() for every iteration.  not substantial, but it helps.

*) i tried to eliminate calls to String.charAt() in
CharIndexedString.charAt() by replacing the private String with a char[]
(which i got by using toCharArray() in the constructor).  however, this had
little or no effect on speed.  String.charAt() is a pretty low-overhead
method...
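
As an illustration of the first two items above, here is a minimal Java sketch. It is not the actual patch: MatchState is a hypothetical stand-in for REMatch (the real class has its own fields), and IntList is a hypothetical growable int[] of the kind that could replace a Vector of boxed positions. The copy constructor copies fields directly, while the clone() version goes through Object.clone() and the checked-exception handling that comes with it.

```java
class CopySketch {
    // Hypothetical stand-in for REMatch; field names are assumptions.
    static final class MatchState implements Cloneable {
        int index;
        int[] start;
        int[] end;

        MatchState(int groups) {
            start = new int[groups];
            end = new int[groups];
        }

        // Copy constructor: direct field assignment plus array copies.
        MatchState(MatchState other) {
            index = other.index;
            start = other.start.clone();
            end = other.end.clone();
        }

        // clone()-based copy, shown only for comparison.
        @Override
        public MatchState clone() {
            try {
                MatchState copy = (MatchState) super.clone();
                copy.start = start.clone();
                copy.end = end.clone();
                return copy;
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    // Hypothetical growable int array: no boxing, no synchronized calls.
    static final class IntList {
        private int[] data = new int[4];
        private int size;

        void add(int value) {
            if (size == data.length) {
                // Double the capacity when full, much as Vector does by default.
                int[] bigger = new int[data.length * 2];
                System.arraycopy(data, 0, bigger, 0, size);
                data = bigger;
            }
            data[size++] = value;
        }

        int get(int i) {
            if (i < 0 || i >= size) {
                throw new IndexOutOfBoundsException("index " + i);
            }
            return data[i];
        }

        int size() {
            return size;
        }
    }

    public static void main(String[] args) {
        MatchState original = new MatchState(2);
        original.start[0] = 3;
        MatchState copy = new MatchState(original);
        copy.start[0] = 7; // the copy owns its own arrays
        System.out.println(original.start[0] + " " + copy.start[0]);

        IntList positions = new IntList();
        for (int i = 0; i < 10; i++) {
            positions.add(i * i);
        }
        System.out.println(positions.size() + " " + positions.get(9));
    }
}
```

Whether either change pays off in a given JVM is something only a profiler run can confirm; the sketch shows the shape of the change, not its measured effect.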

i've posted screenshots of the profiler output (before, after, and for
comparison, a profile of java.util.regex), and a clean copy of the diffs, here:

http://otisbean.com/gnu-regexp/

i ran the testsuite in gnu.regexp.util.Tests on the new code, and everything
passed.

my tweaked code is almost three times faster than the 1.1.4 code (on the
benchmark cited above).  unfortunately, it's still 10x slower than
java.util.regex...

i can't see much more low-hanging fruit.  the biggest time-sink is calls to
the copy-constructor in REMatch.  i think that the only way gnu.regexp is
going to get a lot faster is a new architecture that doesn't rely on copying
REMatch objects.  unfortunately, that's beyond my current available time.

hopefully my patches will be valuable.

i'm not subscribed to the list, so please CC me on any replies.  thanks.

-- 
Dirk Bergstrom               address@hidden
_____________________________________________
Juniper Networks Inc.,          Computer Geek
Tel: 408.745.3182           Fax: 408.745.8905
