[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Regular expressions with high or nested repetition counters
From: |
Dag Hovland |
Subject: |
Regular expressions with high or nested repetition counters |
Date: |
Mon, 03 May 2010 09:51:36 +0200 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Thunderbird/3.0.4 |
Hi!
I am a PhD student doing some research in regular expressions. I have
implemented an algorithm for matching a subclass of the regular
expressions with "numerical constraints", that is, expressions like
"a{10,100}". My starting point was the problems that can occur when
using large numbers, or nested constraints, as for example in the command:
grep -E "([0-9]{1,2}h([1-5]?[0-9]m([1-5]?[0-9]s){1,60}){1,60}){0,100}"
which consumes over 2 gb of memory on my computer.
from the grep man-page:
"Large repetition counts in the {n,m} construct may cause grep to use
lots of memory."
An article describing the algorithm is available at
http://hdl.handle.net/1956/3628
A C-implementation is available at
http://www.ii.uib.no/~dagh/facgrep
The implementation is probably buggy, as I am not very good in C, but it
works for simple examples. On the example above, procps reports that it
uses around 2 megabytes of memory.
I wondered if you are interested in having this kind of functionality in
GNU grep? Note that for the majority of normal/simple regexps, grep does
better than my program above. Also, it only handles a (well-defined)
subclass of regexps. So we would need to find a nice way to choose which
algorithm to use when.
Best regards,
Dag Hovland
smime.p7s
Description: S/MIME Cryptographic Signature
- Regular expressions with high or nested repetition counters,
Dag Hovland <=