[bug #29391] -i and utf8 slowness, speedup idea
From: Egmont Koblinger
Subject: [bug #29391] -i and utf8 slowness, speedup idea
Date: Wed, 31 Mar 2010 09:13:47 +0000
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100217 Firefox/3.5.8
URL:
<http://savannah.gnu.org/bugs/?29391>
Summary: -i and utf8 slowness, speedup idea
Project: grep
Submitted by: egmont
Submitted on: Wed 31 Mar 2010 09:13:46 AM GMT
Category: None
Severity: 3 - Normal
Item Group: None
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
[ followup of https://bugzilla.redhat.com/show_bug.cgi?id=194471 ]
The combination of --ignore-case and UTF-8 is very slow, even when no special
treatment is required for UTF-8. There's also a huge regression in speed
compared to Ubuntu Hardy's grep-2.5.3 with I-don't-know-what patches. Some
timing data:
$ # Create a 100MB file.
$ dd if=/dev/urandom of=data bs=1M count=100
$ # Ubuntu's grep-2.5.3, UTF-8: fast
$ time LC_ALL=en_US.UTF-8 /bin/fgrep -i foobar data
real 0m0.245s
user 0m0.128s
sys 0m0.112s
$ # Stock grep-2.6.2, 8-bit: fast
$ time LC_ALL=en_US fgrep -i foobar data
real 0m0.156s
user 0m0.084s
sys 0m0.072s
$ # Stock grep-2.6.2, UTF-8: very slow
$ time LC_ALL=en_US.UTF-8 fgrep -i foobar data
real 0m10.264s
user 0m10.049s
sys 0m0.080s
This is roughly a 40-60x slowdown relative to either fast case (about 10.3 s
versus 0.16-0.25 s of wall-clock time).
I understand that the combination of UTF-8 and ignore-case is a tricky
situation, and if I'm using the tr_TR.UTF-8 locale then sure, I'm willing to
pay this price for the correct handling of dotless i's.
Most of the time, however, I'm working with en_US.UTF-8 and grepping variable
names in source code and such, usually without any accents.
Grep could do the following:
It could look at the pattern, and check if the following conditions are all
true:
- no placeholder that could match a variable-length character (e.g. no "." in
the pattern) or other weird stuff
- only ASCII characters
- only characters whose old-fashioned ASCII upper/lowercase counterparts are
the same as the locale-aware upper/lowercase counterparts, that is, no "i" or
"I" in the pattern if the locale is Turkish.
If all these are true, it could use whatever algorithm it's using for 8-bit
locales, because it will find the same matches. This would provide a 40-60x
speedup for a very common use case: case insensitively finding an English
word.
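
To make the three conditions above concrete, here is a rough sketch of what
such a pre-check might look like. It is not taken from grep's source: the
function name pattern_allows_byte_matcher, the fixed_strings flag and the list
of metacharacters are all made up for illustration, and the caller is assumed
to have already called setlocale(LC_ALL, "").

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of grep: returns true when a
   case-insensitive search for PAT satisfies the three conditions
   listed above, so the byte-oriented (8-bit) matcher could be used.
   FIXED_STRINGS is nonzero for fgrep-style literal patterns.  */
static bool
pattern_allows_byte_matcher (const char *pat, int fixed_strings)
{
  /* Conservative set of regex metacharacters (an assumption made
     for this sketch, not grep's actual syntax table).  */
  static const char meta[] = ".[]\\*+?{}()|^$";

  for (const unsigned char *p = (const unsigned char *) pat; *p; p++)
    {
      unsigned char c = *p;

      /* Only ASCII characters are allowed.  */
      if (c >= 0x80)
        return false;

      /* No construct that could match a multibyte character.  */
      if (!fixed_strings && strchr (meta, c))
        return false;

      /* The locale-aware case mapping must agree with plain ASCII;
         this is what rejects "i" and "I" in a Turkish locale.  */
      unsigned char up = (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
      unsigned char lo = (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
      wint_t wc = btowc (c);
      if (wc == WEOF
          || towupper (wc) != btowc (up)
          || towlower (wc) != btowc (lo))
        return false;
    }
  return true;
}
```

With something like this, "fgrep -i foobar" in en_US.UTF-8 would pass the
check and could take the fast path, while a pattern containing "." (in regex
mode), a non-ASCII byte, or "i" under tr_TR.UTF-8 would fall back to the
current multibyte code.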
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?29391>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/