[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a
From: |
Thomas Wolff |
Subject: |
Re: 1.7] BUG - GREP slows to a crawl with large number of matches on a single file |
Date: |
Fri, 06 Nov 2009 16:00:00 +0100 |
User-agent: |
Thunderbird 2.0.0.23 (Windows/20090812) |
Corinna Vinschen wrote:
On Nov 6 06:56, Eric Blake wrote:
According to Corinna Vinschen on 11/6/2009 6:51 AM:
The problem *is* with grep (and sed), however, because there is no
good reason that UTF-8 should give us a penalty of being 100times
slower on most search operations, this is just poor programming of
grep and sed.
The penalty on Linux is much smaller, about 15-20%.
On my Linux system, the penalty is a factor between 6 and 8.
It looks like
grep is calling malloc for every input line if MB_CUR_MAX is > 1.
Then it evaluates for each byte in the line whether the byte is a
single byte or the start of a multibyte sequence using mbrtowc on
every charatcer on the input line. Then, for each potential match,
it checks if it's the start byte of a multibyte sequence and ignores
all other matches. Eventually, it calls free, and the game starts
over for the next line.
Adding bug-grep, since this slowdown caused by additional mallocs is
definitely the sign of a poor algorithm that could be improved by reusing
existing buffers.
I created a simple testcase:
==== SNIP ===
...
==== SNAP ====
I extended your test program to demonstrate the inefficiency of the
standard mbrtowc function. Instead I use a function from my editor
(mined) to extract a Unicode character from a UTF-8 sequence. This is
the simple case only, not converting character sets other than UTF-8 but
that's the same thing mbrtowc does in the sample invocation. Program
attached. Results below.
Under Cygwin (tcsh time output):
$ setenv LANG en_US.UTF-8
$ time ./mb 1000000 1 0
with malloc: 1, with mbrtowc: 0
0.328u 0.031s 0:00.34 102.9% 0+0k 0+0io 1834pf+0w
$ time ./mb 1000000 0 1
with malloc: 0, with mbrtowc: 1
1.921u 0.092s 0:02.09 96.1% 0+0k 0+0io 1827pf+0w
$ time ./mb 1000000 1 1
with malloc: 1, with mbrtowc: 1
2.062u 0.140s 0:02.15 102.3% 0+0k 0+0io 1839pf+0w
Running on the same CPU under Linux:
$ setenv LANG en_US.UTF-8
$ time ./mb 1000000 1 0
with malloc: 1, with mbrtowc: 0
0.088u 0.004s 0:00.09 88.8% 0+0k 0+0io 0pf+0w
$ time ./mb 1000000 0 1
with malloc: 0, with mbrtowc: 1
1.836u 0.000s 0:01.85 98.9% 0+0k 0+0io 0pf+0w
$ time ./mb 1000000 1 1
with malloc: 1, with mbrtowc: 1
1.888u 0.000s 0:01.93 97.4% 0+0k 0+0io 0pf+0w
So, while Linux is definitely faster, the number are still comparable
for 1 million iterations. That still doens't explain why grep is a
multitude slower when using UTF-8 as charset.
Results of mbrtowc vs. utftouni on Linux:
address@hidden:~/tmp: locale charmap
UTF-8
address@hidden:~/tmp: time ./uu 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0
real 0m2.897s
user 0m2.836s
sys 0m0.012s
address@hidden:~/tmp: time ./uu 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1
real 0m0.030s
user 0m0.028s
sys 0m0.000s
address@hidden:~/tmp:
Results of mbrtowc vs. utftouni on cygwin:
address@hidden:/H/tools: time ./uu.exe 1000000 0 1 0
with malloc: 0, with mbrtowc: 1, with utftouni: 0
real 0m3.034s
user 0m2.974s
sys 0m0.030s
address@hidden:/H/tools: time ./uu.exe 1000000 0 0 1
with malloc: 0, with mbrtowc: 0, with utftouni: 1
real 0m0.110s
user 0m0.070s
sys 0m0.030s
address@hidden:/H/tools:
The conclusion is, as long as calling mbrtowc is as inefficient, a
program caring about performance should not use it.
Thomas
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <string.h>
void utf8_info (u, length, ucs)
char * u;
int * length;
unsigned long * ucs;
{
char * textpoi = u;
unsigned char c = * textpoi;
int utfcount;
unsigned long unichar;
if ((c & 0x80) == 0x00) {
utfcount = 1;
unichar = c;
} else if ((c & 0xE0) == 0xC0) {
utfcount = 2;
unichar = c & 0x1F;
} else if ((c & 0xF0) == 0xE0) {
utfcount = 3;
unichar = c & 0x0F;
} else if ((c & 0xF8) == 0xF0) {
utfcount = 4;
unichar = c & 0x07;
} else if ((c & 0xFC) == 0xF8) {
utfcount = 5;
unichar = c & 0x03;
} else if ((c & 0xFE) == 0xFC) {
utfcount = 6;
unichar = c & 0x01;
} else if (c == 0xFE) {
/* illegal UTF-8 code */
utfcount = 1;
unichar = 0;
} else if (c == 0xFF) {
/* illegal UTF-8 code */
utfcount = 1;
unichar = 0;
} else {
/* illegal UTF-8 sequence character */
utfcount = 1;
unichar = 0;
}
* length = utfcount;
utfcount --;
textpoi ++;
while (utfcount > 0 && (* textpoi & 0xC0) == 0x80) {
unichar = (unichar << 6) | (* textpoi & 0x3F);
utfcount --;
textpoi ++;
}
if (utfcount > 0) {
/* too short UTF-8 sequence */
unichar = 0;
* length -= utfcount;
}
* ucs = unichar;
}
void utftouni (wchar_t * wpoi, char * s)
{
unsigned long c;
int l;
int i = 0;
while (* s) {
utf8_info (s, & l, & wpoi [i++]);
s += l;
}
}
int main (int argc, char **argv)
{
char in[] = "The quick brown fox jumps over the lazy dog";
int line, i;
mbstate_t mbs;
size_t mbclen;
size_t size = sizeof (in);
wchar_t wc;
int lines = argc > 1 ? atoi (argv[1]) : 1000;
int do_malloc = 1;
int do_mbrtowc = 1;
int do_utftouni = 1;
if (argc > 2)
do_malloc = atoi (argv[2]);
if (argc > 3)
do_mbrtowc = atoi (argv[3]);
if (argc > 4)
do_utftouni = atoi (argv[4]);
printf ("with malloc: %d, with mbrtowc: %d, with utftouni: %d\n", do_malloc,
do_mbrtowc, do_utftouni);
memset (&mbs, 0, sizeof mbs);
for (line = 0; line < lines; ++line)
{
char *x;
if (do_malloc) x = malloc (size);
if (do_mbrtowc)
for (i = 0; i < size; i += mbclen)
if ((int)(mbclen = mbrtowc(&wc, in + i, size - i, &mbs)) <= 0)
break;
if (do_utftouni)
utftouni (&wc, in);
if (do_malloc) free (x);
}
return 0;
}