[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Patch] faster fnmatch
From: |
Bruno Haible |
Subject: |
Re: [Patch] faster fnmatch |
Date: |
Fri, 1 May 2009 14:51:48 +0200 |
User-agent: |
KMail/1.9.9 |
Hello Ondrej,
Nice work already!
> *abc*
> en_US.UTF-8
> user: 2.71
> user: 0.23
> ...
> *ab[a-z]*
> en_US.UTF-8
> user: 2.85
> user: 3.32
> C
> user: 0.86
> user: 1.25
Hmm, well, users are expecting a speedup through your module always. If it
some (not too rare) cases the glibc fnmatch is faster, it's difficult to
convince people to use your function.
> Still I assume UTF8 or singlebyte encoding otherwise I fallback.
The general and hard case is the multibyte encoding case, where the encoding
is something like GB18030, BIG5, or similar. Single-byte encodings may be
worth optimizing for, because the C locale on glibc systems uses a single-byte
encoding. Whether UTF-8 is worth special-casing, depends on the complexity.
> I could generalize this to wide characters but if I dont know any character
> is single wchar I cannot do much (UTF16 wchar in windows).
It's OK to pretend that every wchar_t is a single character. I.e. you can
assume that when you use mbstowcs on a string, you get an array of characters.
Yes, on Windows, wchar_t is UTF-16 probably, but in practice this does not
make much of a difference, because the Unicode characters in the range
U+10000..U+10FFFF are not among the "interesting" characters (patterns that
contain such a character inside brackets are rare) and because they occur
rarely enough.
> + This program is free software; you can redistribute it and/or modify
> + it under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 2, or (at your option)
> + any later version.
> +
> + This program is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + GNU General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with this program; if not, write to the Free Software Foundation,
> + Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */
For code in gnulib, please use the same copyright header as in many other
files, namely GPLv3+ with a pointer to the FSF's web site rather than with
its postal address.
> +int fnmatch_exec (fnmatch_state * pattern, const char *string);
Could this function prototype also be
int fnmatch_exec (const fnmatch_state * pattern, const char *string);
? If not, this means fnmatch_exec modifies the compiled state, which
makes it not multithread-safe to use the same fnmatch_state in multiple
threads?
> +struct _fnmatch_state;
> +typedef struct _fnmatch_state fnmatch_state;
Is it really a "state", i.e. a data structure that is modified over and
over again? I would have expected a compiled pattern data structure, that
is created by fnmatch_compile and then left unchanged until fnmatch_free.
> +#include <stdint.h>
> +
> +#define _GNU_SOURCE
"#define GNU_SOURCE" has no effect after a system include file such as
<stdint.h>
was already included. Also, "#define _GNU_SOURCE" has no effect on Solaris.
For this reason, gnulib has a module 'extensions' which does the right thing
(and which you're already using).
> +#define CHARSPERINT 4
For better portability, write this as
#define CHARSPERINT (sizeof (int) / sizeof (char))
> + uint32_t hash[8];
For better portability, write (UCHAR_MAX / 32) + 1 instead of 8.
> +int
> +ascii_compatible_encoding (char *c)
Could this function be 'static'? And take a 'const char *' argument?
> +void
> +strfold (const char *str, char *buf)
> +{
> + while (*str)
> + *buf++ = tolower (*str++);
tolower() is only useful in single-byte locales. For support of
multibyte locales, gnulib offers the functions mbmemcasecmp, mbscasecmp,
mbspcasecmp.
> + for (pass = 0; pass < 2; pass++)
> + { // two passes in first we compute values
> we use
> + patsize = size; // in second to optimalize.
Not all C compilers support ISO C 99 / C++ style comments. Better use the old
style of comments, even if it means more typing.
> + case '[':;
> + int siz = parse_bracket (s + 1, pat, flags);
Declarations after statements also are only a C99 / C++ feature. For ANSI C,
you need to put braces around this block.
Like Ralf, I have no read through most of the code yet.
Bruno