[PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
From: Paolo Bonzini
Subject: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Date: Sun, 14 Mar 2010 16:35:14 +0100
The slow combination of -i and MB_CUR_MAX > 1 requires case conversion
ahead of time. Avoid doing this repeatedly when many matches succeed by
matching line by line. Together with the previous changes, this fixes
https://savannah.gnu.org/bugs/?29117 and
https://savannah.gnu.org/bugs/?14472.
* NEWS: Document new speedup.
* src/grep.c (do_execute): New.
(grepbuf): Use it.
---
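The line-by-line strategy described above can be sketched outside of
grep as follows. This is only an illustration under stated assumptions,
not the code in this patch: matcher_fn, find_ab and match_by_line are
hypothetical names, and the toy matcher stands in for grep's execute.
The sketch shows the two steps that matter: splitting the buffer at the
end-of-line byte with memchr, and translating the line-relative offset
returned by the matcher back into a buffer-relative one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher type: returns the offset of the match within
   BUF and stores its length in *MATCH_SIZE, or returns (size_t) -1
   if there is no match.  It stands in for grep's execute.  */
typedef size_t (*matcher_fn) (const char *buf, size_t size,
                              size_t *match_size);

/* Toy matcher (an assumption for this sketch): find the first "ab".  */
static size_t
find_ab (const char *buf, size_t size, size_t *match_size)
{
  size_t i;

  for (i = 0; i + 1 < size; i++)
    if (buf[i] == 'a' && buf[i + 1] == 'b')
      {
        *match_size = 2;
        return i;
      }
  return (size_t) -1;
}

/* Feed BUF to MATCH one line at a time, so that any per-call
   preprocessing only ever sees a single line.  Return the offset of
   the first match relative to BUF, or (size_t) -1.  */
static size_t
match_by_line (matcher_fn match, const char *buf, size_t size,
               char eol, size_t *match_size)
{
  const char *line_next = buf;
  const char *lim = buf + size;

  assert (buf != NULL && match_size != NULL);

  while (line_next < lim)
    {
      const char *line_buf = line_next;
      const char *line_end = memchr (line_buf, eol, lim - line_buf);
      size_t result;

      /* Keep the terminator with its line; the last line may lack one.  */
      line_next = line_end ? line_end + 1 : lim;

      result = match (line_buf, line_next - line_buf, match_size);
      if (result != (size_t) -1)
        return (line_buf - buf) + result;   /* line-relative -> buffer-relative */
    }

  return (size_t) -1;
}
```

A caller would invoke it as match_by_line (find_ab, text, strlen (text),
'\n', &len). The design point is that a matcher with expensive setup per
call then pays that cost per line rather than per remaining buffer.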
NEWS | 6 ++++++
src/grep.c | 40 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/NEWS b/NEWS
index a2db324..1a88d21 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU grep NEWS -*- outline -*-
* Noteworthy changes in release ?.? (????-??-??) [?]
+** Speed improvements
+
+ grep is much faster on multibyte character sets, especially (but not
+ limited to) UTF-8 character sets. The speed improvement is also very
+ pronounced with case-insensitive matches.
+
** Bug fixes
grep -i with a character class would malfunction in multi-byte locales.
diff --git a/src/grep.c b/src/grep.c
index f1d341a..19e04e2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1025,6 +1025,42 @@ prtext (char const *beg, char const *lim, int *nlinesp)
used = 1;
}
+static EXECUTE_RET do_execute EXECUTE_ARGS
+{
+ const char *line_buf, *line_end, *line_next;
+ size_t result = (size_t) -1;
+
+ /* -i is a real turtle with multibyte character sets, so match
+ line-by-line.
+
+ FIXME: this is just an ugly workaround, and it doesn't really
+ belong here. Also, PCRE is always using this same per-line
+ matching algorithm. Either we fix -i, or we should refactor
+ this code; for example, we could add another function pointer
+ to struct matcher to split the buffer passed to execute. It would
+ perform the memchr if line-by-line matching is necessary, or just
+ return buf + size otherwise. */
+ if (MB_CUR_MAX == 1 || !match_icase)
+ return execute(buf, size, match_size, start_ptr);
+
+ for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+ {
+ line_buf = line_next;
+ line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+ if (line_end == NULL)
+ line_next = line_end = buf + size;
+ else
+ line_next = line_end + 1;
+
+ if (start_ptr && start_ptr >= line_end)
+ continue;
+
+ result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+ }
+
+ return result == (size_t)-1 ? result : (line_buf - buf) + result;
+}
+
/* Scan the specified portion of the buffer, matching lines (or
between matching lines if OUT_INVERT is true). Return a count of
lines printed. */
@@ -1038,8 +1074,8 @@ grepbuf (char const *beg, char const *lim)
nlines = 0;
p = beg;
- while ((match_offset = execute(p, lim - p, &match_size,
- NULL)) != (size_t) -1)
+ while ((match_offset = do_execute(p, lim - p, &match_size,
+ NULL)) != (size_t) -1)
{
char const *b = p + match_offset;
char const *endp = b + match_size;
--
1.6.6.1