[PATCH 17/17] grep: match multibyte charsets line-by-line when using -i
From: Paolo Bonzini
Subject: [PATCH 17/17] grep: match multibyte charsets line-by-line when using -i
Date: Fri, 12 Mar 2010 18:49:18 +0100

The slow ("turtle") combination of -i and MB_CUR_MAX > 1 requires case
conversion ahead of time. Match line by line to avoid repeating the
conversion when many matches succeed. Together
with the previous changes, this fixes https://savannah.gnu.org/bugs/?29117
and https://savannah.gnu.org/bugs/?14472.
* src/grep.c (do_execute): New.
(grepbuf): Use it.
---
src/grep.c | 40 ++++++++++++++++++++++++++++++++++++++--
1 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/src/grep.c b/src/grep.c
index f1d341a..1f73c70 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1025,6 +1025,42 @@ prtext (char const *beg, char const *lim, int *nlinesp)
used = 1;
}
+EXECUTE_RET do_execute EXECUTE_ARGS
+{
+ const char *line_buf, *line_end, *line_next;
+ size_t result = (size_t) -1;
+
+ /* -i is a real turtle with multibyte character sets, so match
+ line-by-line.
+
+ FIXME: this is just an ugly workaround, and it doesn't really
+ belong here. Also, PCRE is always using this same per-line
+ matching algorithm. Either we fix -i, or we should refactor
+ this code---for example, we could add another function pointer
+ to struct matcher to split the buffer passed to execute. It would
+ perform the memchr if line-by-line matching is necessary, or just
+ return buf + size otherwise. */
+ if (MB_CUR_MAX == 1 || !match_icase)
+ return execute(buf, size, match_size, start_ptr);
+
+ for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+ {
+ line_buf = line_next;
+ line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+ if (line_end == NULL)
+ line_next = line_end = buf + size;
+ else
+ line_next = line_end + 1;
+
+ if (start_ptr && start_ptr >= line_end)
+ continue;
+
+ result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+ }
+
+ return result == (size_t)-1 ? result : (line_buf - buf) + result;
+}
+
/* Scan the specified portion of the buffer, matching lines (or
between matching lines if OUT_INVERT is true). Return a count of
lines printed. */
@@ -1038,8 +1074,8 @@ grepbuf (char const *beg, char const *lim)
nlines = 0;
p = beg;
- while ((match_offset = execute(p, lim - p, &match_size,
- NULL)) != (size_t) -1)
+ while ((match_offset = do_execute(p, lim - p, &match_size,
+ NULL)) != (size_t) -1)
{
char const *b = p + match_offset;
char const *endp = b + match_size;
--
1.6.6
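
For readers who want to try the splitting logic outside of grep, here is
a minimal standalone sketch of the same idea: cut the buffer at each
end-of-line byte with memchr, hand one line at a time to the matcher, and
turn the line-relative offset back into a buffer-relative one. The names
toy_execute, match_by_line and NO_MATCH are hypothetical and not part of
grep. The toy matcher only folds ASCII case and reports the position of
the match itself, which is a simplification. The start_ptr handling from
the patch is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NO_MATCH ((size_t) -1)

/* Hypothetical stand-in for grep's matcher: ASCII case-insensitive
   search for NEEDLE in BUF[0..SIZE).  Returns the offset of the match,
   or NO_MATCH, and stores the match length in *MATCH_SIZE.  */
static size_t
toy_execute (const char *buf, size_t size, const char *needle,
             size_t *match_size)
{
  size_t n = strlen (needle);
  size_t i, j;

  if (n == 0 || n > size)
    return NO_MATCH;
  for (i = 0; i + n <= size; i++)
    {
      for (j = 0; j < n; j++)
        if (tolower ((unsigned char) buf[i + j])
            != tolower ((unsigned char) needle[j]))
          break;
      if (j == n)
        {
          *match_size = n;
          return i;
        }
    }
  return NO_MATCH;
}

/* Feed the matcher one line at a time, stopping at the first match.
   The offset returned is relative to BUF, not to the matching line.  */
static size_t
match_by_line (const char *buf, size_t size, const char *needle,
               size_t *match_size)
{
  const char *end = buf + size;
  const char *line = buf;

  assert (buf != NULL && needle != NULL && match_size != NULL);
  while (line < end)
    {
      /* A final line without a trailing newline ends at END.  */
      const char *eol = memchr (line, '\n', (size_t) (end - line));
      const char *next = eol ? eol + 1 : end;
      size_t off = toy_execute (line, (size_t) (next - line), needle,
                                match_size);

      if (off != NO_MATCH)
        return (size_t) (line - buf) + off;
      line = next;
    }
  return NO_MATCH;
}
```

Calling match_by_line on "alpha\nBeta gamma\ndelta\n" with the needle
"GAMMA" finds the match on the second line and returns its offset from
the start of the whole buffer. Any per-call setup cost in the matcher is
then paid for one line at a time rather than for the whole remaining
buffer.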