[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
grep 2.5.1: dfa matcher lost ability to handle embedded newlines
From: |
Aharon Robbins |
Subject: |
grep 2.5.1: dfa matcher lost ability to handle embedded newlines |
Date: |
Wed, 14 Jan 2004 15:03:06 +0200 |
Greetings.
With a new regex matcher in GLIBC, for gawk 3.1.2 and 3.1.3, I had
removed gawk's copy of the dfa matcher, since the new one supposedly
worked better than the old regex.c.
I found though, that the new one was much slower, and for my development
version, I am reinstating the use of the dfa matcher from grep into gawk.
In a sudden burst of energy, I decided to upgrade Gawk's version of
the dfa matcher to match what was in the current version of grep, 2.5.1.
In the process, I discovered that the 2.5.x version of dfa can no longer
match newlines embedded in the data to be matched. While this is OK
for {e,f,}grep, which are line-oriented, it's bad news for gawk.
I went through and manually merged in the changes from grep's dfa to
gawk's dfa. I found that I had a few I18N changes from Isamu that
never propogated into grep.
Once I had things working for my development version of gawk, I went
back into grep, and brought it into sync with my code. Below is a diff
that adds the few I18N changes, and also restores the newline-matching
ability. I modified search.c to call the new version.
After `./configure --with-included-regex', the modified version of grep
passes all the tests in `make test.'
I am bcc-ing the address listed for grep's current maintainer, in the hope
that this code is being maintained; I'm happy to work with you to keep
dfa.[hc] syncronized and usable for both of us.
Much thanks,
Arnold Robbins
----------- cut here ---------------
2004-01-14 Arnold D. Robbins <address@hidden>
Reinstated ability of dfa matcher to match embedded newlines. This is
needed for GNU awk (gawk).
* dfa.h (struct dfa): New member `newlines'.
(dfaexec): Revised prototype and comment.
* dfa.c (VMS): Include the right file.
(ENABLE_NLS): Do The Right Thing if not defined.
(xcalloc, xmalloc, xrealloc, prtok, tstbit, setbit, clrbit, copyset,
zeroset, notset, equal, charclass_index, looking_at, lex, addtok,
atom, nsubtoks, copytoks, closure, branch, copy, insert, merge,
delete, state_index, build_state, build_state_zero, icatalloc,
icpyalloc, istrstr, ifree, freelist, enlist, comsubs, addlists, inboth):
Add prototypes.
(fetch_wc, parse_bracket_exp_mb): I18N changes from Isamu, see gawk
ChangeLog.
(build_state, build_state_zero, realloc_trans_if_necessary, dfafree):
Handle `newlines' member.
(SKIP_REMAINS_MB_IF_INITIAL_STATE): Add various casts, return NULL.
(dfaexec): Protoype and matching loop revamped per earlier version of
dfa.c
(dfainit): Zero out `realtrans', `fails', `newlines' and `success'
members of *d.
* search.c (EGexecute): Revise call of `dfaexec' and pointer management.
--- dfa.h.orig 2001-03-07 06:11:27.000000000 +0200
+++ dfa.h 2004-01-14 13:48:48.000000000 +0200
@@ -1,5 +1,5 @@
/* dfa.h - declarations for GNU deterministic regexp compiler
- Copyright (C) 1988, 1998 Free Software Foundation, Inc.
+ Copyright (C) 1988, 1998, 2002 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -363,6 +363,13 @@
on a state that potentially could do so. */
int *success; /* Table of acceptance conditions used
in
dfaexec and computed in build_state. */
+ int *newlines; /* Transitions on newlines. The entry for a
+ newline in any transition table is always
+ -1 so we can count lines without wasting
+ too many cycles. The transition for a
+ newline is stored separately and handled
+ as a special case. Newline is also used
+ as a sentinel at the end of the buffer. */
struct dfamust *musts; /* List of strings, at least one of which
is known to appear in any r.e. matching
the dfa. */
@@ -397,13 +404,18 @@
extern void dfacomp PARAMS ((char const *, size_t, struct dfa *, int));
/* Execute the given struct dfa on the buffer of characters. The
- last byte of the buffer must equal the end-of-line byte.
- The final argument points to a flag that will
+ first char * points to the beginning, and the second points to the
+ first character after the end of the buffer, which must be a writable
+ place so a sentinel end-of-buffer marker can be stored there. The
+ second-to-last argument is a flag telling whether to allow newlines to
+ be part of a string matching the regexp. The next-to-last argument,
+ if non-NULL, points to a place to increment every time we see a
+ newline. The final argument, if non-NULL, points to a flag that will
be set if further examination by a backtracking matcher is needed in
order to verify backreferencing; otherwise the flag will be cleared.
- Returns (size_t) -1 if no match is found, or the offset of the first
+ Returns NULL if no match is found, or a pointer to the first
character after the first & shortest matching string in the buffer. */
-extern size_t dfaexec PARAMS ((struct dfa *, char const *, size_t, int *));
+extern char *dfaexec PARAMS ((struct dfa *, char const *, char *, int, int *,
int *));
/* Free the storage held by the components of a struct dfa. */
extern void dfafree PARAMS ((struct dfa *));
--- dfa.c.orig 2001-09-26 18:57:55.000000000 +0200
+++ dfa.c 2004-01-14 14:40:30.000000000 +0200
@@ -1,5 +1,5 @@
/* dfa.c - deterministic extended regexp routines for GNU
- Copyright 1988, 1998, 2000 Free Software Foundation, Inc.
+ Copyright 1988, 1998, 2000, 2002 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -26,7 +26,11 @@
#include <ctype.h>
#include <stdio.h>
+#ifndef VMS
#include <sys/types.h>
+#else
+#include <stddef.h>
+#endif
#ifdef STDC_HEADERS
#include <stdlib.h>
#else
@@ -99,8 +103,10 @@
host does not conform to Posix. */
#define ISASCIIDIGIT(c) ((unsigned) (c) - '0' <= 9)
+/* Don't use gettext if ENABLE_NLS is not defined */
/* If we (don't) have I18N. */
/* glibc defines _ */
+#ifdef ENABLE_NLS
#ifndef _
# ifdef HAVE_LIBINTL_H
# include <libintl.h>
@@ -111,6 +117,9 @@
# define _(Str) (Str)
# endif
#endif
+#else
+# define _(Str) (Str)
+#endif
#include "regex.h"
#include "dfa.h"
@@ -125,7 +134,47 @@
#endif
static void dfamust PARAMS ((struct dfa *dfa));
+
+static ptr_t xcalloc PARAMS ((size_t n, size_t s));
+static ptr_t xmalloc PARAMS ((size_t n));
+static ptr_t xrealloc PARAMS ((ptr_t p, size_t n));
+#ifdef DEBUG
+static void prtok PARAMS ((token t));
+#endif
+static int tstbit PARAMS ((unsigned b, charclass c));
+static void setbit PARAMS ((unsigned b, charclass c));
+static void clrbit PARAMS ((unsigned b, charclass c));
+static void copyset PARAMS ((charclass src, charclass dst));
+static void zeroset PARAMS ((charclass s));
+static void notset PARAMS ((charclass s));
+static int equal PARAMS ((charclass s1, charclass s2));
+static int charclass_index PARAMS ((charclass s));
+static int looking_at PARAMS ((const char *s));
+static token lex PARAMS ((void));
+static void addtok PARAMS ((token t));
+static void atom PARAMS ((void));
+static int nsubtoks PARAMS ((int tindex));
+static void copytoks PARAMS ((int tindex, int ntokens));
+static void closure PARAMS ((void));
+static void branch PARAMS ((void));
static void regexp PARAMS ((int toplevel));
+static void copy PARAMS ((position_set const *src, position_set *dst));
+static void insert PARAMS ((position p, position_set *s));
+static void merge PARAMS ((position_set const *s1, position_set const *s2,
position_set *m));
+static void delete PARAMS ((position p, position_set *s));
+static int state_index PARAMS ((struct dfa *d, position_set const *s,
+ int newline, int letter));
+static void build_state PARAMS ((int s, struct dfa *d));
+static void build_state_zero PARAMS ((struct dfa *d));
+static char *icatalloc PARAMS ((char *old, char *new));
+static char *icpyalloc PARAMS ((char *string));
+static char *istrstr PARAMS ((char *lookin, char *lookfor));
+static void ifree PARAMS ((char *cp));
+static void freelist PARAMS ((char **cpp));
+static char **enlist PARAMS ((char **cpp, char *new, size_t len));
+static char **comsubs PARAMS ((char *left, char *right));
+static char **addlists PARAMS ((char **old, char **new));
+static char **inboth PARAMS ((char **left, char **right));
static ptr_t
xcalloc (size_t n, size_t s)
@@ -414,7 +463,7 @@
/* This function fetch a wide character, and update cur_mb_len,
used only if the current locale is a multibyte environment. */
-static wchar_t
+static wint_t
fetch_wc (char const *eoferr)
{
wchar_t wc;
@@ -423,7 +472,7 @@
if (eoferr != 0)
dfaerror (eoferr);
else
- return -1;
+ return WEOF;
}
cur_mb_len = mbrtowc(&wc, lexptr, lexleft, &mbs);
@@ -459,7 +508,7 @@
static void
parse_bracket_exp_mb ()
{
- wchar_t wc, wc1, wc2;
+ wint_t wc, wc1, wc2;
/* Work area to build a mb_char_classes. */
struct mb_char_classes *work_mbc;
@@ -469,7 +518,7 @@
REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
dfa->mbcsets_alloc, dfa->nmbcsets + 1);
/* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
- We will update dfa->multibyte_prop in addtok(), because we can't
+ We will update dfa->multibyte_prop[] in addtok(), because we can't
decide the index in dfa->tokens[]. */
/* Initialize work are */
@@ -496,7 +545,7 @@
work_mbc->invert = 0;
do
{
- wc1 = -1; /* mark wc1 is not initialized". */
+ wc1 = WEOF; /* mark wc1 is not initialized". */
/* Note that if we're looking at some other [:...:] construct,
we just treat it as a bunch of ordinary characters. We can do
@@ -551,7 +600,7 @@
wt = wctype (str);
if (ch_classes_al == 0)
- MALLOC(work_mbc->ch_classes, wchar_t, ++ch_classes_al);
+ MALLOC(work_mbc->ch_classes, wctype_t, ++ch_classes_al);
REALLOC_IF_NECESSARY(work_mbc->ch_classes, wctype_t,
ch_classes_al,
work_mbc->nch_classes + 1);
@@ -586,7 +635,7 @@
work_mbc->coll_elems[work_mbc->ncoll_elems++] = elem;
}
}
- wc = -1;
+ wc = WEOF;
}
else
/* We treat '[' as a normal character here. */
@@ -600,7 +649,7 @@
wc = fetch_wc(("Unbalanced ["));
}
- if (wc1 == -1)
+ if (wc1 == WEOF)
wc1 = fetch_wc(_("Unbalanced ["));
if (wc1 == L'-')
@@ -630,17 +679,17 @@
}
REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
range_sts_al, work_mbc->nranges + 1);
- work_mbc->range_sts[work_mbc->nranges] = wc;
+ work_mbc->range_sts[work_mbc->nranges] = (wchar_t)wc;
REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
range_ends_al, work_mbc->nranges + 1);
- work_mbc->range_ends[work_mbc->nranges++] = wc2;
+ work_mbc->range_ends[work_mbc->nranges++] = (wchar_t)wc2;
}
- else if (wc != -1)
+ else if (wc != WEOF)
/* build normal characters. */
{
REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
work_mbc->nchars + 1);
- work_mbc->chars[work_mbc->nchars++] = wc;
+ work_mbc->chars[work_mbc->nchars++] = (wchar_t)wc;
}
}
while ((wc = wc1) != L']');
@@ -1130,7 +1179,7 @@
: (((cur_mb_index == 1)? 1 : 0) /* 1st-byte of multibyte char */
+ ((cur_mb_index == cur_mb_len)? 2 : 0)); /* last-byte */
else
- /* It may be unnecesssary, but it is safer to treat other
+ /* It may be unnecessary, but it is safer to treat other
symbols as singlebyte characters. */
dfa->multibyte_prop[dfa->tindex] = 3;
}
@@ -1951,7 +2000,7 @@
int state_letter; /* New state on a letter transition. */
static int initialized; /* Flag for static initialization. */
#ifdef MBS_SUPPORT
- int next_isnt_1st_byte = 0; /* Flag If we can't add state0. */
+ int next_isnt_1st_byte = 0; /* Flag if we can't add state0. */
#endif
int i, j, k;
@@ -2283,6 +2332,7 @@
d->trans = d->realtrans + 1;
REALLOC(d->fails, int *, d->tralloc);
REALLOC(d->success, int, d->tralloc);
+ REALLOC(d->newlines, int, d->tralloc);
while (oldalloc < d->tralloc)
{
d->trans[oldalloc] = NULL;
@@ -2290,7 +2340,9 @@
}
}
- /* Newline is a sentinel. */
+ /* Keep the newline transition in a special place so we can use it as
+ a sentinel. */
+ d->newlines[s] = trans[eolbyte];
trans[eolbyte] = -1;
if (ACCEPTING(s, *d))
@@ -2308,6 +2360,7 @@
d->trans = d->realtrans + 1;
CALLOC(d->fails, int *, d->tralloc);
MALLOC(d->success, int, d->tralloc);
+ MALLOC(d->newlines, int, d->tralloc);
build_state(0, d);
}
@@ -2326,13 +2379,13 @@
{ \
while (inputwcs[p - buf_begin] == 0 \
&& mblen_buf[p - buf_begin] > 0 \
- && p < buf_end) \
+ && (unsigned char const *)p < buf_end) \
++p; \
- if (p >= end) \
+ if ((char *)p >= end) \
{ \
free(mblen_buf); \
free(inputwcs); \
- return (size_t) -1; \
+ return NULL; \
} \
}
@@ -2351,6 +2404,7 @@
d->trans = d->realtrans + 1;
REALLOC(d->fails, int *, d->tralloc);
REALLOC(d->success, int, d->tralloc);
+ REALLOC(d->newlines, int, d->tralloc);
while (oldalloc < d->tralloc)
{
d->trans[oldalloc] = NULL;
@@ -2739,19 +2793,23 @@
/* Search through a buffer looking for a match to the given struct dfa.
Find the first occurrence of a string matching the regexp in the buffer,
- and the shortest possible version thereof. Return the offset of the first
- character after the match, or (size_t) -1 if none is found. BEGIN points to
- the beginning of the buffer, and SIZE is the size of the buffer. If SIZE
- is nonzero, BEGIN[SIZE - 1] must be a newline. BACKREF points to a place
+ and the shortest possible version thereof. Return a pointer to the first
+ character after the match, or NULL if none is found. Begin points to
+ the beginning of the buffer, and end points to the first character after
+ its end. We store a newline in *end to act as a sentinel, so end had
+ better point somewhere valid. Newline is a flag indicating whether to
+ allow newlines to be in the matching string. If count is non-
+ NULL it points to a place we're supposed to increment every time we
+ see a newline. Finally, if backref is non-NULL it points to a place
where we're supposed to store a 1 if backreferencing happened and the
match needs to be verified by a backtracking matcher. Otherwise
we store a 0 in *backref. */
-size_t
-dfaexec (struct dfa *d, char const *begin, size_t size, int *backref)
+char *
+dfaexec (struct dfa *d, char const *begin, char *end,
+ int newline, int *count, int *backref)
{
- register int s; /* Current state. */
+ register int s, s1, tmp; /* Current state. */
register unsigned char const *p; /* Current input character. */
- register unsigned char const *end; /* One past the last input character. */
register int **trans, *t; /* Copy of d->trans so it can be optimized
into a register. */
register unsigned char eol = eolbyte; /* Likewise for eolbyte. */
@@ -2771,10 +2829,10 @@
if (! d->tralloc)
build_state_zero(d);
- s = 0;
+ s = s1 = 0;
p = (unsigned char const *) begin;
- end = p + size;
trans = d->trans;
+ *end = eol;
#ifdef MBS_SUPPORT
if (MB_CUR_MAX > 1)
@@ -2784,17 +2842,16 @@
buf_end = end;
/* initialize mblen_buf, and inputwcs. */
- MALLOC(mblen_buf, unsigned char, end - (unsigned char const *)begin + 2);
- MALLOC(inputwcs, wchar_t, end - (unsigned char const *)begin + 2);
+ MALLOC(mblen_buf, unsigned char, end - begin + 2);
+ MALLOC(inputwcs, wchar_t, end - begin + 2);
memset(&mbs, 0, sizeof(mbstate_t));
remain_bytes = 0;
- for (i = 0; i < end - (unsigned char const *)begin + 1; i++)
+ for (i = 0; i < end - begin + 1; i++)
{
if (remain_bytes == 0)
{
remain_bytes
- = mbrtowc(inputwcs + i, begin + i,
- end - (unsigned char const *)begin - i + 1, &mbs);
+ = mbrtowc(inputwcs + i, begin + i, end - begin - i + 1, &mbs);
if (remain_bytes <= 1)
{
remain_bytes = 0;
@@ -2825,6 +2882,9 @@
if (MB_CUR_MAX > 1)
while ((t = trans[s]))
{
+ if ((char *) p > end)
+ break;
+ s1 = s;
if (d->states[s].mbps.nelem != 0)
{
/* Can match with a multibyte character( and multi character
@@ -2835,7 +2895,7 @@
nextp = p;
s = transit_state(d, s, &nextp);
- p = nextp;
+ p = (unsigned char *)nextp;
/* Trans table might be updated. */
trans = d->trans;
@@ -2848,25 +2908,16 @@
}
else
#endif /* MBS_SUPPORT */
- while ((t = trans[s]))
- s = t[*p++];
-
- if (s < 0)
- {
- if (p == end)
- {
-#ifdef MBS_SUPPORT
- if (MB_CUR_MAX > 1)
- {
- free(mblen_buf);
- free(inputwcs);
- }
-#endif /* MBS_SUPPORT */
- return (size_t) -1;
+ while ((t = trans[s]) != 0) { /* hand-optimized loop */
+ s1 = t[*p++];
+ if ((t = trans[s1]) == 0) {
+ tmp = s ; s = s1 ; s1 = tmp ; /* swap */
+ break;
}
- s = 0;
+ s = t[*p++];
}
- else if ((t = d->fails[s]))
+
+ if (s >= 0 && p <= (unsigned char *) end && d->fails[s])
{
if (d->success[s] & sbit[*p])
{
@@ -2879,37 +2930,58 @@
free(inputwcs);
}
#endif /* MBS_SUPPORT */
- return (char const *) p - begin;
+ return (char *) p;
}
+ s1 = s;
#ifdef MBS_SUPPORT
if (MB_CUR_MAX > 1)
{
- SKIP_REMAINS_MB_IF_INITIAL_STATE(s, p);
- if (d->states[s].mbps.nelem != 0)
- {
- /* Can match with a multibyte character( and multi
- character collating element). */
unsigned char const *nextp;
nextp = p;
s = transit_state(d, s, &nextp);
- p = nextp;
+ p = (unsigned char *)nextp;
/* Trans table might be updated. */
trans = d->trans;
}
else
- s = t[*p++];
+#endif /* MBS_SUPPORT */
+ s = d->fails[s][*p++];
+ continue;
+ }
+
+ /* If the previous character was a newline, count it. */
+ if (count && (char *) p <= end && p[-1] == eol)
+ ++*count;
+
+ /* Check if we've run off the end of the buffer. */
+ if ((char *) p > end)
+ {
+#ifdef MBS_SUPPORT
+ if (MB_CUR_MAX > 1)
+ {
+ free(mblen_buf);
+ free(inputwcs);
}
- else
#endif /* MBS_SUPPORT */
- s = t[*p++];
+ return NULL;
}
- else
+
+ if (s >= 0)
{
build_state(s, d);
trans = d->trans;
+ continue;
+ }
+
+ if (p[-1] == eol && newline)
+ {
+ s = d->newlines[s1];
+ continue;
}
+
+ s = 0;
}
}
@@ -2940,6 +3012,10 @@
d->tralloc = 0;
d->musts = 0;
+ d->realtrans = 0;
+ d->fails = 0;
+ d->newlines = 0;
+ d->success = 0;
}
/* Parse and analyze a single string of the given length. */
@@ -3036,6 +3112,7 @@
free((ptr_t) d->fails[i]);
if (d->realtrans) free((ptr_t) d->realtrans);
if (d->fails) free((ptr_t) d->fails);
+ if (d->newlines) free((ptr_t) d->newlines);
if (d->success) free((ptr_t) d->success);
for (dm = d->musts; dm; dm = ndm)
{
--- search.c.orig 2001-04-19 06:42:14.000000000 +0300
+++ search.c 2004-01-14 14:42:56.000000000 +0200
@@ -335,7 +335,8 @@
static size_t
EGexecute (char const *buf, size_t size, size_t *match_size, int exact)
{
- register char const *buflim, *beg, *end;
+ register char const *beg;
+ char *buflim, *end, save;
char eol = eolbyte;
int backref, start, len;
struct kwsmatch kwsm;
@@ -349,9 +350,9 @@
mb_properties = check_multibyte_string(buf, size);
#endif /* MBS_SUPPORT */
- buflim = buf + size;
+ buflim = (char *)buf + size;
- for (beg = end = buf; end < buflim; beg = end)
+ for (beg = end = (char *)buf; end < buflim; beg = end + 1)
{
if (!exact)
{
@@ -371,6 +372,8 @@
/* Narrow down to the line containing the candidate, and
run it through DFA. */
end = memchr(beg, eol, buflim - beg);
+ if (!end)
+ end = buflim;
end++;
#ifdef MBS_SUPPORT
if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
@@ -378,20 +381,28 @@
#endif
while (beg > buf && beg[-1] != eol)
--beg;
+ save = *end;
if (kwsm.index < kwset_exact_matches)
goto success;
- if (dfaexec (&dfa, beg, end - beg, &backref) == (size_t) -1)
- continue;
+ if (dfaexec(&dfa, beg, end, 0, (int *) 0, &backref) == NULL)
+ {
+ *end = save;
+ continue;
+ }
+ *end = save;
}
else
{
/* No good fixed strings; start with DFA. */
- size_t offset = dfaexec (&dfa, beg, buflim - beg, &backref);
- if (offset == (size_t) -1)
+ save = *buflim;
+ beg = dfaexec(&dfa, beg, buflim, 0, (int *) 0, &backref);
+ *buflim = save;
+ if (!beg)
break;
/* Narrow down to the line we've found. */
- beg += offset;
end = memchr (beg, eol, buflim - beg);
+ if (!end)
+ end = buflim;
end++;
while (beg > buf && beg[-1] != eol)
--beg;
@@ -401,7 +412,7 @@
goto success;
}
else
- end = beg + size;
+ end = (char *)beg + size;
/* If we've made it to this point, this means DFA has seen
a probable match, and we need to run it through Regex. */
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- grep 2.5.1: dfa matcher lost ability to handle embedded newlines,
Aharon Robbins <=