bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[RFC] Add prototype pure-DFA matcher


From: Paolo Bonzini
Subject: [RFC] Add prototype pure-DFA matcher
Date: Sat, 5 Dec 2009 00:51:03 +0100

Hi all, this is an experiment in adding a pure DFA matcher to gnulib.
The code is basically taken from regexec.c, but butchered to eliminate
registers and backreferences (3000 lines of code less...).  I started
it just after looking at the depressing list of bugs in GNU grep's
dfa.c.  Since deleting code is much easier than writing it, it took me
an evening to produce what I am posting---a good part of it spent
writing the example below as well as this message.

The only missing functionality from dfa.c is AFAIK the code to extract
"must be present" words.  Just like GNU grep's dfa.c, if a backreference
is found the matcher returns an error (REG_ESUBREG) and the caller should
revert to the full-blown matcher.

Since it can eliminate some of the processing, this matcher can be a
bit faster than regexec.c, though the inner loop is unfortunately not
as sweet as the well known DFA loop, which could be something like

  if (!s || *p == '\n')
    return;
  if (!s->trtable)
    s->trtable = build_trtable (s);
  s = s->trtable[*p++];

However, there is time for improvements later.  Right now, I've focused on
keeping the code as similar as the one in regexec.c.  The only function
that had a non-trivial amount of churn is check_matching (plus of course
the new function re_search_dfa).

This is in no way appropriate for inclusion now.  In particular, it would
be necessary to have the possibility of including this matcher but keep
the NFA matcher from glibc (since this one does not have full-blown i18n
support that is present in glibc).

Also, the changes to lib/regcomp.c and lib/regexec.c (which are just an
optimization) should be separated and posted to glibc.  Another thing
to be done would be to support collation symbols like backreferences
(i.e. tell the caller to fallback to glibc).  This is a simple patch to
lib/regcomp.c, but I haven't done it yet.

I'm posting this for comments, ideas, and for someone to pick up and
finish the work.  Bruno may also like to look at the code for his
libunistring.  If he can get the code to work without the re_string
APIs and directly on UTF-8, that will also be quite a bit faster and
it would be nice to use in GNU grep.

* lib/regcomp.c (optimize_utf8, parse_expression): Do not set has_mb_node
for backreferences.
* lib/regexec.c (re_search_internal): Look at nbackref in addition to
has_mb_node.

* lib/regex.h (re_search_dfa): New.
* lib/regex.c [HAVE_REGEX_DFA]: Include it.
* lib/regdfa.c: New.
* m4/regex.m4 (gl_REGEX): Pick ac_use_included_regex from prolog.
* m4/regexdfa.m4: New.
* modules/regexdfa: New.
---
        Here is an example of using the new API:

        #include "config.h"
        #include "string.h"
        #include <limits.h>
        #include <regex.h>
                   
        int main()
        {
          static struct re_pattern_buffer regex;
          int i;
          const char *s;
        
          /* This should fail.  */
          re_set_syntax (RE_SYNTAX_EGREP);
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("a[^x]b", 6, &regex);
          if (s)
            return 2;
          if (re_search_dfa (&regex, "a\nb", 3, 0, 3, 3, 0) != REG_NOMATCH)
            return 1;
        
          /* This should pass.  */
          re_set_syntax (RE_SYNTAX_EGREP & ~RE_HAT_LISTS_NOT_NEWLINE);
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("a[^x]b", 6, &regex);
          if (s)
            return 2;
          if (re_search_dfa (&regex, "xyza\nb", 6, 0, 6, 6, 0) != REG_NOERROR)
            return 1;
        
          /* Test backreference (lack of) support.  */
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("(a)(\\1|$)", 9, &regex);
          if (s)
            return 2;
        
          if (re_search_dfa (&regex, "ca", 2, 0, 2, 2, 0) != REG_NOERROR)
            return 1;
          if (re_search_dfa (&regex, "aab", 3, 0, 3, 3, 0) != REG_ESUBREG)
            return 1;
          if (re_search_dfa (&regex, "aa", 2, 0, 2, 2, 0) != REG_NOERROR)
            return 1;
          if (re_search_dfa (&regex, "cc", 2, 0, 2, 2, 0) != REG_NOMATCH)
            return 1;
        
          /* Test backreference (lack of) support.  */
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("(a)(\\1|b)", 9, &regex);
          if (s)
            return 2;
        
          if (re_search_dfa (&regex, "cab", 3, 0, 3, 3, 0) == REG_NOMATCH)
            return 1;
          if (re_search_dfa (&regex, "aab", 3, 0, 3, 3, 0) == REG_NOMATCH)
            return 1;
          if (re_search_dfa (&regex, "aa", 2, 0, 2, 2, 0) != REG_ESUBREG)
            return 1;
          if (re_search_dfa (&regex, "aac", 3, 0, 3, 3, 0) != REG_ESUBREG)
            return 1;
        
          /* regdfa.c is not restricted to leftmost matches!  */
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("((a)\\2)|b", 9, &regex);
          if (s)
            return 2;
        
          if (re_search_dfa (&regex, "cab", 3, 0, 3, 3, 0) != REG_NOERROR)
            return 1;
          if (re_search_dfa (&regex, "aab", 3, 0, 3, 3, 0) != REG_NOERROR)
            return 1;
        
          /* The version of regex.c in older versions of gnulib
             ignored RE_ICASE.  Detect that problem too.  */
          re_set_syntax (RE_SYNTAX_EMACS | RE_ICASE);
          memset (&regex, 0, sizeof regex);
          s = re_compile_pattern ("x", 1, &regex);
          if (s)
            return 2;
        
          if (re_search_dfa (&regex, "WXY", 3, 0, 3, 3, 0) != REG_NOERROR)
            return 1;
        
          return 0;
        }

 lib/regcomp.c    |    3 +-
 lib/regdfa.c     | 1399 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/regex.c      |    4 +
 lib/regex.h      |   12 +
 lib/regexec.c    |    2 +-
 m4/regex.m4      |    9 +-
 m4/regexdfa.m4   |   17 +
 modules/regexdfa |   23 +
 8 files changed, 1463 insertions(+), 6 deletions(-)
 create mode 100644 lib/regdfa.c
 create mode 100644 m4/regexdfa.m4
 create mode 100644 modules/regexdfa

diff --git a/lib/regcomp.c b/lib/regcomp.c
index 6aef405..e6c5ab9 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -1127,7 +1127,7 @@ optimize_utf8 (re_dfa_t *dfa)
   /* The search can be in single byte locale.  */
   dfa->mb_cur_max = 1;
   dfa->is_utf8 = 0;
-  dfa->has_mb_node = dfa->nbackref > 0 || has_period;
+  dfa->has_mb_node = has_period;
 }
 #endif
 
@@ -2268,7 +2268,6 @@ parse_expression (re_string_t *regexp, regex_t *preg, 
re_token_t *token,
          return NULL;
        }
       ++dfa->nbackref;
-      dfa->has_mb_node = 1;
       break;
     case OP_OPEN_DUP_NUM:
       if (syntax & RE_CONTEXT_INVALID_DUP)
diff --git a/lib/regdfa.c b/lib/regdfa.c
new file mode 100644
index 0000000..2460fdf
--- /dev/null
+++ b/lib/regdfa.c
@@ -0,0 +1,1399 @@
+/* Extended regular expression matching and search library.
+   Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009
+   Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Isamu Hasegawa <address@hidden>.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License along
+   with this program; if not, write to the Free Software Foundation,
+   Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */
+
+#define check_matching check_matching_dfa
+#define check_halt_state_context check_halt_state_context_dfa
+#define find_recover_state find_recover_state_dfa
+#define transit_state transit_state_dfa
+#define merge_state_with_log merge_state_with_log_dfa
+#define transit_state_mb transit_state_mb_dfa
+#define build_trtable build_trtable_dfa
+#define check_node_accept_bytes check_node_accept_bytes_dfa
+#define group_nodes_into_DFAstates group_nodes_into_DFAstates_dfa
+#define extend_buffers extend_buffers_dfa
+#define acquire_init_state_context acquire_init_state_context_dfa
+#define check_halt_node_context check_halt_node_context_dfa
+#define clean_state_log_if_needed clean_state_log_if_needed_dfa
+
+static reg_errcode_t check_matching (re_match_context_t *mctx) 
internal_function;
+static Idx check_halt_state_context (const re_match_context_t *mctx,
+                                    const re_dfastate_t *state, Idx idx)
+     internal_function;
+#ifdef RE_ENABLE_I18N
+static re_dfastate_t *find_recover_state (reg_errcode_t *err,
+                                        re_match_context_t *mctx) 
internal_function;
+#endif
+static re_dfastate_t *transit_state (reg_errcode_t *err,
+                                    re_match_context_t *mctx,
+                                    re_dfastate_t *state) internal_function;
+#ifdef RE_ENABLE_I18N
+static re_dfastate_t *merge_state_with_log (reg_errcode_t *err,
+                                           re_match_context_t *mctx,
+                                           re_dfastate_t *next_state)
+     internal_function;
+static reg_errcode_t transit_state_mb (re_match_context_t *mctx,
+                                      re_dfastate_t *pstate)
+     internal_function;
+#endif /* RE_ENABLE_I18N */
+static bool build_trtable (const re_dfa_t *dfa,
+                          re_dfastate_t *state) internal_function;
+#ifdef RE_ENABLE_I18N
+static int check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
+                                   const re_string_t *input, Idx idx)
+     internal_function;
+#endif /* RE_ENABLE_I18N */
+static Idx group_nodes_into_DFAstates (const re_dfa_t *dfa,
+                                      const re_dfastate_t *state,
+                                      re_node_set *states_node,
+                                      bitset_t *states_ch) internal_function;
+static reg_errcode_t extend_buffers (re_match_context_t *mctx)
+     internal_function;
+
+/* re_search_dfa() first tries matching at index START, then it tries to match
+   starting from index START + 1, and so on.  The last start position tried
+   is START + RANGE.  (Thus RANGE = 0 forces re_search to anchor the match).
+
+   The parameter STOP of re_{match,search}_2 specifies that no match exceeding
+   the first STOP characters of the concatenation of the strings should be
+   concerned.
+
+   If REGS is not NULL, and BUFP->no_sub is not set, the offsets of the match
+   and all groups is stored in REGS.  (For the "_2" variants, the offsets are
+   computed relative to the concatenation, not relative to the individual
+   strings.)
+
+   On success, re_search_dfa returns 0.  Other return values means no
+   match was found, a backreference was encountered, or an internal error.  */
+
+reg_errcode_t
+re_search_dfa (struct re_pattern_buffer *bufp,
+              const char *string, Idx length,
+                      Idx start, regoff_t range, Idx stop,
+                      int eflags)
+{
+  const regex_t *preg = bufp;
+  reg_errcode_t result;
+  re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
+  Idx last_start = start + range;
+  reg_errcode_t err, rval;
+  Idx left_lim, right_lim;
+  int incr;
+  int match_kind;
+  Idx match_first;
+  bool sb;
+  int ch;
+  char *fastmap;
+  RE_TRANSLATE_TYPE t = preg->translate;
+
+#if defined _LIBC || (defined __STDC_VERSION__ && __STDC_VERSION__ >= 199901L)
+  re_match_context_t mctx = { .dfa = dfa };
+#else
+  re_match_context_t mctx;
+  memset (&mctx, '\0', sizeof (re_match_context_t));
+  mctx.dfa = dfa;
+#endif
+
+  /* Check for out-of-range.  */
+  if (BE (start < 0 || start > length, 0))
+    return -1;
+  if (BE (length < last_start || (0 <= range && last_start < start), 0))
+    last_start = length;
+  else if (BE (last_start < 0 || (range < 0 && start <= last_start), 0))
+    last_start = 0;
+
+  __libc_lock_lock (dfa->lock);
+
+  /* Compile fastmap if we haven't yet.  */
+  if (start < last_start && bufp->fastmap != NULL && !bufp->fastmap_accurate)
+    re_compile_fastmap (bufp);
+
+  fastmap = ((preg->fastmap != NULL && preg->fastmap_accurate
+             && start != last_start && !preg->can_be_null)
+            ? preg->fastmap : NULL);
+
+  /* Check if the DFA haven't been compiled.  */
+  if (BE (preg->used == 0 || dfa->init_state == NULL
+         || dfa->init_state_word == NULL || dfa->init_state_nl == NULL
+         || dfa->init_state_begbuf == NULL, 0))
+    return REG_NOMATCH;
+
+#ifdef DEBUG
+  assert (0 <= last_start && last_start <= length);
+#endif
+  rval = REG_NOMATCH;
+
+  /* If initial states with non-begbuf contexts have no elements,
+     the regex must be anchored.  If preg->newline_anchor is set,
+     we'll never use init_state_nl, so do not check it.  */
+  if (dfa->init_state->nodes.nelem == 0
+      && dfa->init_state_word->nodes.nelem == 0
+      && (dfa->init_state_nl->nodes.nelem == 0
+         || !preg->newline_anchor))
+    {
+      if (start != 0 && last_start != 0)
+        return REG_NOMATCH;
+      start = last_start = 0;
+    }
+
+  err = re_string_allocate (&mctx.input, string, length, dfa->nodes_len + 1,
+                           preg->translate, (preg->syntax & RE_ICASE) != 0,
+                           dfa);
+  if (BE (err != REG_NOERROR, 0))
+    goto free_return;
+  mctx.input.stop = stop;
+  mctx.input.raw_stop = stop;
+  mctx.input.newline_anchor = preg->newline_anchor;
+  mctx.eflags = eflags;
+
+  /* We will log all the DFA states through which the dfa pass,
+     if this dfa has a node which can accept multibyte character or
+     multi character collating element.  */
+#ifdef RE_ENABLE_I18N
+  if (dfa->has_mb_node)
+    {
+      /* Avoid overflow.  */
+      if (BE (SIZE_MAX / sizeof (re_dfastate_t *) <= mctx.input.bufs_len, 0))
+       {
+         err = REG_ESPACE;
+         goto free_return;
+       }
+
+      mctx.state_log = re_malloc (re_dfastate_t *, mctx.input.bufs_len + 1);
+      if (BE (mctx.state_log == NULL, 0))
+       {
+         err = REG_ESPACE;
+         goto free_return;
+       }
+    }
+  else
+#endif
+    mctx.state_log = NULL;
+
+  match_first = start;
+  mctx.input.tip_context = (eflags & REG_NOTBOL) ? CONTEXT_BEGBUF
+                          : CONTEXT_NEWLINE | CONTEXT_BEGBUF;
+
+  /* Check incrementally whether of not the input string match.  */
+  incr = (last_start < start) ? -1 : 1;
+  left_lim = (last_start < start) ? last_start : start;
+  right_lim = (last_start < start) ? start : last_start;
+  sb = dfa->mb_cur_max == 1;
+  match_kind =
+    (fastmap
+     ? ((sb || !(preg->syntax & RE_ICASE || t) ? 4 : 0)
+       | (start <= last_start ? 2 : 0)
+       | (t != NULL ? 1 : 0))
+     : 8);
+
+  for (;; match_first += incr)
+    {
+      err = REG_NOMATCH;
+      if (match_first < left_lim || right_lim < match_first)
+       goto free_return;
+
+      /* Advance as rapidly as possible through the string, until we
+        find a plausible place to start matching.  This may be done
+        with varying efficiency, so there are various possibilities:
+        only the most common of them are specialized, in order to
+        save on code size.  We use a switch statement for speed.  */
+      switch (match_kind)
+       {
+       case 8:
+         /* No fastmap.  */
+         break;
+
+       case 7:
+         /* Fastmap with single-byte translation, match forward.  */
+         while (BE (match_first < right_lim, 1)
+                && !fastmap[t[(unsigned char) string[match_first]]])
+           ++match_first;
+         goto forward_match_found_start_or_reached_end;
+
+       case 6:
+         /* Fastmap without translation, match forward.  */
+         while (BE (match_first < right_lim, 1)
+                && !fastmap[(unsigned char) string[match_first]])
+           ++match_first;
+
+       forward_match_found_start_or_reached_end:
+         if (BE (match_first == right_lim, 0))
+           {
+             ch = match_first >= length
+                      ? 0 : (unsigned char) string[match_first];
+             if (!fastmap[t ? t[ch] : ch])
+               goto free_return;
+           }
+         break;
+
+       case 4:
+       case 5:
+         /* Fastmap without multi-byte translation, match backwards.  */
+         while (match_first >= left_lim)
+           {
+             ch = match_first >= length
+                      ? 0 : (unsigned char) string[match_first];
+             if (fastmap[t ? t[ch] : ch])
+               break;
+             --match_first;
+           }
+         if (match_first < left_lim)
+           goto free_return;
+         break;
+
+       default:
+         /* In this case, we can't determine easily the current byte,
+            since it might be a component byte of a multibyte
+            character.  Then we use the constructed buffer instead.  */
+         for (;;)
+           {
+             /* If MATCH_FIRST is out of the valid range, reconstruct the
+                buffers.  */
+             __re_size_t offset = match_first - mctx.input.raw_mbs_idx;
+             if (BE (offset >= (__re_size_t) mctx.input.valid_raw_len, 0))
+               {
+                 err = re_string_reconstruct (&mctx.input, match_first,
+                                              eflags);
+                 if (BE (err != REG_NOERROR, 0))
+                   goto free_return;
+
+                 offset = match_first - mctx.input.raw_mbs_idx;
+               }
+             /* If MATCH_FIRST is out of the buffer, leave it as '\0'.
+                Note that MATCH_FIRST must not be smaller than 0.  */
+             ch = (match_first >= length
+                   ? 0 : re_string_byte_at (&mctx.input, offset));
+             if (fastmap[ch])
+               break;
+             match_first += incr;
+             if (match_first < left_lim || match_first > right_lim)
+               {
+                 err = REG_NOMATCH;
+                 goto free_return;
+               }
+           }
+         break;
+       }
+
+      /* Reconstruct the buffers so that the matcher can assume that
+        the matching starts from the beginning of the buffer.  */
+      err = re_string_reconstruct (&mctx.input, match_first, eflags);
+      if (BE (err != REG_NOERROR, 0))
+       goto free_return;
+
+#ifdef RE_ENABLE_I18N
+     /* Don't consider this char as a possible match start if it part,
+       yet isn't the head, of a multibyte character.  */
+      if (!sb && !re_string_first_byte (&mctx.input, 0))
+       continue;
+#endif
+
+      /* It seems to be appropriate one, then use the matcher.  */
+      err = check_matching (&mctx);
+
+      /* We found a partial match.  Record it and look for a full match.  */
+      if (err == REG_ECOLLATE || err == REG_ESUBREG)
+        {
+         if (rval == REG_NOMATCH)
+           rval = err;
+       }
+
+      /* We found a match, or we had an error.  Exit.  */
+      else if (err != REG_NOMATCH)
+        {
+         rval = err;
+         break;
+       }
+    }
+
+ free_return:
+#ifdef RE_ENABLE_I18N
+  re_free (mctx.state_log);
+#endif
+  re_string_destruct (&mctx.input);
+  __libc_lock_unlock (dfa->lock);
+  return rval;
+}
+
+/* Acquire an initial state and return it.
+   We must select appropriate initial state depending on the context,
+   since initial states may have constraints like "\<", "^", etc..  */
+
+static inline re_dfastate_t *
+__attribute ((always_inline)) internal_function
+acquire_init_state_context (reg_errcode_t *err, const re_match_context_t *mctx,
+                           Idx idx)
+{
+  const re_dfa_t *const dfa = mctx->dfa;
+  if (dfa->init_state->has_constraint)
+    {
+      unsigned int context;
+      context = re_string_context_at (&mctx->input, idx - 1, mctx->eflags);
+      if (IS_WORD_CONTEXT (context))
+       return dfa->init_state_word;
+      else if (IS_ORDINARY_CONTEXT (context))
+       return dfa->init_state;
+      else if (IS_BEGBUF_CONTEXT (context) && IS_NEWLINE_CONTEXT (context))
+       return dfa->init_state_begbuf;
+      else if (IS_NEWLINE_CONTEXT (context))
+       return dfa->init_state_nl;
+      else if (IS_BEGBUF_CONTEXT (context))
+       {
+         /* It is relatively rare case, then calculate on demand.  */
+         return re_acquire_state_context (err, dfa,
+                                          dfa->init_state->entrance_nodes,
+                                          context);
+       }
+      else
+       /* Must not happen?  */
+       return dfa->init_state;
+    }
+  else
+    return dfa->init_state;
+}
+
+/* Check whether the regular expression match input string INPUT or not,
+   and return the index where the matching end.  Return an error code.
+   Note that the matcher assume that the maching starts from the current
+   index of the buffer.  */
+
+static reg_errcode_t
+internal_function
+check_matching (re_match_context_t *mctx)
+{
+  const re_dfa_t *const dfa = mctx->dfa;
+  reg_errcode_t err;
+  Idx cur_str_idx = re_string_cur_idx (&mctx->input);
+  re_dfastate_t *cur_state;
+
+  err = REG_NOERROR;
+  cur_state = acquire_init_state_context (&err, mctx, cur_str_idx);
+  /* An initial state must not be NULL (invalid).  */
+  if (BE (cur_state == NULL, 0))
+    {
+      assert (err == REG_ESPACE);
+      return err;
+    }
+
+#ifdef RE_ENABLE_I18N
+  if (mctx->state_log != NULL)
+    mctx->state_log[cur_str_idx] = cur_state;
+#endif
+
+  /* If the RE accepts NULL string.  */
+  if (BE (cur_state->halt, 0))
+    {
+      if (!cur_state->has_constraint
+         || check_halt_state_context (mctx, cur_state, cur_str_idx))
+       return REG_NOERROR;
+    }
+
+  mctx->state_log_top = 0;
+  while (!re_string_eoi (&mctx->input))
+    {
+      re_dfastate_t *old_state = cur_state;
+      Idx next_char_idx = re_string_cur_idx (&mctx->input) + 1;
+
+      if (BE (cur_state->has_backref, 0))
+       return REG_ESUBREG;
+
+      if (BE (next_char_idx >= mctx->input.bufs_len, 0)
+          || (BE (next_char_idx >= mctx->input.valid_len, 0)
+              && mctx->input.valid_len < mctx->input.len))
+        {
+          err = extend_buffers (mctx);
+          if (BE (err != REG_NOERROR, 0))
+           {
+             assert (err == REG_ESPACE);
+             return err;
+           }
+        }
+
+      cur_state = transit_state (&err, mctx, cur_state);
+#ifdef RE_ENABLE_I18N
+      if (mctx->state_log != NULL)
+       cur_state = merge_state_with_log (&err, mctx, cur_state);
+#endif
+
+      if (cur_state == NULL)
+       {
+         /* Reached the invalid state or an error.  Try to recover a valid
+            state using the state log, if available and if we have not
+            already found a valid (even if not the longest) match.  */
+         if (BE (err != REG_NOERROR, 0))
+           return err;
+
+#ifdef RE_ENABLE_I18N
+         if (mctx->state_log == NULL
+             || (cur_state = find_recover_state (&err, mctx)) == NULL)
+#endif
+           return REG_NOMATCH;
+       }
+
+      if (cur_state->halt)
+       {
+         /* Reached a halt state.
+            Check the halt state can satisfy the current context.  */
+         if (!cur_state->has_constraint
+             || check_halt_state_context (mctx, cur_state,
+                                          re_string_cur_idx (&mctx->input)))
+           {
+             /* We found an appropriate halt state.  */
+             return REG_NOERROR;
+           }
+       }
+    }
+
+  return REG_NOMATCH;
+}
+
+/* Check NODE match the current context.  */
+
+static bool
+internal_function
+check_halt_node_context (const re_dfa_t *dfa, Idx node, unsigned int context)
+{
+  re_token_type_t type = dfa->nodes[node].type;
+  unsigned int constraint = dfa->nodes[node].constraint;
+  if (type != END_OF_RE)
+    return false;
+  if (!constraint)
+    return true;
+  if (NOT_SATISFY_NEXT_CONSTRAINT (constraint, context))
+    return false;
+  return true;
+}
+
+/* Check the halt state STATE match the current context.
+   Return 0 if not match, if the node, STATE has, is a halt node and
+   match the context, return the node.  */
+
+static Idx
+internal_function
+check_halt_state_context (const re_match_context_t *mctx,
+                         const re_dfastate_t *state, Idx idx)
+{
+  Idx i;
+  unsigned int context;
+#ifdef DEBUG
+  assert (state->halt);
+#endif
+  context = re_string_context_at (&mctx->input, idx, mctx->eflags);
+  for (i = 0; i < state->nodes.nelem; ++i)
+    if (check_halt_node_context (mctx->dfa, state->nodes.elems[i], context))
+      return state->nodes.elems[i];
+  return 0;
+}
+
+
+#ifdef RE_ENABLE_I18N
+/* Helper functions.  */
+
+static reg_errcode_t
+internal_function
+clean_state_log_if_needed (re_match_context_t *mctx, Idx next_state_log_idx)
+{
+  Idx top = mctx->state_log_top;
+
+  if (next_state_log_idx >= mctx->input.bufs_len
+      || (next_state_log_idx >= mctx->input.valid_len
+         && mctx->input.valid_len < mctx->input.len))
+    {
+      reg_errcode_t err;
+      err = extend_buffers (mctx);
+      if (BE (err != REG_NOERROR, 0))
+       return err;
+    }
+
+  if (top < next_state_log_idx)
+    {
+      memset (mctx->state_log + top + 1, '\0',
+             sizeof (re_dfastate_t *) * (next_state_log_idx - top));
+      mctx->state_log_top = next_state_log_idx;
+    }
+  return REG_NOERROR;
+}
+#endif
+
+
+/* Functions for state transition.  */
+
+/* Return the next state to which the current state STATE will transit by
+   accepting the current input byte, and update STATE_LOG if necessary.
+   If STATE can accept a multibyte char/collating element/back reference
+   update the destination of STATE_LOG.  */
+
+static re_dfastate_t *
+internal_function
+transit_state (reg_errcode_t *err, re_match_context_t *mctx,
+              re_dfastate_t *state)
+{
+  re_dfastate_t **trtable;
+  unsigned char ch;
+
+#ifdef RE_ENABLE_I18N
+  /* If the current state can accept multibyte.  */
+  if (BE (state->accept_mb, 0))
+    {
+      *err = transit_state_mb (mctx, state);
+      if (BE (*err != REG_NOERROR, 0))
+       return NULL;
+    }
+#endif /* RE_ENABLE_I18N */
+
+  /* Then decide the next state with the transition table  */
+  ch = re_string_fetch_byte (&mctx->input);
+  for (;;)
+    {
+      trtable = state->trtable;
+      if (BE (trtable != NULL, 1))
+       return trtable[ch];
+
+      trtable = state->word_trtable;
+      if (BE (trtable != NULL, 1))
+        {
+         unsigned int context;
+         context
+           = re_string_context_at (&mctx->input,
+                                   re_string_cur_idx (&mctx->input) - 1,
+                                   mctx->eflags);
+         if (IS_WORD_CONTEXT (context))
+           return trtable[ch + SBC_MAX];
+         else
+           return trtable[ch];
+       }
+
+      if (!build_trtable (mctx->dfa, state))
+       {
+         *err = REG_ESPACE;
+         return NULL;
+       }
+
+      /* Retry, we now have a transition table.  */
+    }
+}
+
+#ifdef RE_ENABLE_I18N
+/* Update the state_log if we need */
+
+static re_dfastate_t *
+internal_function
+merge_state_with_log (reg_errcode_t *err, re_match_context_t *mctx,
+                     re_dfastate_t *next_state)
+{
+  const re_dfa_t *const dfa = mctx->dfa;
+  Idx cur_idx = re_string_cur_idx (&mctx->input);
+
+  if (cur_idx > mctx->state_log_top)
+    {
+      mctx->state_log[cur_idx] = next_state;
+      mctx->state_log_top = cur_idx;
+    }
+  else if (mctx->state_log[cur_idx] == 0)
+    {
+      mctx->state_log[cur_idx] = next_state;
+    }
+  else
+    {
+      re_dfastate_t *pstate;
+      unsigned int context;
+      re_node_set next_nodes, *log_nodes, *table_nodes = NULL;
+      /* If (state_log[cur_idx] != 0), it implies that cur_idx is
+         the destination of a multibyte char/collating element/
+         back reference.  Then the next state is the union set of
+         these destinations and the results of the transition table.  */
+      pstate = mctx->state_log[cur_idx];
+      log_nodes = pstate->entrance_nodes;
+      if (next_state != NULL)
+        {
+          table_nodes = next_state->entrance_nodes;
+          *err = re_node_set_init_union (&next_nodes, table_nodes,
+                                            log_nodes);
+          if (BE (*err != REG_NOERROR, 0))
+           return NULL;
+        }
+      else
+        next_nodes = *log_nodes;
+      /* Note: We already add the nodes of the initial state,
+        then we don't need to add them here.  */
+
+      context = re_string_context_at (&mctx->input,
+                                     re_string_cur_idx (&mctx->input) - 1,
+                                     mctx->eflags);
+      next_state = mctx->state_log[cur_idx]
+        = re_acquire_state_context (err, dfa, &next_nodes, context);
+      /* We don't need to check errors here, since the return value of
+         this function is next_state and ERR is already set.  */
+
+      if (table_nodes != NULL)
+        re_node_set_free (&next_nodes);
+    }
+
+  return next_state;
+}
+
+/* Skip bytes in the input that correspond to part of a
+   multi-byte match, then look in the log for a state
+   from which to restart matching.  */
+
+static re_dfastate_t *
+internal_function
+find_recover_state (reg_errcode_t *err, re_match_context_t *mctx)
+{
+  re_dfastate_t *cur_state;
+  do
+    {
+      Idx max = mctx->state_log_top;
+      Idx cur_str_idx = re_string_cur_idx (&mctx->input);
+
+      do
+       {
+          if (++cur_str_idx > max)
+            return NULL;
+          re_string_skip_bytes (&mctx->input, 1);
+       }
+      while (mctx->state_log[cur_str_idx] == NULL);
+
+      cur_state = merge_state_with_log (err, mctx, NULL);
+    }
+  while (*err == REG_NOERROR && cur_state == NULL);
+  return cur_state;
+}
+
+static reg_errcode_t
+internal_function
+transit_state_mb (re_match_context_t *mctx, re_dfastate_t *pstate)
+{
+  const re_dfa_t *const dfa = mctx->dfa;
+  reg_errcode_t err;
+  Idx i;
+
+  for (i = 0; i < pstate->nodes.nelem; ++i)
+    {
+      re_node_set dest_nodes, *new_nodes;
+      Idx cur_node_idx = pstate->nodes.elems[i];
+      int naccepted;
+      Idx dest_idx;
+      unsigned int context;
+      re_dfastate_t *dest_state;
+
+      if (!dfa->nodes[cur_node_idx].accept_mb)
+        continue;
+
+      if (dfa->nodes[cur_node_idx].constraint)
+       {
+         context = re_string_context_at (&mctx->input,
+                                         re_string_cur_idx (&mctx->input),
+                                         mctx->eflags);
+         if (NOT_SATISFY_NEXT_CONSTRAINT (dfa->nodes[cur_node_idx].constraint,
+                                          context))
+           continue;
+       }
+
+      /* How many bytes the node can accept?  */
+      naccepted = check_node_accept_bytes (dfa, cur_node_idx, &mctx->input,
+                                          re_string_cur_idx (&mctx->input));
+      if (naccepted == 0)
+       continue;
+
+      /* The node can accepts `naccepted' bytes.  */
+      dest_idx = re_string_cur_idx (&mctx->input) + naccepted;
+      err = clean_state_log_if_needed (mctx, dest_idx);
+      if (BE (err != REG_NOERROR, 0))
+       return err;
+#ifdef DEBUG
+      assert (dfa->nexts[cur_node_idx] != REG_MISSING);
+#endif
+      new_nodes = dfa->eclosures + dfa->nexts[cur_node_idx];
+
+      dest_state = mctx->state_log[dest_idx];
+      if (dest_state == NULL)
+       dest_nodes = *new_nodes;
+      else
+       {
+         err = re_node_set_init_union (&dest_nodes,
+                                       dest_state->entrance_nodes, new_nodes);
+         if (BE (err != REG_NOERROR, 0))
+           return err;
+       }
+      context = re_string_context_at (&mctx->input, dest_idx - 1,
+                                     mctx->eflags);
+      mctx->state_log[dest_idx]
+       = re_acquire_state_context (&err, dfa, &dest_nodes, context);
+      if (dest_state != NULL)
+       re_node_set_free (&dest_nodes);
+      if (BE (mctx->state_log[dest_idx] == NULL && err != REG_NOERROR, 0))
+       return err;
+    }
+  return REG_NOERROR;
+}
+#endif /* RE_ENABLE_I18N */
+
+/* Build transition table for the state.
+   Return true if successful.  */
+
+static bool
+internal_function
+build_trtable (const re_dfa_t *dfa, re_dfastate_t *state)
+{
+  reg_errcode_t err;
+  Idx i, j;
+  int ch;
+  bool need_word_trtable = false;
+  bitset_word_t elem, mask;
+  bool dests_node_malloced = false;
+  bool dest_states_malloced = false;
+  Idx ndests; /* Number of the destination states from `state'.  */
+  re_dfastate_t **trtable;
+  re_dfastate_t **dest_states = NULL, **dest_states_word, **dest_states_nl;
+  re_node_set follows, *dests_node;
+  bitset_t *dests_ch;
+  bitset_t acceptable;
+
+  struct dests_alloc
+  {
+    re_node_set dests_node[SBC_MAX];
+    bitset_t dests_ch[SBC_MAX];
+  } *dests_alloc;
+
+  /* We build DFA states which corresponds to the destination nodes
+     from `state'.  `dests_node[i]' represents the nodes which i-th
+     destination state contains, and `dests_ch[i]' represents the
+     characters which i-th destination state accepts.  */
+  if (__libc_use_alloca (sizeof (struct dests_alloc)))
+    dests_alloc = (struct dests_alloc *) alloca (sizeof (struct dests_alloc));
+  else
+    {
+      dests_alloc = re_malloc (struct dests_alloc, 1);
+      if (BE (dests_alloc == NULL, 0))
+       return false;
+      dests_node_malloced = true;
+    }
+  dests_node = dests_alloc->dests_node;
+  dests_ch = dests_alloc->dests_ch;
+
+  /* Initialize transiton table.  */
+  state->word_trtable = state->trtable = NULL;
+
+  /* At first, group all nodes belonging to `state' into several
+     destinations.  */
+  ndests = group_nodes_into_DFAstates (dfa, state, dests_node, dests_ch);
+  if (BE (! REG_VALID_NONZERO_INDEX (ndests), 0))
+    {
+      if (dests_node_malloced)
+       free (dests_alloc);
+      if (ndests == 0)
+       {
+         state->trtable = (re_dfastate_t **)
+           calloc (sizeof (re_dfastate_t *), SBC_MAX);
+         return true;
+       }
+      return false;
+    }
+
+  err = re_node_set_alloc (&follows, ndests + 1);
+  if (BE (err != REG_NOERROR, 0))
+    goto out_free;
+
+  /* Avoid arithmetic overflow in size calculation.  */
+  if (BE ((((SIZE_MAX - (sizeof (re_node_set) + sizeof (bitset_t)) * SBC_MAX)
+           / (3 * sizeof (re_dfastate_t *)))
+          < ndests),
+         0))
+    goto out_free;
+
+  if (__libc_use_alloca ((sizeof (re_node_set) + sizeof (bitset_t)) * SBC_MAX
+                        + ndests * 3 * sizeof (re_dfastate_t *)))
+    dest_states = (re_dfastate_t **)
+      alloca (ndests * 3 * sizeof (re_dfastate_t *));
+  else
+    {
+      dest_states = (re_dfastate_t **)
+       malloc (ndests * 3 * sizeof (re_dfastate_t *));
+      if (BE (dest_states == NULL, 0))
+       {
+out_free:
+         if (dest_states_malloced)
+           free (dest_states);
+         re_node_set_free (&follows);
+         for (i = 0; i < ndests; ++i)
+           re_node_set_free (dests_node + i);
+         if (dests_node_malloced)
+           free (dests_alloc);
+         return false;
+       }
+      dest_states_malloced = true;
+    }
+  dest_states_word = dest_states + ndests;
+  dest_states_nl = dest_states_word + ndests;
+  bitset_empty (acceptable);
+
+  /* Then build the states for all destinations.  */
+  for (i = 0; i < ndests; ++i)
+    {
+      Idx next_node;
+      re_node_set_empty (&follows);
+      /* Merge the follows of this destination states.  */
+      for (j = 0; j < dests_node[i].nelem; ++j)
+       {
+         next_node = dfa->nexts[dests_node[i].elems[j]];
+         if (next_node != REG_MISSING)
+           {
+             err = re_node_set_merge (&follows, dfa->eclosures + next_node);
+             if (BE (err != REG_NOERROR, 0))
+               goto out_free;
+           }
+       }
+      dest_states[i] = re_acquire_state_context (&err, dfa, &follows, 0);
+      if (BE (dest_states[i] == NULL && err != REG_NOERROR, 0))
+       goto out_free;
+      /* If the new state has context constraint,
+        build appropriate states for these contexts.  */
+      if (dest_states[i]->has_constraint)
+       {
+         dest_states_word[i] = re_acquire_state_context (&err, dfa, &follows,
+                                                         CONTEXT_WORD);
+         if (BE (dest_states_word[i] == NULL && err != REG_NOERROR, 0))
+           goto out_free;
+
+         if (dest_states[i] != dest_states_word[i] && dfa->mb_cur_max > 1)
+           need_word_trtable = true;
+
+         dest_states_nl[i] = re_acquire_state_context (&err, dfa, &follows,
+                                                       CONTEXT_NEWLINE);
+         if (BE (dest_states_nl[i] == NULL && err != REG_NOERROR, 0))
+           goto out_free;
+       }
+      else
+       {
+         dest_states_word[i] = dest_states[i];
+         dest_states_nl[i] = dest_states[i];
+       }
+      bitset_merge (acceptable, dests_ch[i]);
+    }
+
+  if (!BE (need_word_trtable, 0))
+    {
+      /* We don't care about whether the following character is a word
+        character, or we are in a single-byte character set so we can
+        discern by looking at the character code: allocate a
+        256-entry transition table.  */
+      trtable = state->trtable =
+       (re_dfastate_t **) calloc (sizeof (re_dfastate_t *), SBC_MAX);
+      if (BE (trtable == NULL, 0))
+       goto out_free;
+
+      /* For all characters ch...:  */
+      for (i = 0; i < BITSET_WORDS; ++i)
+       for (ch = i * BITSET_WORD_BITS, elem = acceptable[i], mask = 1;
+            elem;
+            mask <<= 1, elem >>= 1, ++ch)
+         if (BE (elem & 1, 0))
+           {
+             /* There must be exactly one destination which accepts
+                character ch.  See group_nodes_into_DFAstates.  */
+             for (j = 0; (dests_ch[j][i] & mask) == 0; ++j)
+               ;
+
+             /* j-th destination accepts the word character ch.  */
+             if (dfa->word_char[i] & mask)
+               trtable[ch] = dest_states_word[j];
+             else
+               trtable[ch] = dest_states[j];
+           }
+    }
+  else
+    {
+      /* We care about whether the following character is a word
+        character, and we are in a multi-byte character set: discern
+        by looking at the character code: build two 256-entry
+        transition tables, one starting at trtable[0] and one
+        starting at trtable[SBC_MAX].  */
+      trtable = state->word_trtable =
+       (re_dfastate_t **) calloc (sizeof (re_dfastate_t *), 2 * SBC_MAX);
+      if (BE (trtable == NULL, 0))
+       goto out_free;
+
+      /* For all characters ch...:  */
+      for (i = 0; i < BITSET_WORDS; ++i)
+       for (ch = i * BITSET_WORD_BITS, elem = acceptable[i], mask = 1;
+            elem;
+            mask <<= 1, elem >>= 1, ++ch)
+         if (BE (elem & 1, 0))
+           {
+             /* There must be exactly one destination which accepts
+                character ch.  See group_nodes_into_DFAstates.  */
+             for (j = 0; (dests_ch[j][i] & mask) == 0; ++j)
+               ;
+
+             /* j-th destination accepts the word character ch.  */
+             trtable[ch] = dest_states[j];
+             trtable[ch + SBC_MAX] = dest_states_word[j];
+           }
+    }
+
+  /* new line */
+  if (bitset_contain (acceptable, NEWLINE_CHAR))
+    {
+      /* The current state accepts newline character.  */
+      for (j = 0; j < ndests; ++j)
+       if (bitset_contain (dests_ch[j], NEWLINE_CHAR))
+         {
+           /* k-th destination accepts newline character.  */
+           trtable[NEWLINE_CHAR] = dest_states_nl[j];
+           if (need_word_trtable)
+             trtable[NEWLINE_CHAR + SBC_MAX] = dest_states_nl[j];
+           /* There must be only one destination which accepts
+              newline.  See group_nodes_into_DFAstates.  */
+           break;
+         }
+    }
+
+  if (dest_states_malloced)
+    free (dest_states);
+
+  re_node_set_free (&follows);
+  for (i = 0; i < ndests; ++i)
+    re_node_set_free (dests_node + i);
+
+  if (dests_node_malloced)
+    free (dests_alloc);
+
+  return true;
+}
+
+/* Group all nodes belonging to STATE into several destinations.
+   Then for all destinations, set the nodes belonging to the destination
+   to DESTS_NODE[i] and set the characters accepted by the destination
+   to DEST_CH[i].  This function return the number of destinations.  */
+
+static Idx
+internal_function
+group_nodes_into_DFAstates (const re_dfa_t *dfa, const re_dfastate_t *state,
+                           re_node_set *dests_node, bitset_t *dests_ch)
+{
+  reg_errcode_t err;
+  bool ok;
+  Idx i, j, k;
+  Idx ndests; /* Number of the destinations from `state'.  */
+  bitset_t accepts; /* Characters a node can accept.  */
+  const re_node_set *cur_nodes = &state->nodes;
+  bitset_empty (accepts);
+  ndests = 0;
+
+  /* For all the nodes belonging to `state',  */
+  for (i = 0; i < cur_nodes->nelem; ++i)
+    {
+      re_token_t *node = &dfa->nodes[cur_nodes->elems[i]];
+      re_token_type_t type = node->type;
+      unsigned int constraint = node->constraint;
+
+      /* Enumerate all single byte character this node can accept.  */
+      if (type == CHARACTER)
+       bitset_set (accepts, node->opr.c);
+      else if (type == SIMPLE_BRACKET)
+       {
+         bitset_merge (accepts, node->opr.sbcset);
+       }
+      else if (type == OP_PERIOD)
+       {
+#ifdef RE_ENABLE_I18N
+         if (dfa->mb_cur_max > 1)
+           bitset_merge (accepts, dfa->sb_char);
+         else
+#endif
+           bitset_set_all (accepts);
+         if (!(dfa->syntax & RE_DOT_NEWLINE))
+           bitset_clear (accepts, '\n');
+         if (dfa->syntax & RE_DOT_NOT_NULL)
+           bitset_clear (accepts, '\0');
+       }
+#ifdef RE_ENABLE_I18N
+      else if (type == OP_UTF8_PERIOD)
+        {
+         if (ASCII_CHARS % BITSET_WORD_BITS == 0)
+           memset (accepts, -1, ASCII_CHARS / CHAR_BIT);
+         else
+           bitset_merge (accepts, utf8_sb_map);
+         if (!(dfa->syntax & RE_DOT_NEWLINE))
+           bitset_clear (accepts, '\n');
+         if (dfa->syntax & RE_DOT_NOT_NULL)
+           bitset_clear (accepts, '\0');
+        }
+#endif
+      else
+       continue;
+
+      /* Check the `accepts' and sift the characters which are not
+        match it the context.  */
+      if (constraint)
+       {
+         if (constraint & NEXT_NEWLINE_CONSTRAINT)
+           {
+             bool accepts_newline = bitset_contain (accepts, NEWLINE_CHAR);
+             bitset_empty (accepts);
+             if (accepts_newline)
+               bitset_set (accepts, NEWLINE_CHAR);
+             else
+               continue;
+           }
+         if (constraint & NEXT_ENDBUF_CONSTRAINT)
+           {
+             bitset_empty (accepts);
+             continue;
+           }
+
+         if (constraint & NEXT_WORD_CONSTRAINT)
+           {
+             bitset_word_t any_set = 0;
+             if (type == CHARACTER && !node->word_char)
+               {
+                 bitset_empty (accepts);
+                 continue;
+               }
+#ifdef RE_ENABLE_I18N
+             if (dfa->mb_cur_max > 1)
+               for (j = 0; j < BITSET_WORDS; ++j)
+                 any_set |= (accepts[j] &= (dfa->word_char[j] | 
~dfa->sb_char[j]));
+             else
+#endif
+               for (j = 0; j < BITSET_WORDS; ++j)
+                 any_set |= (accepts[j] &= dfa->word_char[j]);
+             if (!any_set)
+               continue;
+           }
+         if (constraint & NEXT_NOTWORD_CONSTRAINT)
+           {
+             bitset_word_t any_set = 0;
+             if (type == CHARACTER && node->word_char)
+               {
+                 bitset_empty (accepts);
+                 continue;
+               }
+#ifdef RE_ENABLE_I18N
+             if (dfa->mb_cur_max > 1)
+               for (j = 0; j < BITSET_WORDS; ++j)
+                 any_set |= (accepts[j] &= ~(dfa->word_char[j] & 
dfa->sb_char[j]));
+             else
+#endif
+               for (j = 0; j < BITSET_WORDS; ++j)
+                 any_set |= (accepts[j] &= ~dfa->word_char[j]);
+             if (!any_set)
+               continue;
+           }
+       }
+
+      /* Then divide `accepts' into DFA states, or create a new
+        state.  Above, we make sure that accepts is not empty.  */
+      for (j = 0; j < ndests; ++j)
+       {
+         bitset_t intersec; /* Intersection sets, see below.  */
+         bitset_t remains;
+         /* Flags, see below.  */
+         bitset_word_t has_intersec, not_subset, not_consumed;
+
+         /* Optimization, skip if this state doesn't accept the character.  */
+         if (type == CHARACTER && !bitset_contain (dests_ch[j], node->opr.c))
+           continue;
+
+         /* Enumerate the intersection set of this state and `accepts'.  */
+         has_intersec = 0;
+         for (k = 0; k < BITSET_WORDS; ++k)
+           has_intersec |= intersec[k] = accepts[k] & dests_ch[j][k];
+         /* And skip if the intersection set is empty.  */
+         if (!has_intersec)
+           continue;
+
+         /* Then check if this state is a subset of `accepts'.  */
+         not_subset = not_consumed = 0;
+         for (k = 0; k < BITSET_WORDS; ++k)
+           {
+             not_subset |= remains[k] = ~accepts[k] & dests_ch[j][k];
+             not_consumed |= accepts[k] = accepts[k] & ~dests_ch[j][k];
+           }
+
+         /* If this state isn't a subset of `accepts', create a
+            new group state, which has the `remains'. */
+         if (not_subset)
+           {
+             bitset_copy (dests_ch[ndests], remains);
+             bitset_copy (dests_ch[j], intersec);
+             err = re_node_set_init_copy (dests_node + ndests, &dests_node[j]);
+             if (BE (err != REG_NOERROR, 0))
+               goto error_return;
+             ++ndests;
+           }
+
+         /* Put the position in the current group. */
+         ok = re_node_set_insert (&dests_node[j], cur_nodes->elems[i]);
+         if (BE (! ok, 0))
+           goto error_return;
+
+         /* If all characters are consumed, go to next node. */
+         if (!not_consumed)
+           break;
+       }
+      /* Some characters remain, create a new group. */
+      if (j == ndests)
+       {
+         bitset_copy (dests_ch[ndests], accepts);
+         err = re_node_set_init_1 (dests_node + ndests, cur_nodes->elems[i]);
+         if (BE (err != REG_NOERROR, 0))
+           goto error_return;
+         ++ndests;
+         bitset_empty (accepts);
+       }
+    }
+  return ndests;
+ error_return:
+  for (j = 0; j < ndests; ++j)
+    re_node_set_free (dests_node + j);
+  return REG_MISSING;
+}
+
+#ifdef RE_ENABLE_I18N
+/* Check how many bytes the node `dfa->nodes[node_idx]' accepts.
+   Return the number of the bytes the node accepts.
+   STR_IDX is the current index of the input string.
+
+   This function handles the nodes which can accept one character, or
+   one collating element like '.', '[a-z]', opposite to the other nodes
+   can only accept one byte.  */
+
+static int
+internal_function
+check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
+                        const re_string_t *input, Idx str_idx)
+{
+  const re_token_t *node = dfa->nodes + node_idx;
+  int char_len, elem_len;
+  Idx i;
+
+  if (BE (node->type == OP_UTF8_PERIOD, 0))
+    {
+      unsigned char c = re_string_byte_at (input, str_idx), d;
+      if (BE (c < 0xc2, 1))
+       return 0;
+
+      if (str_idx + 2 > input->len)
+       return 0;
+
+      d = re_string_byte_at (input, str_idx + 1);
+      if (c < 0xe0)
+       return (d < 0x80 || d > 0xbf) ? 0 : 2;
+      else if (c < 0xf0)
+       {
+         char_len = 3;
+         if (c == 0xe0 && d < 0xa0)
+           return 0;
+       }
+      else if (c < 0xf8)
+       {
+         char_len = 4;
+         if (c == 0xf0 && d < 0x90)
+           return 0;
+       }
+      else if (c < 0xfc)
+       {
+         char_len = 5;
+         if (c == 0xf8 && d < 0x88)
+           return 0;
+       }
+      else if (c < 0xfe)
+       {
+         char_len = 6;
+         if (c == 0xfc && d < 0x84)
+           return 0;
+       }
+      else
+       return 0;
+
+      if (str_idx + char_len > input->len)
+       return 0;
+
+      for (i = 1; i < char_len; ++i)
+       {
+         d = re_string_byte_at (input, str_idx + i);
+         if (d < 0x80 || d > 0xbf)
+           return 0;
+       }
+      return char_len;
+    }
+
+  char_len = re_string_char_size_at (input, str_idx);
+  if (node->type == OP_PERIOD)
+    {
+      if (char_len <= 1)
+        return 0;
+      /* FIXME: I don't think this if is needed, as both '\n'
+        and '\0' are char_len == 1.  */
+      /* '.' accepts any one character except the following two cases.  */
+      if ((!(dfa->syntax & RE_DOT_NEWLINE) &&
+          re_string_byte_at (input, str_idx) == '\n') ||
+         ((dfa->syntax & RE_DOT_NOT_NULL) &&
+          re_string_byte_at (input, str_idx) == '\0'))
+       return 0;
+      return char_len;
+    }
+
+  elem_len = re_string_elem_size_at (input, str_idx);
+  if ((elem_len <= 1 && char_len <= 1) || char_len == 0)
+    return 0;
+
+  if (node->type == COMPLEX_BRACKET)
+    {
+      const re_charset_t *cset = node->opr.mbcset;
+      int match_len = 0;
+      wchar_t wc = ((cset->nranges || cset->nchar_classes || cset->nmbchars)
+                   ? re_string_wchar_at (input, str_idx) : 0);
+
+      /* match with multibyte character?  */
+      for (i = 0; i < cset->nmbchars; ++i)
+       if (wc == cset->mbchars[i])
+         {
+           match_len = char_len;
+           goto check_node_accept_bytes_match;
+         }
+      /* match with character_class?  */
+      for (i = 0; i < cset->nchar_classes; ++i)
+       {
+         wctype_t wt = cset->char_classes[i];
+         if (__iswctype (wc, wt))
+           {
+             match_len = char_len;
+             goto check_node_accept_bytes_match;
+           }
+       }
+
+      /* match with collating_symbol?  I removed _LIBC only stuff here,
+         it must be merged back if this code ever gets into glibc.  */
+      if (cset->ncoll_syms)
+       return REG_ECOLLATE;
+      else
+       {
+         /* match with range expression?  */
+#if __GNUC__ >= 2 && ! (__STDC_VERSION__ < 199901L && __STRICT_ANSI__)
+         wchar_t cmp_buf[] = {L'\0', L'\0', wc, L'\0', L'\0', L'\0'};
+#else
+         wchar_t cmp_buf[] = {L'\0', L'\0', L'\0', L'\0', L'\0', L'\0'};
+         cmp_buf[2] = wc;
+#endif
+         for (i = 0; i < cset->nranges; ++i)
+           {
+             cmp_buf[0] = cset->range_starts[i];
+             cmp_buf[4] = cset->range_ends[i];
+             if (wcscoll (cmp_buf, cmp_buf + 2) <= 0
+                 && wcscoll (cmp_buf + 2, cmp_buf + 4) <= 0)
+               {
+                 match_len = char_len;
+                 goto check_node_accept_bytes_match;
+               }
+           }
+       }
+    check_node_accept_bytes_match:
+      if (!cset->non_match)
+       return match_len;
+      else
+       {
+         if (match_len > 0)
+           return 0;
+         else
+           return (elem_len > char_len) ? elem_len : char_len;
+       }
+    }
+  return 0;
+}
+#endif /* RE_ENABLE_I18N */
+
+/* Extend the buffers, if the buffers have run out.  */
+
+static reg_errcode_t
+internal_function
+extend_buffers (re_match_context_t *mctx)
+{
+  reg_errcode_t ret;
+  re_string_t *pstr = &mctx->input;
+
+  /* Avoid overflow.  */
+  if (BE (SIZE_MAX / 2 / sizeof (re_dfastate_t *) <= pstr->bufs_len, 0))
+    return REG_ESPACE;
+
+  /* Double the lengthes of the buffers.  */
+  ret = re_string_realloc_buffers (pstr, pstr->bufs_len * 2);
+  if (BE (ret != REG_NOERROR, 0))
+    return ret;
+
+#ifdef RE_ENABLE_I18N
+  if (mctx->state_log != NULL)
+    {
+      /* And double the length of state_log.  */
+      /* XXX We have no indication of the size of this buffer.  If this
+        allocation fail we have no indication that the state_log array
+        does not have the right size.  */
+      re_dfastate_t **new_array = re_realloc (mctx->state_log, re_dfastate_t *,
+                                             pstr->bufs_len + 1);
+      if (BE (new_array == NULL, 0))
+       return REG_ESPACE;
+      mctx->state_log = new_array;
+    }
+#endif
+
+  /* Then reconstruct the buffers.  */
+  if (pstr->icase)
+    {
+#ifdef RE_ENABLE_I18N
+      if (pstr->mb_cur_max > 1)
+       {
+         ret = build_wcs_upper_buffer (pstr);
+         if (BE (ret != REG_NOERROR, 0))
+           return ret;
+       }
+      else
+#endif /* RE_ENABLE_I18N  */
+       build_upper_buffer (pstr);
+    }
+  else
+    {
+#ifdef RE_ENABLE_I18N
+      if (pstr->mb_cur_max > 1)
+       build_wcs_buffer (pstr);
+      else
+#endif /* RE_ENABLE_I18N  */
+       {
+         if (pstr->trans != NULL)
+           re_string_translate_buffer (pstr);
+       }
+    }
+  return REG_NOERROR;
+}
+
+
+#undef check_matching
+#undef check_halt_state_context
+#undef find_recover_state
+#undef transit_state
+#undef merge_state_with_log
+#undef transit_state_mb
+#undef build_trtable
+#undef check_node_accept_bytes
+#undef group_nodes_into_DFAstates
+#undef extend_buffers
+#undef acquire_init_state_context
+#undef check_halt_node_context
+#undef clean_state_log_if_needed
diff --git a/lib/regex.c b/lib/regex.c
index d4eb726..368d057 100644
--- a/lib/regex.c
+++ b/lib/regex.c
@@ -61,6 +61,10 @@
 #include "regcomp.c"
 #include "regexec.c"
 
+#ifdef HAVE_REGEX_DFA
+#include "regdfa.c"
+#endif
+
 /* Binary backward compatibility.  */
 #if _LIBC
 # include <shlib-compat.h>
diff --git a/lib/regex.h b/lib/regex.h
index 7a79ca3..1ebe842 100644
--- a/lib/regex.h
+++ b/lib/regex.h
@@ -565,6 +565,18 @@ extern int re_compile_fastmap (struct re_pattern_buffer 
*__buffer);
 
 /* Search in the string STRING (with length LENGTH) for the pattern
    compiled into BUFFER.  Start searching at position START, for RANGE
+   characters.  Return REG_NOERROR for a match, REG_NOMATCH for no
+   match, REG_ECOLLATE or REG_ESUBEXP for a partial match, an error
+   code for an internal error.  */
+reg_errcode_t
+re_search_dfa (struct re_pattern_buffer *bufp,
+              const char *string, __re_idx_t length,
+              __re_idx_t start, regoff_t range, __re_idx_t stop,
+              int eflags);
+
+
+/* Search in the string STRING (with length LENGTH) for the pattern
+   compiled into BUFFER.  Start searching at position START, for RANGE
    characters.  Return the starting position of the match, -1 for no
    match, or -2 for an internal error.  Also return register
    information in REGS (if REGS and BUFFER->no_sub are nonzero).  */
diff --git a/lib/regexec.c b/lib/regexec.c
index 21a8166..97086c4 100644
--- a/lib/regexec.c
+++ b/lib/regexec.c
@@ -717,7 +717,7 @@ re_search_internal (const regex_t *preg,
      if nmatch > 1, or this dfa has "multibyte node", which is a
      back-reference or a node which can accept multibyte character or
      multi character collating element.  */
-  if (nmatch > 1 || dfa->has_mb_node)
+  if (nmatch > 1 || dfa->has_mb_node || dfa->nbackref)
     {
       /* Avoid overflow.  */
       if (BE (SIZE_MAX / sizeof (re_dfastate_t *) <= mctx.input.bufs_len, 0))
diff --git a/m4/regex.m4 b/m4/regex.m4
index aef53d2..7150117 100644
--- a/m4/regex.m4
+++ b/m4/regex.m4
@@ -15,6 +15,7 @@ AC_PREREQ([2.50])
 AC_DEFUN([gl_REGEX],
 [
   AC_CHECK_HEADERS_ONCE([locale.h])
+  m4_divert_text([DEFAULTS], [ac_use_included_regex=])
 
   AC_ARG_WITH([included-regex],
     [AS_HELP_STRING([--without-included-regex],
@@ -22,10 +23,12 @@ AC_DEFUN([gl_REGEX],
                     with recent-enough versions of the GNU C Library
                     (use with caution on other systems).])])
 
-  case $with_included_regex in #(
-  yes|no) ac_use_included_regex=$with_included_regex
+  case $with_included_regex:$ac_use_included_regex in #(
+  yes:yes|no:yes|:yes)
        ;;
-  '')
+  yes:|no:) ac_use_included_regex=$with_included_regex
+       ;;
+  :)
     # If the system regex support is good enough that it passes the
     # following run test, then default to *not* using the included regex.c.
     # If cross compiling, assume the test would fail and use the included
diff --git a/m4/regexdfa.m4 b/m4/regexdfa.m4
new file mode 100644
index 0000000..5dd43e6
--- /dev/null
+++ b/m4/regexdfa.m4
@@ -0,0 +1,17 @@
+# serial 54
+
+# Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2003, 2004, 2005,
+# 2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+#
+# This file is free software; the Free Software Foundation
+# gives unlimited permission to copy and/or distribute it,
+# with or without modifications, as long as this notice is preserved.
+
+AC_PREREQ([2.50])
+
+AC_DEFUN([gl_REGEXDFA],
+[
+  m4_divert_text([INIT_PREPARE], [ac_use_included_regex=yes])
+  AC_DEFINE([HAVE_REGEX_DFA], [1],
+    [Define if you want to include re_search_dfa.])
+])
diff --git a/modules/regexdfa b/modules/regexdfa
new file mode 100644
index 0000000..ba9bf86
--- /dev/null
+++ b/modules/regexdfa
@@ -0,0 +1,23 @@
+Description:
+Regular expression matching.
+
+Files:
+lib/regdfa.c
+m4/regexdfa.m4
+
+Depends-on:
+regex
+
+configure.ac:
+gl_REGEXDFA
+
+Makefile.am:
+
+Include:
+"regex.h"
+
+License:
+LGPLv2+
+
+Maintainer:
+Paolo Bonzini
-- 
1.6.5.2





reply via email to

[Prev in Thread] Current Thread [Next in Thread]