bug-gnu-utils

Bug in [...]* matching with acute-u


From: Jorge Stolfi
Subject: Bug in [...]* matching with acute-u
Date: Sat, 27 Jan 2001 06:46:11 -0200 (EDT)

Hi,

I think I have run into a bug in gawk's handling of REs of the
form [...]* when the bracketed list includes certain 8-bit characters,
specifically u-acute (octal \372).

The problem occurs in GNU Awk 3.0.4, both under
Linux 2.2.14-5.0 (Intel i686) and SunOS 5.5 (Sun SPARC).

Here is a program that illustrates the bug, and its output.
Since the input string contains no u-acute at all, the first two
lines of the output should be equal, shouldn't they?

----------------------------------------------------------------------
#! /usr/bin/gawk -f

BEGIN {
  s = "bananas and ananases in canaan";
  t = s; gsub(/[an]*n/, "AN", t);   printf "%-8s  %s\n", "[an]*n", t;
  t = s; gsub(/[anú]*n/, "AN", t);  printf "%-8s  %s\n", "[anú]*n", t;
  print "";
  t = s; gsub(/[aú]*n/, "AN", t);   printf "%-8s  %s\n", "[aú]*n", t;
  print "";
  t = s; gsub(/[an]n/, "AN", t);    printf "%-8s  %s\n", "[an]n", t;
  t = s; gsub(/[aú]n/, "AN", t);    printf "%-8s  %s\n", "[aú]n", t;
  t = s; gsub(/[anú]n/, "AN", t);   printf "%-8s  %s\n", "[anú]n", t;
  print "";
  t = s; gsub(/[an]?n/, "AN", t);   printf "%-8s  %s\n", "[an]?n", t;
  t = s; gsub(/[aú]?n/, "AN", t);   printf "%-8s  %s\n", "[aú]?n", t;
  t = s; gsub(/[anú]?n/, "AN", t);  printf "%-8s  %s\n", "[anú]?n", t;
  print "";
  t = s; gsub(/[an]+n/, "AN", t);   printf "%-8s  %s\n", "[an]+n", t;
  t = s; gsub(/[aú]+n/,  "AN", t);  printf "%-8s  %s\n", "[aú]+n", t;
  t = s; gsub(/[anú]+n/, "AN", t);  printf "%-8s  %s\n", "[anú]+n", t;
}
----------------------------------------------------------------------
[an]*n    bANas ANd ANases iAN cAN
[anú]*n   bananas and ananases in canaan

[aú]*n    bANANas ANd ANANases iAN cANAN

[an]n     bANANas ANd ANANases in cANaAN
[aú]n     bANANas ANd ANANases in cANaAN
[anú]n    bANANas ANd ANANases in cANaAN

[an]?n    bANANas ANd ANANases iAN cANaAN
[aú]?n    bANANas ANd ANANases iAN cANaAN
[anú]?n   bANANas ANd ANANases iAN cANaAN

[an]+n    bANas ANd ANases in cAN
[aú]+n    bANANas ANd ANANases in cANAN
[anú]+n   bananas and ananases in canaan
----------------------------------------------------------------------

Apparently the problem is specific to u-acute; I've tried several
other 8-bit characters and they seem to behave as expected.
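
A systematic way to repeat that check could look like the sketch below.
It is only an illustration, not the test that was actually run: the
function name scan, the use of LC_ALL=C (so that each code is treated as
a single byte) and the dynamic regex built with sprintf are all
assumptions. It adds each character from \200 to \377 to the list in
turn and reports those for which the result differs from plain [an]*n.

```shell
#!/bin/sh
# Sketch: try every 8-bit character in place of u-acute and report the ones
# for which [an<c>]*n no longer behaves like [an]*n on ASCII-only input.
# LC_ALL=C is an assumption: it makes awk treat each code as a single byte.
scan() {
  LC_ALL=C awk 'BEGIN {
    s = "bananas and ananases in canaan"
    ref = s; gsub(/[an]*n/, "AN", ref)        # expected result for every c
    bad = 0
    for (c = 128; c <= 255; c++) {
      re = "[an" sprintf("%c", c) "]*n"       # dynamic regex with byte c added
      t = s; gsub(re, "AN", t)
      if (t != ref) { bad++; printf "\\%o  %s\n", c, t }
    }
    printf "%d of 128 characters disagree\n", bad
  }'
}
scan
```

Since the input never contains any of the added characters, an awk
without the bug should report that 0 of 128 characters disagree.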

Comparing the second and third output lines suggests that the
problem involves backtracking out of a partial match of [...]* in
order to match the next sub-expression, when the latter begins with
one of the bracketed characters. The [...]+ lines show the same
pattern, while the plain [...] and [...]? lines are unaffected.
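
One way to probe that guess is sketched below. It compares the lists
with and without u-acute twice: once followed by "n", which is in the
list, and once followed by "s", which is not. The function name probe,
the LC_ALL=C setting and the "s" patterns are assumptions added for
illustration; u-acute is written as the octal escape \372 so that the
script does not depend on the encoding of the file.

```shell
#!/bin/sh
# Sketch: does the mismatch depend on whether the sub-expression after
# [...]* starts with a character from the bracketed list?
# LC_ALL=C is an assumption so that \372 is handled as a single byte.
probe() {
  LC_ALL=C awk 'BEGIN {
    s = "bananas and ananases in canaan"
    # Follower "n" is in the list, so the matcher must give back an "n".
    a = s; gsub(/[an]*n/, "AN", a);  b = s; gsub(/[an\372]*n/, "AN", b)
    # Follower "s" is not in the list, so no backing off is needed.
    c = s; gsub(/[an]*s/, "AS", c);  d = s; gsub(/[an\372]*s/, "AS", d)
    print "follower in list:     " (a == b ? "same" : "DIFFERENT")
    print "follower not in list: " (c == d ? "same" : "DIFFERENT")
  }'
}
probe
```

If the guess is right, an affected gawk would presumably print DIFFERENT
only on the first line; that has not been verified here. An awk without
the bug should print "same" on both lines.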


All the best,

--stolfi

------------------------------------------------------------------------
Jorge Stolfi | http://www.dcc.unicamp.br/~stolfi | address@hidden 
Institute of Computing (formerly DCC-IMECC)      | Wrk +55 (19)3788-5858
Universidade Estadual de Campinas (UNICAMP)      |     +55 (19)3788-5840
Av. Albert Einstein 1251 - Caixa Postal 6176     | Fax +55 (19)3788-5847
13083-970 Campinas, SP -- Brazil                 | Hom +55 (19)3287-4069        
         
------------------------------------------------------------------------


