[bug #37600] grep -w cuts words on non-ascii

bug-grep

From:	Flammie Pirinen
Subject:	[bug #37600] grep -w cuts words on non-ascii
Date:	Fri, 19 Oct 2012 02:10:23 +0000
User-agent:	Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.57 Safari/537.1

URL:
  <http://savannah.gnu.org/bugs/?37600>

                 Summary: grep -w cuts words on non-ascii
                 Project: grep
            Submitted by: flammie
            Submitted on: Fri 19 Oct 2012 02:10:23 AM GMT
                Category: None
                Severity: 3 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

It seems that grep -w does not treat non-ASCII letters as word constituents,
at least in the fi_FI.UTF-8 locale:

$ cat > test
xxx
xxxä
xxxx
$ grep -w xxx test 
xxx
xxxä

The system is stable Gentoo Linux on x86 with GNU glibc-2.14.1-r3 and the
following setup:

$ grep -V
grep (GNU grep) 2.12
$ locale
LANG=fi_FI.UTF-8
LC_CTYPE="fi_FI.UTF-8"
LC_NUMERIC="fi_FI.UTF-8"
LC_TIME="fi_FI.UTF-8"
LC_COLLATE="fi_FI.UTF-8"
LC_MONETARY="fi_FI.UTF-8"
LC_MESSAGES="fi_FI.UTF-8"
LC_PAPER="fi_FI.UTF-8"
LC_NAME="fi_FI.UTF-8"
LC_ADDRESS="fi_FI.UTF-8"
LC_TELEPHONE="fi_FI.UTF-8"
LC_MEASUREMENT="fi_FI.UTF-8"
LC_IDENTIFICATION="fi_FI.UTF-8"
LC_ALL=fi_FI.UTF-8


If this behaviour is intentional, the description of the -w switch in the
documentation should be clarified. Since grep matches ä against the [:alpha:]
class in my locale, I would expect from the following that ä is a "word
constituent character":


       -w, --word-regexp
              Select only those lines containing matches that form whole
              words.  The test is that the matching substring must either be
              at the beginning of the line, or preceded by a non-word
              constituent character.  Similarly, it must be either at the end
              of the line or followed by a non-word constituent character.
              Word-constituent characters are letters, digits, and the
              underscore.
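
To make the expectation concrete, here is a minimal Python sketch of the test
as the man page text quoted above describes it. It is not grep's
implementation. The names is_word_char and matches_whole_word are
hypothetical, the pattern is treated as a literal string rather than a regex,
and the ascii_only switch is only an assumption that happens to reproduce the
output shown in this report; the report itself does not establish the cause.

```python
def is_word_char(ch, ascii_only=False):
    """Return True if ch is a letter, digit, or underscore.

    With ascii_only=True, only ASCII letters and digits count (an assumed
    variant used here to mimic the reported output).
    """
    if ascii_only:
        return ch.isascii() and (ch.isalnum() or ch == "_")
    return ch.isalnum() or ch == "_"


def matches_whole_word(line, needle, ascii_only=False):
    """Apply the documented -w test to every occurrence of a literal needle."""
    start = line.find(needle)
    while start != -1:
        end = start + len(needle)
        # At the start of the line, or preceded by a non-word character.
        before_ok = start == 0 or not is_word_char(line[start - 1], ascii_only)
        # At the end of the line, or followed by a non-word character.
        after_ok = end == len(line) or not is_word_char(line[end], ascii_only)
        if before_ok and after_ok:
            return True
        # Try the next occurrence in the same line.
        start = line.find(needle, start + 1)
    return False


lines = ["xxx", "xxxä", "xxxx"]
unicode_aware = [l for l in lines if matches_whole_word(l, "xxx")]
ascii_only = [l for l in lines if matches_whole_word(l, "xxx", ascii_only=True)]
print(unicode_aware)  # ['xxx']
print(ascii_only)     # ['xxx', 'xxxä']
```

When ä counts as a letter, only the line "xxx" is selected, which is the
result I expected. When only ASCII letters count, "xxxä" is selected as well,
which matches what grep 2.12 printed above.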




    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?37600>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
