bug#27681: grep: Combining Mark-Nonspacing are classified as [:punct:]
From: Santiago
Subject: bug#27681: grep: Combining Mark-Nonspacing are classified as [:punct:]
Date: Thu, 13 Jul 2017 15:21:40 +0200
Hi,
I would like to forward the issue below, reported by Panu Kalliokoski
in 2012 (better late than never!). I think the correct category is
Mark-nonspacing (Mn), but I am not very familiar with Unicode.
It still occurs in grep 3.1. Here is an example using U+0301 (combining acute accent):
$ echo árbol | grep -o '[[:alpha:]]*'
a
rbol
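
The Unicode general category of each character can be checked independently of grep. Below is a minimal sketch using Python's standard unicodedata module. It assumes the input is in decomposed form ('a' followed by U+0301), and the helper name describe is only illustrative:

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Assumption: "árbol" typed in decomposed form, i.e. 'a' + U+0301.
word = "a\u0301rbol"
for codepoint, category, name in describe(word):
    print(codepoint, category, name)
```

The second row printed is the combining accent, reported as category Mn (Mark, nonspacing) rather than a letter category. This says nothing about how grep or the C library maps Mn to POSIX classes; it only confirms the Unicode side.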
Cheers,
-- Santiago
On Mon, 05 Mar 2012 13:08:43 +0200 "Panu A. Kalliokoski" <address@hidden> wrote:
> Package: grep
> Version: 2.6.3-3
> Severity: normal
>
>
> It seems that grep misclassifies combining letters (unicode class Lm) as
> punctuation, when they should be letters. For instance:
>
> $ echo d̪ʌ̀lì | grep -o '[[:alpha:]]*'
> d
> ʌ
> li
>
> As a consequence, combining accents are not seen as "word-constituent":
>
> $ echo d̪ʌ̀lì | grep -o '\w*'
> d
> ʌ
> li
>
> This causes also false positives on word-boundary conditions, such as
> the below:
>
> $ echo d̪ʌ̀lì | grep -w ʌ
> d̪ʌ̀lì
>
> I suggest that combining letters should be part of [:alpha:] instead of
> [:punct:].