Re: 'wc -m' and combining characters
From: enh
Subject: Re: 'wc -m' and combining characters
Date: Mon, 11 Mar 2024 13:30:30 -0700
not particularly? define "character"...
i think Apple/Swift is the only major proponent of "character ==
extended grapheme cluster"
(https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)?
plan 9's wc used `-r` (for "rune") to make it clearer that it was
counting code points. coreutils (and every other wc, including
Apple's) has the same behavior, but calls it `-m` (for "multibyte").
plan 9 also had `-b` (for "broken", that is: invalid byte sequences),
which to this day i think was a troll (because it's _not_ "b for
bytes", that's "c", except that's not "c for characters", it's "c for
`char`s").
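
to make the three notions of "count" concrete, here is a minimal sketch in
Python (not how any wc is implemented). the helper name `counts` is made up
for illustration, and the last figure is only a rough stand-in for grapheme
clusters: it skips code points with a nonzero canonical combining class
rather than applying the full UAX #29 segmentation rules.

```python
import unicodedata


def counts(text: str) -> dict:
    """Count a string three ways: bytes, code points, non-combining code points."""
    data = text.encode("utf-8")
    return {
        # number of UTF-8 bytes, the figure `wc -c` reports
        "bytes": len(data),
        # number of code points, the figure `wc -m` reports in a UTF-8 locale
        "code_points": len(text),
        # rough approximation only: ignore code points whose canonical
        # combining class is nonzero (this is NOT full UAX #29 segmentation)
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }


# 'e' followed by U+0301 COMBINING ACUTE ACCENT
sample = "e\u0301"
print(counts(sample))
```

for the sample string this gives 3 bytes, 2 code points, and 1 non-combining
code point, which is the gap the thread is about.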
while there _might_ be an argument for adding `-e` (for "extended
grapheme cluster"), i think you'd want to leave `-m` alone for
compatibility, and your `-e` would probably have people asking how
exactly coreutils lets you deal with https://unicode.org/reports/tr15/
and conversions between different forms? :-)
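
as a sketch of why normalization forms matter here (using Python's standard
`unicodedata` module, not anything coreutils provides): the same visible text
has a different code point count depending on whether it is in NFC or NFD, and
some combining sequences have no precomposed form at all, so normalizing does
not make the question go away. the variable names are illustrative only.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# NFC composes the pair into a single code point; NFD splits it again.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(len(decomposed), len(nfc))   # code points before and after NFC
print(len(precomposed), len(nfd))  # code points before and after NFD

# 'q' + combining acute has no precomposed code point, so NFC leaves
# it as two code points even though it renders as one glyph.
no_precomposed = unicodedata.normalize("NFC", "q\u0301")
print(len(no_precomposed))
```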
On Mon, Mar 11, 2024 at 12:24 PM Nick <gnu-foo@acrasis.net> wrote:
>
> On 2024-03-11 14:33 PYST, Pádraig Brady wrote:
> > On 10/03/2024 15:16, Nick wrote:
> > > Markus Kuhn's FAQ says "A combining character is not a full
> > > character by itself" but wc is saying that it is, no?
>
> > It's a fair point. Libre Office for example will count as one
> > character.
>
> Thank you. Is wc's behaviour here not considered a bug?
> --
> Nick
> Asunción 16:18 PYST ► 40°C ◆ some clouds ◆ 7 km/h NE ◆ 37% RH
>