coreutils

Re: 'wc -m' and combining characters


From: enh
Subject: Re: 'wc -m' and combining characters
Date: Mon, 11 Mar 2024 13:30:30 -0700

not particularly? define "character"...

i think Apple/Swift is the only major proponent of "character ==
extended grapheme cluster"
[https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries]?

plan 9's wc used `-r` (for "rune") to make it clearer that it was
counting code points. coreutils (and every other wc, including
Apple's) has the same behavior, but calls it `-m` (for "multibyte").
plan 9 also had `-b` (for "broken", that is: invalid byte sequences),
which to this day i think was a troll (because it's _not_ "b for
bytes", that's "c", except that's not "c for characters", it's "c for
`char`s").
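
to make the bytes-versus-code-points distinction concrete, here is a minimal sketch in Python (not how any wc is implemented). the helper names are made up for illustration, and it assumes UTF-8 text, which is what `-c` and `-m` would see in a UTF-8 locale:

```python
# Illustrative helpers only; the names are hypothetical, not part of any wc.

def count_bytes(text: str) -> int:
    """Number of bytes in the UTF-8 encoding (roughly what `wc -c` reports)."""
    return len(text.encode("utf-8"))


def count_code_points(text: str) -> int:
    """Number of Unicode code points (roughly what `wc -m` reports)."""
    return len(text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as one glyph.
sample = "e\u0301"

print(count_bytes(sample))        # 'e' is 1 byte, U+0301 is 2 bytes in UTF-8
print(count_code_points(sample))  # two code points, though a reader sees one "character"
```

neither number matches what a reader would call one character, which is the mismatch the original question is about.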

while there _might_ be an argument for adding `-e` (for "extended
grapheme cluster"), i think you'd want to leave `-m` alone for
compatibility, and your `-e` would probably have people asking how
exactly coreutils lets you deal with https://unicode.org/reports/tr15/
and conversions between different forms? :-)
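
a rough sketch of why normalization comes up as soon as you count anything other than code points. this uses Python's standard `unicodedata` module; `approx_clusters` is a hypothetical helper that only skips combining marks, which is an approximation and not the full TR29 extended grapheme cluster algorithm:

```python
import unicodedata


def approx_clusters(text: str) -> int:
    """Count code points that are not combining marks.

    Assumption: this is only a crude stand-in for extended grapheme
    clusters; it ignores the other TR29 rules (ZWJ sequences, Hangul
    jamo, regional indicators, and so on).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"                               # e + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# The code point count depends on the normalization form...
print(len(decomposed), len(composed))

# ...while the approximate cluster count is the same for both.
print(approx_clusters(decomposed), approx_clusters(composed))
```

so a code point count changes when the same text is converted between NFC and NFD, while a cluster-style count would not.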

On Mon, Mar 11, 2024 at 12:24 PM Nick <gnu-foo@acrasis.net> wrote:
>
> On 2024-03-11 14:33 PYST, Pádraig Brady wrote:
> > On 10/03/2024 15:16, Nick wrote:
> > >  Markus Kuhn's FAQ says "A combining character is not a full
> > > character by itself" but wc is saying that it is, no?
>
> > It's a fair point. LibreOffice, for example, will count it as one
> > character.
>
> Thank you.  Is wc's behaviour here not considered a bug?
> --
> Nick
> Asunción 16:18 PYST ►  40°C  ◆  some clouds  ◆  7Km/h NE  ◆  37% RH
>


