guile-devel

Re: [PATCH 1/3] Make string-length documentation more correct


From: tomas
Subject: Re: [PATCH 1/3] Make string-length documentation more correct
Date: Wed, 26 Jun 2024 14:07:49 +0200

On Wed, Jun 26, 2024 at 01:46:28PM +0200, Maxime Devos wrote:
> 
> >>  >-Returns the number of characters in the given @var{string}.
> >> +Returns the number of bytes in the given @var{string}.
> >>  
> >> This is false. For example, (string-length "😀") is 1, whereas in all 
> >> encodings I know of it is more than one byte. Also, R5RS says: [...]
> >
> >Maybe `the number of codepoints` will work here.
> >
> >(string-length "👨‍🏭") ;; => 3
> >(string-length "é") ;; => 2
> >
> >The number of characters here is 1 in both cases.
> 
> No, in Unicode (and Guile equates character=Unicode character) all characters 
> correspond to a single codepoint.

It's more subtle than that: Unicode knows about "combining characters",
so it's quite possible that Andrew's "é" consists of two code points
(FWIW, it arrives to me as just one, but perhaps there was some
canonicalization [1] step in between).
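
A minimal sketch of the difference, assuming a Guile that provides
string-normalize-nfc and string-normalize-nfd (the variable names are
made up for illustration; the escapes avoid depending on how the mail
gets re-encoded on the way):

```scheme
;; "é" spelled as two code points: U+0065 (e) followed by
;; U+0301 (COMBINING ACUTE ACCENT).
(define decomposed "e\u0301")

;; The same text after canonical composition (NFC): U+00E9 alone.
(define precomposed (string-normalize-nfc decomposed))

;; A ZWJ sequence: U+1F468 (man), U+200D (ZERO WIDTH JOINER),
;; U+1F3ED (factory).
(define worker "\U01F468\u200D\U01F3ED")

(display (string-length decomposed))  (newline)  ; 2
(display (string-length precomposed)) (newline)  ; 1
(display (string-length worker))      (newline)  ; 3

;; Decomposing again (NFD) brings back the two-code-point form.
(display (string-length (string-normalize-nfd precomposed)))
(newline)                                        ; 2
```

So string-length counts code points in both spellings; which number
you see depends on the normalization form the string happens to be in.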

ISTR that "Unicode character" is actually synonymous with "Unicode
code point" -- but the common meaning of "character" is more fuzzy. Perhaps
it's wise to avoid that word when trying to be precise.

Cheers

[1] https://en.wikipedia.org/wiki/Unicode_normalization

-- 
t
