[Chicken-users] problems string-trimming on UTF8

From: Kristian Lein-Mathisen
Subject: [Chicken-users] problems string-trimming on UTF8
Date: Fri, 27 Jan 2017 14:36:55 +0100

Dear CHICKEN mailing list,

I encountered a strange issue with string-trim-right on a UTF-8 string:

$ csi -R srfi-13 -p '(string-trim "Zazà")'
Zazà

So far so good!

$ csi -R srfi-13 -p '(string-trim-right "Zazà")'
Zaz�

Oh no, what happened?

$ csi -R utf8 -R srfi-13 -p '(string-trim-right "Zazà")'
Zaz�

Loading utf8 doesn't seem to help! But utf8 does, at least, get the string-length right:

$ csi -R srfi-13 -p '(string-length "Zazà")'
5
$ csi -R utf8 -R srfi-13 -p '(string-length "Zazà")'
4

It took me a while to figure out what was going on. These are the bytes of Zazà:

$ printf 'Zazà' | xxd
00000000: 5a61 7ac3 a0 Zaz..

So it seems that string-trim-right just looks at the last byte, \xa0, which on its own is a non-breaking space (U+00A0), and drops it. That leaves a dangling \xc3, which is no longer valid UTF-8. It should be looking at the last UTF-8 code point (\xc3\xa0, i.e. à) instead.
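
The suspected mechanism can be sketched outside CHICKEN. The following Python is only an illustration and is not taken from the srfi-13 or utf8 egg source: the names trim_right_bytes, trim_right_codepoints and LATIN1_WHITESPACE are made up here, and the whitespace set is an assumption (ASCII whitespace plus the Latin-1 non-breaking space).

```python
# Hypothetical whitespace set: ASCII whitespace plus 0xA0 (non-breaking space).
# This is an assumption for illustration, not a copy of any library's table.
LATIN1_WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0xA0}


def trim_right_bytes(data: bytes) -> bytes:
    """Drop trailing bytes that look like whitespace, one byte at a time."""
    end = len(data)
    while end > 0 and data[end - 1] in LATIN1_WHITESPACE:
        end -= 1
    return data[:end]


def trim_right_codepoints(data: bytes) -> bytes:
    """Decode UTF-8 first, then drop trailing whitespace code points."""
    text = data.decode("utf-8")
    end = len(text)
    while end > 0 and ord(text[end - 1]) in LATIN1_WHITESPACE:
        end -= 1
    return text[:end].encode("utf-8")


zaza = "Zaz\u00e0".encode("utf-8")  # the bytes 5a 61 7a c3 a0

# Byte count versus code point count, matching the two string-length results.
print(len(zaza), len(zaza.decode("utf-8")))

# Byte-wise trimming removes the trailing a0 and leaves a lone c3 lead byte.
byte_trimmed = trim_right_bytes(zaza)
print(byte_trimmed)

# Code-point-wise trimming sees U+00E0 as the last character and keeps it.
codepoint_trimmed = trim_right_codepoints(zaza)
print(codepoint_trimmed)
```

Under these assumptions the byte-wise version reproduces the symptom: the result ends in a lone \xc3, which a terminal shows as a replacement character, while the code-point-wise version returns the string unchanged.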

I don't know if this is a known bug or if I've come across something undiscovered. I suppose the fix belongs in the utf8 egg.

Thanks!
