octave-maintainers

multibyte representation of character codes


From: sergey plotnikov
Subject: multibyte representation of character codes
Date: Fri, 13 Dec 2013 14:20:32 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.1

Hello all,

I've just bumped into some interesting and non-intuitive behaviour of several Octave commands when processing strings that contain Cyrillic characters.

* size commands (length(), size(), numel(), ...)
They do not count the characters in the string but the elements of the corresponding char code vector, so length("лыжи") = 8 = length(double("лыжи")): each of the four Cyrillic letters is stored as a two-byte char code in Octave (presumably UTF-8).
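
A minimal sketch of the difference between the two counts, assuming the string is stored as UTF-8 (the bytes are written out explicitly so the result does not depend on the terminal or file encoding). The character count relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx:

```octave
% UTF-8 bytes of "лыжи", written out explicitly (assumption: UTF-8 storage)
s = char([208 187 209 139 208 182 208 184]);

bytes   = double(s);        % one element per byte, not per character
n_bytes = numel(bytes);     % this is what length(s) and numel(s) report

% Count every byte that is NOT a continuation byte (10xxxxxx)
n_chars = sum(bitand(bytes, 192) ~= 128);

printf("%d bytes, %d characters\n", n_bytes, n_chars);
% prints: 8 bytes, 4 characters
```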

* string comparison
As far as I understand, this is just a consequence of the previous point: when comparing strings one has to think in terms of the length of the char code vector, not the number of characters:
strncmp("2 лЫжи", "2 лУжи",5)
ans = 1

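since n = 5 covers "2", the space, both bytes of "л" and only the first byte of the fourth letter, and the first bytes of the char codes of "Ы" and "У" are equal.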
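
One possible way around this is to translate a character count into a byte count before calling strncmp(). The following is only a sketch, assuming UTF-8 storage; the variable names are made up for the illustration and the bytes are spelled out so the example does not depend on the encoding of the terminal:

```octave
% UTF-8 bytes of "2 лЫжи" and "2 лУжи" (assumption: UTF-8 storage)
a = char([50 32 208 187 208 171 208 182 208 184]);
b = char([50 32 208 187 208 163 208 182 208 184]);

strncmp(a, b, 5)     % ans = 1: only the first 5 BYTES are compared

% Byte positions where a character starts (everything except 10xxxxxx bytes),
% plus a sentinel one past the end of the string
starts = [find(bitand(double(a), 192) ~= 128), numel(a) + 1];

n  = 5;                      % number of CHARACTERS to compare
nb = starts(n + 1) - 1;      % number of bytes those characters occupy

strncmp(a, b, nb)    % ans = 0: "2 лЫж" and "2 лУж" differ
```

This takes the byte boundaries from the first string only, which is sufficient here because a mismatch anywhere inside those bytes already makes the comparison fail.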

* use in regexp()
This one is a bit trickier: it is almost impossible to use regexp() with strings that contain characters with multibyte char codes.
Just an example:
regexp('лыжи','[ы]','match')
ans =
{
  [1,1] =
  [1,2] = �
}

where
char([ans{1} ans{2}])
ans = 'ы'

That would be fine if it held for all patterns of that type, but:
regexp('лыжи','[л]','match')
ans =
{
  [1,1] =
  [1,2] = �
  [1,3] =
  [1,4] =
}
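
Both results are consistent with the character class being treated as a set of single bytes rather than as one character. That is an assumption about what regexp() does internally, not something confirmed here, but emulating it on the raw bytes (UTF-8 assumed) reproduces the two and four matches seen above:

```octave
% UTF-8 bytes of "лыжи" (assumption: UTF-8 storage)
s = char([208 187 209 139 208 182 208 184]);

cls_y = [209 139];   % the two bytes of "ы"
cls_l = [208 187];   % the two bytes of "л"

% A byte-wise class [..] matches every single byte that is in the set
hits_y = find(ismember(double(s), cls_y))   % 3 4      -> 2 one-byte matches
hits_l = find(ismember(double(s), cls_l))   % 1 2 5 7  -> 4 one-byte matches

% Searching for the whole byte sequence finds the character itself
pos_y = strfind(s, char(cls_y))             % 3
```

The byte 208 is the lead byte of "л", "ж" and "и" alike, which would explain the two extra matches for [л].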




So this is definitely not what one gets in Matlab. It is not only about localization but mostly about the use of multibyte character codes, which Matlab does not have, since its range of char codes is 0:65535. In my opinion, Matlab's range is enough for exercises like the ones above to work well for most characters.
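
For reference, a small sketch of the arithmetic that turns the byte pairs into single char codes that fit into 0:65535. It assumes UTF-8 storage and that every character is exactly a two-byte sequence, which is true for "лыжи" but not for text in general:

```octave
% UTF-8 bytes of "лыжи" (assumption: UTF-8, two bytes per character)
b = [208 187 209 139 208 182 208 184];

lead = b(1:2:end);   % 110xxxxx bytes: carry the upper 5 bits
cont = b(2:2:end);   % 10xxxxxx bytes: carry the lower 6 bits

cp = bitand(lead, 31) * 64 + bitand(cont, 63)
% cp = 1083 1099 1078 1080, i.e. U+043B U+044B U+0436 U+0438
```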

So my question is: is this just a feature that one has to keep in mind when working with strings in Octave, or a bug that can be addressed at some point?

best regards,
sergey
