From: Markus Mützel
Subject: [Octave-bug-tracker] [bug #49222] octave-io 2.4.3: xls2oct with "OCT" interface lost the ability to read german umlauts or °
Date: Sat, 17 Dec 2016 20:33:18 -0000
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0

Follow-up Comment #19, bug #49222 (project octave):

Philip,

I am no longer sure whether this was the right thing to do.
Judging from the link [1] you provided in bug #49348, I guess that UTF-8 is
Octave's "default encoding" (if it can be considered to have a default
encoding).
That would mean we shouldn't convert to "Unicode". In fact, I guess the
current approach only worked because Qt seems to interpret invalid UTF-8 as
ISO-8859-1 (latin-1). At least this is what I guess after reading [2]. This
also explains why the command window would display fine after switching to
codepage 1252 (which basically is the same as ISO-8859-1 for its printable
characters).
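
To illustrate the point about invalid UTF-8 and latin-1, here is a minimal sketch in Python (not Octave or Qt code; the helper name is_valid_utf8 is made up for this example). It only shows that the bytes a latin-1 writer produces for "ü°" are not valid UTF-8, and that ISO-8859-1 and codepage 1252 agree on the byte range 0xA0 to 0xFF:

```python
# Bytes for "ü°" as an ISO-8859-1 (latin-1) writer would store them.
raw = "ü°".encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The latin-1 bytes are not a valid UTF-8 sequence.
valid = is_valid_utf8(raw)

# Both single-byte encodings map these bytes to the same characters.
as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 and codepage 1252 agree on every byte from 0xA0 to 0xFF.
same_upper_range = all(
    bytes([b]).decode("latin-1") == bytes([b]).decode("cp1252")
    for b in range(0xA0, 0x100)
)
```

A display layer that falls back to latin-1 for invalid UTF-8 would therefore show such strings correctly by accident, which would match the behaviour described above.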

This seems to leave us with 3 possibilities for reading strings from the
file:
1- Read the byte stream from the xls and treat it directly as UTF-8.
2- Read the byte stream from the xls, check whether it is valid UTF-8 and
strip invalid sequences.
3- Read the encoding and byte stream from the xls, convert it from that
encoding to UTF-8.
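
The three possibilities can be sketched as follows. This is an illustration in Python only, not the io package code; the function names are hypothetical, and it assumes the string data is available as a raw byte sequence and, for option 3, that the source encoding is known:

```python
def option1_treat_as_utf8(raw: bytes) -> bytes:
    """Option 1: pass the bytes through unchanged and assume they are UTF-8."""
    return raw


def option2_strip_invalid(raw: bytes) -> bytes:
    """Option 2: keep only the parts of the stream that are valid UTF-8."""
    # errors="ignore" silently drops bytes that do not form valid sequences.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


def option3_convert(raw: bytes, encoding: str) -> bytes:
    """Option 3: decode from the declared source encoding, re-encode as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


utf8_sample = "Grüße 20°".encode("utf-8")
latin1_sample = "ü".encode("latin-1")

unchanged = option1_treat_as_utf8(utf8_sample)
stripped = option2_strip_invalid(b"a\xfcb")
converted = option3_convert(latin1_sample, "latin-1")
```

Option 1 costs nothing but trusts the file; option 2 loses characters silently when the file is not UTF-8; option 3 is the only one that preserves them, at the price of having to know the source encoding.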

After that, it might be possible to convert the resulting strings to whatever
codepage the user has set.
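
A rough sketch of that optional last step, again in Python and with a hypothetical function name. It assumes the target codepage is known by name, and it replaces characters the codepage cannot represent instead of failing:

```python
def utf8_to_codepage(utf8_bytes: bytes, codepage: str) -> bytes:
    """Hypothetical helper: re-encode UTF-8 bytes for a given codepage."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" substitutes "?" for characters missing in the codepage.
    return text.encode(codepage, errors="replace")


umlaut_cp1252 = utf8_to_codepage("ü°".encode("utf-8"), "cp1252")
lambda_cp1252 = utf8_to_codepage("λ".encode("utf-8"), "cp1252")
```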

I would like to vote for option 1, until a user stumbles over a file that
isn't encoded in UTF-8 and complains. I would also prefer to always use UTF-8
in Octave and consider it a bug if that does not display right. It looks like
Qt5 does a much better job of supporting that encoding even in Windows.

The same holds for writing to files: Octave strings should always be encoded
in UTF-8. Thus, it should be possible to directly write them to the file. But
it might be reasonable to check for invalid sequences to avoid problems like
bug #46855.
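
A check of that kind before writing could look like the sketch below. It is Python, not the actual write path of the io package, and the function name is an assumption made for the example:

```python
def check_utf8_before_write(data: bytes) -> bytes:
    """Hypothetical guard: return data unchanged if it is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        raise ValueError(f"invalid UTF-8 at byte offset {err.start}") from err
    return data


ok = check_utf8_before_write("Grüße".encode("utf-8"))

try:
    check_utf8_before_write(b"ok\xfc")
    rejected = False
except ValueError:
    rejected = True
```

Whether invalid input should raise an error, be stripped, or be written as-is with a warning is a design decision this sketch does not settle.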

Sorry for providing you with wrong patches.

What do you think would be the best thing to do?

[1]
http://wiki.octave.org/International_Characters_Support#The_state_of_Octave
[2] https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?49222>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



