Re: utf8 vs. latin1
From: Chris Sawer
Subject: Re: utf8 vs. latin1
Date: Sat, 1 Jan 2005 19:55:32 +0000
On 1 Jan 2005, at 15:22, Han-Wen Nienhuys wrote:

> I thought that you had once told me that Latin1 is a subset of
> UTF-8.
This is not correct. One has to be careful to distinguish between a
character set (ASCII / Latin1 / Unicode) and a mapping (encoding) used
to represent text written using a particular character set in a binary
file.
ASCII (128 characters) and Latin1 (ASCII + a further 128 characters)
are easily represented in binary files with eight bits to a byte, as
each character is simply represented by one byte. However, Unicode has
over 90 000 characters, and there are a number of different mappings
used to represent the characters in binary files.
UTF-8 is a variable-length encoding for Unicode, using between one and
four bytes to represent each character. There's a very nice introduction
on Wikipedia:
http://en.wikipedia.org/wiki/UTF8
In short, the first 128 Unicode characters (which coincide with the
ASCII character set) are represented using one byte, of which the first
bit is 0. You could therefore say that ASCII is a subset of UTF-8.
The next 1920 characters are encoded using two bytes, the rules for
which are given on the above page.
Unicode characters 128-255 are the same as in Latin1; however, when
encoded in UTF-8, they take two bytes each, following the rules stated
in the above link.
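To make this concrete, here is a short Python 3 sketch (my own illustration, not part of the original exchange) showing that characters in the Latin1 range above 127 take one byte in Latin1 but two in UTF-8, while plain ASCII takes one byte in both:

```python
# Characters U+0080..U+00FF: one byte in Latin-1, two bytes in UTF-8.
# Plain ASCII characters (U+0000..U+007F) take one byte in both.
for ch in ("e", "é", "ÿ"):
    print(ch, len(ch.encode("latin-1")), len(ch.encode("utf-8")))
```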
> However, when I save a file as Latin1 and UTF8 under emacs,
> then the results differ, and latin1 chars are also saved as double
> bytes. Am I missing something?
No, this is expected behaviour. For example:
e = ASCII / Latin1 character 0x65 (101 decimal), Unicode character U+0065
  = 01100101 in standard ASCII or Latin1
  = 01100101 in UTF-8

However:

é = Latin1 character 0xE9 (233 decimal), Unicode character U+00E9
  = 11101001 in Latin1
  = 11000011 10101001 in UTF-8
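The bit patterns above can be checked directly; this is a small Python 3 sketch of my own, not part of the original mail:

```python
# 'é' is U+00E9: one byte (0xE9) in Latin-1, two bytes (0xC3 0xA9) in UTF-8.
ch = "é"
print(format(ch.encode("latin-1")[0], "08b"))                  # 11101001
print(" ".join(format(b, "08b") for b in ch.encode("utf-8")))  # 11000011 10101001
```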
This is why you are getting different results. However, all modern text
editors should be able to cope with UTF-8, so the above details are
hidden from the user. It is a widely used standard, and is the default
encoding for XML documents.
We use UTF-8 internally to store all of the information about the
pieces on Mutopia. It allows us to very easily store text which could
include characters from any character set in the world. For example, we
recently received a contribution from Matevž Jekovec (the final z has a
small v, a caron, on top) - his name is correctly recorded on the
website, but at
present he is unable to put the correct accent on his name in the
footer using LilyPond.
[For anyone who's interested, the math behind the multi-character
encoding is quite interesting. See the above page for details.]
> Did you mean that Latin1 is a subset of Unicode?
This statement is indeed correct.
> Or should we be using a different unicode->bytes layout scheme?
I should have thought that UTF-8 is the ideal choice for LilyPond input
files, as it allows the whole Unicode character set to be used, while
retaining compatibility with ASCII for ease of transition.
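For anyone migrating existing Latin1 input files to UTF-8, a one-off conversion might look like the following Python 3 sketch (my own illustration; the filename is a placeholder, and the snippet creates its own sample file so it is self-contained):

```python
import os
import tempfile

# Simulate an existing Latin-1 encoded input file (placeholder name).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "song.ly")
with open(src, "w", encoding="latin-1") as f:
    f.write("café")

# Read it as Latin-1 and re-save it as UTF-8.
with open(src, encoding="latin-1") as f:
    text = f.read()
dst = os.path.join(workdir, "song-utf8.ly")
with open(dst, "w", encoding="utf-8") as f:
    f.write(text)

print(open(dst, "rb").read())  # b'caf\xc3\xa9'
```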
Chris
--
Chris Sawer - address@hidden - Mutopia team leader
Free sheet music for all at: http://www.MutopiaProject.org/