Re: utf8 vs. latin1
From: Chris Sawer
Subject: Re: utf8 vs. latin1
Date: Sat, 1 Jan 2005 19:55:32 +0000
On 1 Jan 2005, at 15:22, Han-Wen Nienhuys wrote:

> I thought that you had once told me that Latin1 is a subset of
> UTF-8.
This is not correct. One has to be careful to distinguish between a
character set (ASCII / Latin1 / Unicode) and a mapping (encoding) used
to represent text written using a particular character set in a binary
file.
ASCII (128 characters) and Latin1 (ASCII + a further 128 characters)
are easily represented in binary files with eight bits to a byte, as
each character is simply represented by one byte. However, Unicode has
over 90 000 characters, and there are a number of different mappings
used to represent the characters in binary files.
UTF-8 is a variable-length encoding for Unicode, using between one and
four bytes to represent each character. There's a very nice introduction
on Wikipedia:
http://en.wikipedia.org/wiki/UTF8
In short, the first 128 Unicode characters (which coincide with the
ASCII character set) are represented using one byte, of which the first
bit is 0. You could therefore say that ASCII is a subset of UTF-8.
The next 1920 characters are encoded using two bytes, the rules for
which are given on the above page.
Unicode characters 128-255 are the same as in Latin1; however, when
encoded in UTF-8, they take two bytes each, following the rules stated
in the above link.
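To make this concrete, here is a short Python 3 sketch (my own illustration, not part of the original exchange) showing that characters in the Latin1 range above 127 take one byte in Latin1 but two in UTF-8, while plain ASCII takes one byte in both:

```python
# Characters U+0080..U+00FF: one byte in Latin-1, two bytes in UTF-8.
# Plain ASCII characters (U+0000..U+007F) take one byte in both.
for ch in ("e", "é", "ÿ"):
    print(ch, len(ch.encode("latin-1")), len(ch.encode("utf-8")))
```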
> However, when I save a file as Latin1 and UTF8 under emacs,
> then the results differ, and latin1 chars are also saved as double
> bytes. Am I missing something?
No, this is expected behaviour. For example:
e = ASCII / Latin1 character 0x65 (101 decimal), Unicode character U+0065
  = 01100101 in standard ASCII or Latin1
  = 01100101 in UTF-8

However:

é = Latin1 character 0xE9 (233 decimal), Unicode character U+00E9
  = 11101001 in Latin1
  = 11000011 10101001 in UTF-8
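The bit patterns above can be checked directly; this is a small Python 3 sketch of my own, not part of the original mail:

```python
# 'é' is U+00E9: one byte (0xE9) in Latin-1, two bytes (0xC3 0xA9) in UTF-8.
ch = "é"
print(format(ch.encode("latin-1")[0], "08b"))                  # 11101001
print(" ".join(format(b, "08b") for b in ch.encode("utf-8")))  # 11000011 10101001
```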
This is why you are getting different results. However, all modern text
editors should be able to cope with UTF-8, so the above details are
hidden from the user. It is a widely used standard, and is the default
encoding for XML documents.
We use UTF-8 internally to store all of the information about the
pieces on Mutopia. It allows us to very easily store text which could
include characters from any character set in the world. For example, we
recently received a contribution from Matevž Jekovec (the final z has a
small v, a caron, on top) - his name is correctly recorded on the
website, but at
present he is unable to put the correct accent on his name in the
footer using LilyPond.
[For anyone who's interested, the math behind the multi-character
encoding is quite interesting. See the above page for details.]
> Did you mean that Latin1 is a subset of Unicode?
This statement is indeed correct.
> Or should we be using a different unicode->bytes layout scheme?
I should have thought that UTF-8 is the ideal choice for LilyPond input
files, as it allows the whole Unicode character set to be used, while
retaining compatibility with ASCII for ease of transition.
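For anyone migrating existing Latin1 input files to UTF-8, a one-off conversion might look like the following Python 3 sketch (my own illustration; the filename is a placeholder, and the snippet creates its own sample file so it is self-contained):

```python
import os
import tempfile

# Simulate an existing Latin-1 encoded input file (placeholder name).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "song.ly")
with open(src, "w", encoding="latin-1") as f:
    f.write("café")

# Read it as Latin-1 and re-save it as UTF-8.
with open(src, encoding="latin-1") as f:
    text = f.read()
dst = os.path.join(workdir, "song-utf8.ly")
with open(dst, "w", encoding="utf-8") as f:
    f.write(text)

print(open(dst, "rb").read())  # b'caf\xc3\xa9'
```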
Chris
--
Chris Sawer - address@hidden - Mutopia team leader
Free sheet music for all at: http://www.MutopiaProject.org/