Re: CVS and unicode
From: Christian Hujer
Subject: Re: CVS and unicode
Date: Wed, 7 Sep 2005 22:58:37 +0200
User-agent: KMail/1.7.1
On Tuesday, 6 September 2005 01:17, Yves Dorfsman wrote:
> Hi,
>
> Has anybody run into problem with GNU CVS and unicode ?
>
> I have made a few tests (with UTF8) and so far it worked, but some of my
> users are saying they did run into problem with some files. I can see how
> some legal UTF8 characters could be confused as control code/binary.
>
> Does anybody have extensive experience with this ?
Yes.
Encoding problems originate on the operating system / editor side.
CVS itself does not care about the encoding of a file.
Somebody wrote:
>> In CVS a Unicode file has to be a Binary file (-kb) - which prevents
>> merging, diffs, etc etc. If you do not define it as -kb then eventually
>> the file will be corrupted.
This is completely wrong and lacks any technical substance.
First of all, Unicode itself is not a file encoding; the file encoding is
UTF-8 or UTF-16.
Now to the core. UTF-8 files needn't be binary files. In fact, if you want
the normal CVS behaviour you're used to from ASCII text files, they mustn't
be binary files. The byte sequence of strings like "$Revision$" is identical
in UTF-8 encoded Unicode and in plain 7-bit US-ASCII.
In fact, every 7-bit US-ASCII encoded file is a valid UTF-8 encoded Unicode
file, and conversely, if you only use the first 128 Unicode code points, your
UTF-8 encoded Unicode text is valid ASCII. Even more, the texts are 100%
identical down to the last bit.
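The claim above is easy to check. The following is only a small sketch in Python (not anything CVS does itself); the variable names are made up for illustration.

```python
# Sketch: ASCII-only text has exactly the same bytes in US-ASCII and UTF-8.
keyword = "$Revision$"

ascii_bytes = keyword.encode("ascii")
utf8_bytes = keyword.encode("utf-8")

# The CVS keyword is byte-for-byte identical in both encodings.
same_keyword_bytes = ascii_bytes == utf8_bytes

# The same holds for each of the first 128 Unicode code points:
# UTF-8 encodes code point n (0..127) as the single byte n.
all_ascii_identical = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128)
)

print(same_keyword_bytes)   # True
print(all_ascii_identical)  # True
```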
Differences occur with extended encodings like ISO-8859-x (e.g. ISO-8859-1 or
ISO-8859-15) or Windows CP-* (e.g. Windows CP-1252). In these encodings, the
128 ASCII code points are extended by 128 additional code points with the
high bit set. In UTF-8, a byte with the high bit set is always part of a
multibyte character.
For instance, the lower-case u umlaut (as occurring in German, Turkish and
some more languages) has Unicode code point 252. The ISO-8859-1 code point is
252 as well. But the byte sequences in UTF-8 and ISO-8859-1 are different.
In ISO-8859-1, the byte sequence is 0xFC, while in UTF-8 the byte sequence for
the same symbolic character is 0xC3 0xBC.
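To make the difference concrete, here is a short Python sketch that encodes the same character both ways and shows what happens when the bytes are read with the wrong encoding. It only illustrates the byte values mentioned above; the variable names are invented for the example.

```python
# Sketch: one character, two encodings, two different byte sequences.
u_umlaut = "\u00fc"  # lower-case u umlaut, Unicode code point 252

latin1_bytes = u_umlaut.encode("iso-8859-1")
utf8_bytes = u_umlaut.encode("utf-8")

print(ord(u_umlaut))  # 252
print(latin1_bytes)   # b'\xfc'
print(utf8_bytes)     # b'\xc3\xbc'

# Every byte of a UTF-8 multibyte sequence has the high bit (0x80) set.
high_bits_set = all(b & 0x80 for b in utf8_bytes)

# Reading the ISO-8859-1 byte as UTF-8 fails: 0xFC alone is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False

# Reading the UTF-8 bytes as ISO-8859-1 "works", but yields two characters
# instead of one. This is the garbage an editor shows with the wrong setting.
mojibake = utf8_bytes.decode("iso-8859-1")
```

In both cases the bytes in the repository are unchanged; only the interpretation differs.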
The issue is not CVS, the issue is telling your editor about the correct file
encoding. It's the text editor and how it interprets byte sequences.
On UNIX, most editors determine the default encoding from the locale
environment settings, which can be printed with the locale command. Refer to
your UNIX system manual for more information. Most UNIX-like systems let you
change this setting with an environment variable, for example
LC_ALL=de_DE.UTF-8 (warning: LC_ALL overrides all other locale settings) to
set the locale to German for Germany using UTF-8 encoding. Be warned, only
newly started processes (especially terminals!) will pick this up, so if you
want to use it permanently, put it somewhere in your .profile or .bashrc.
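As a rough illustration of why LC_ALL "overrides all", the following Python sketch mimics the usual POSIX lookup order for the character encoding (LC_ALL, then LC_CTYPE, then LANG). The function names are hypothetical helpers, not part of any library, and a real program would normally ask the C library instead of parsing the variables itself.

```python
def effective_ctype_locale(env):
    """Return the locale name that governs character encoding.

    Assumed POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides
    LANG. Empty values are treated as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # POSIX default locale when nothing is set


def codeset_of(locale_name):
    """Extract the codeset, e.g. 'de_DE.UTF-8' -> 'UTF-8' (None if absent)."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return None
    return rest.partition("@")[0]  # drop a modifier such as '@euro'


# Example environment: LANG says ISO-8859-1, but LC_ALL wins.
env = {"LANG": "en_US.ISO-8859-1", "LC_ALL": "de_DE.UTF-8"}
print(effective_ctype_locale(env))              # de_DE.UTF-8
print(codeset_of(effective_ctype_locale(env)))  # UTF-8
```

Passing os.environ instead of the example dictionary shows what a newly started process on the current system would see.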
The Unicode thingy in CVSNT is just a hack to work around operating system
issues regarding MS Windows.
I've been using UTF-8 in tons of files (all my Java sources are UTF-8 encoded,
as well as most of my C++ sources and, of course, all my XML files) for years
now, without any problems.
UTF-16, in fact, can be problematic. Normal keyword substitution is likely to
fail, at least with some older versions of CVS. I don't know whether newer CVS
uses wchar_t instead of char for keyword substitution. UTF-16 isn't in
widespread use, so I haven't cared about that yet.
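Why UTF-16 is likely to break keyword substitution can be sketched in Python. The assumption here (not verified against the CVS sources) is that the keyword search works on plain bytes, i.e. it looks for the ASCII byte sequence of "$Revision$". The sample file content is made up.

```python
# Sketch: a byte-oriented search for "$Revision$" in UTF-8 vs UTF-16 data.
keyword = "$Revision$"
needle = keyword.encode("ascii")  # what a char-based tool would look for

sample_text = "// $Revision$ \u00fc\n"  # hypothetical file content

utf8_file = sample_text.encode("utf-8")
utf16_file = sample_text.encode("utf-16-le")

# In UTF-8 the keyword bytes appear unchanged, so the search succeeds.
found_in_utf8 = needle in utf8_file

# In UTF-16 every ASCII character is followed by a 0x00 byte, so the
# contiguous ASCII byte sequence never occurs and the search fails.
found_in_utf16 = needle in utf16_file

print(found_in_utf8)   # True
print(found_in_utf16)  # False
```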
Bye
--
Christian Hujer
ITCQIS GmbH
E-Mail: address@hidden
WWW: http://www.itcqis.com/