Re: CVS and unicode
From: Christian Hujer
Subject: Re: CVS and unicode
Date: Sat, 10 Sep 2005 12:51:55 +0200
User-agent: KMail/1.7.1
Hi,
On Thursday, 8 September 2005 00:39, Arthur Barrett wrote:
> Christian,
>
> >>> In CVS a Unicode file has to be a Binary file (-kb) - which prevents
> >>> merging, diffs, etc etc. If you do not define it as -kb then
> >>> eventually the file will be corrupted.
> >
> >This is completely wrong and lacks any technical substance.
>
> Firstly don't mistake me for any Unicode/UTF-8/UTF-16 guru - I was
> simply trying to answer the question in a helpful way.
>
> This time I'm just trying to clear up a couple of things about what the
> CVSNT for Linux/Unix/Windows (free / GPL) implementation of Unicode
> support can and can't do based on Christian's comments. I hope the
> information is helpful to those following the discussion.
Forgetting -kb in UTF-8 files will not result in problems.
There are exactly two issues regarding the -kb thing: line-based diff and
keyword substitution.
The keyword substitution will work 100% fine with UTF-8. Keywords are encoded
in UTF-8 just like ASCII. The byte sequence for $Id$ is the same in ASCII and
UTF-8.
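A minimal sketch of that point, in Python rather than anything CVS itself runs: the keyword encodes to the same bytes in both encodings, and the bytes of a multi-byte UTF-8 character are all 0x80 or higher, so they can never be mistaken for the `$` (0x24) that delimits a keyword. The sample line is made up for illustration.

```python
# The keyword is pure ASCII, so both encodings give the same bytes.
keyword = "$Id$"
ascii_bytes = keyword.encode("ascii")
utf8_bytes = keyword.encode("utf-8")

# A made-up UTF-8 line with non-ASCII text around the keyword.
line = "Größe: $Id$ été".encode("utf-8")

# A byte-oriented search still finds the keyword exactly once.
keyword_count = line.count(ascii_bytes)

# Every byte of a multi-byte UTF-8 character has the high bit set,
# so none of them can collide with '$' (0x24).
multibyte_is_high = all(b >= 0x80 for b in "öé€".encode("utf-8"))
```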
The rest of an expanded keyword is either auto-generated or taken from the OS.
The auto-generated part is ASCII, so it is UTF-8 compatible. The part taken
from the OS, such as paths, should either be ASCII, or the server's OS had
better use UTF-8 as its encoding.
For UTF-16, there's an extremely small chance that characters encoded in
UTF-16 are represented as bytes which, read as ASCII, form a CVS keyword byte
sequence that would then be substituted.
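To illustrate how such a collision could look, here is a small Python sketch. The two code points are picked by hand purely so that their UTF-16-LE bytes spell `$Id$`; they are not taken from any real file, and the pair would also have to sit at the right byte offset for a byte-oriented tool to see the keyword.

```python
# Two hand-picked code points: U+4924 and U+2464.
# In UTF-16-LE, U+4924 is the bytes 0x24 0x49 ("$I")
# and U+2464 is the bytes 0x64 0x24 ("d$").
colliding_text = "\u4924\u2464"
le_bytes = colliding_text.encode("utf-16-le")

# Read byte by byte, this looks like an unexpanded keyword.
looks_like_keyword = le_bytes == b"$Id$"

# The same text in UTF-8 contains no '$' byte at all.
utf8_has_dollar = b"$" in colliding_text.encode("utf-8")
```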
>
> > Now on the core. UTF-8 files needn't be binary files, in fact, if you want
> > normal CVS behaviour in the way you're used to it for ASCII text files, they
> > mustn't be binary files.
>
> Yes. And that was the point of my original reply. But you've certainly
> worded it better.
>
> > Differences occur with extended encodings like ISO-8859-x (e.g. ISO-8859-1 or
> > ISO-8859-15 etc.) or Windows CP-* (e.g. Windows CP-1252). In these encodings,
>
> With CVSNT the file will be checked in/out in UCS-2 (or UTF-16) encoding
> and internally stored as UTF-8 by the server. You can also use an
> extended encoding -- any encoding supported by the client-side iconv
> library can be used. This allows you to specify that a file uses
> ISO-8859-1 and have it converted (by iconv) to the locale used by the
> current client. This way a single user can checkout 10 files that each
> use different extended encodings and not have to change their
> environment variable for each file (and work out what to change it to).
I see many issues regarding configuration and script usage (these are fairly
easy to solve), and I see broken behaviour regarding binary files. What if a
user adds a binary file and forgets -kb? Currently this is almost never a
problem: just run admin -kb and you're done. The chances that the binary file
would have been corrupted by CVS are extremely low (see my response to Arno
Shuring's post for a mathematical discussion of the chance).
> > The Unicode thingy in CVSNT is just a hack to work around operating system
> > issues regarding MS Windows.
>
> No (but it helps this too) - see your own next comment.
>
> > UTF-16 in fact can be problematic. Normal keyword substitution is likely to
> > fail at least with some older versions of CVS.
>
> Not just keyword substitution, but merges and diffs, line endings etc
> too.
I don't see problems regarding line endings. In UTF-16 it's e.g. 0x00 0x0A or
0x0A 0x00 (depending on whether it's BE or LE) instead of 0x0A. So CVS will
split a char during a diff on UTF-16-LE, but that's not a problem because it
will be the same for every line, so if lines are put together again, the 0x00
and 0x0A bytes will be attached again.
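A rough Python sketch of that argument, assuming a tool that cuts lines after every raw 0x0A byte (the helper `byte_lines` is hypothetical and only stands in for such a tool, it is not CVS code):

```python
def byte_lines(data: bytes) -> list[bytes]:
    """Cut after every raw 0x0A byte, keeping the 0x0A with its line."""
    parts = data.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:  # trailing bytes after the last 0x0A, if any
        lines.append(parts[-1])
    return lines

text = "first\nsecond\n"
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

le_lines = byte_lines(le)
be_lines = byte_lines(be)

# In LE the newline is 0x0A 0x00, so the 0x00 half ends up at the
# start of the following "line" and the char is split.
le_second_starts_with_nul = le_lines[1][:1] == b"\x00"

# In BE the newline is 0x00 0x0A, so every line keeps whole 2-byte units.
be_lines_whole = all(len(line) % 2 == 0 for line in be_lines)

# Putting the lines back together restores the original bytes either way.
le_roundtrip_ok = b"".join(le_lines) == le
be_roundtrip_ok = b"".join(be_lines) == be
```

This only covers lines that are reassembled in their original order, which is the case described above.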
> All versions of CVS other than CVSNT need to treat UTF-16 files as
> binary.
I agree, as long as the chars within the UTF-16 files include code points
from those planes that are likely to result in keywords when looking at the
resulting byte sequence.
> > uses wchar instead of char for keyword substitution. UTF-16 isn't in
> > widespread use, so I didn't care about that yet.
>
> UTF-16 is the native internal representation of text in the NT based
> versions of Windows (NT/2000/XP/2003) and in the Java and .NET bytecode
> environments, as well as in Mac OS X's Cocoa and Core Foundation
> frameworks.
I was unclear in my wording regarding "use". By "use" I meant use as an
encoding for text files. The internal representation (e.g. char in Java) does
not fall under that category.
Cu :)
--
Christian Hujer
ITCQIS GmbH
E-Mail: address@hidden
WWW: http://www.itcqis.com/