Re: 23.0.60; [nxml] BOM and utf-8

emacs-devel

From:	David Kastrup
Subject:	Re: 23.0.60; [nxml] BOM and utf-8
Date:	Mon, 19 May 2008 22:57:20 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

"Stephen J. Turnbull" <address@hidden> writes:

> David Kastrup writes:
>  > "Stephen J. Turnbull" <address@hidden> writes:
>
>  > > In any case, maintaining faithfulness of representation is simply not
>  > > possible, as you point out
>  > 
>  > With some coding systems.  But the latin-* and utf-* can maintain
>  > the binary stream since their coding is required to be canonical in
>  > the standard.
>
> latin-* will do so because of their extremely limited range.  It's
> unfortunate that programmer intuitions about text have been
> Americanized (== drastically limited) by these encodings.
>
> utf-* can maintain representation in the very limited sense you have
> in mind, and I know that is very useful to you in dealing with non-
> conforming applications like TeX.  However, you still run into the
> problem that faithfulness of representation is not a goal of Unicode.

I am not interested in the "goal of Unicode" but in that of Emacs.
Unicode is about text files.  But Emacs communicates via byte streams
and those are not necessarily text, or necessarily all text.
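
The distinction between a text file and a byte stream can be illustrated
outside Emacs.  The following is only a sketch in Python, not a
description of how Emacs decodes anything: the sample bytes and the
variable names are made up for illustration, and Python's
"surrogateescape" error handler merely stands in for any scheme that
keeps undecodable bytes around instead of dropping them.

```python
# Hypothetical input: a UTF-8 BOM, some markup, and one byte (0xFF)
# that can never occur in valid UTF-8.
raw = b"\xef\xbb\xbf<doc>\xff</doc>"

# A strict, text-oriented decode rejects the stream outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lossy decode succeeds, but the invalid byte becomes U+FFFD and the
# original stream can no longer be reproduced.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

# Escaping the undecodable byte keeps the stream reproducible: decoding
# and re-encoding with the same handler returns the input unchanged.
text = raw.decode("utf-8", errors="surrogateescape")
faithful = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, lossy == raw, faithful == raw)
```

The third variant is the one that matters for the argument here: the
data stays editable as text, yet writing it back does not garble the
parts that were never text in the first place.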

>  > > It's also not at all obvious that that is a very useful
>  > > requirement when dealing with a character-oriented standard like
>  > > Unicode or XML, since you can expect many applications to
>  > > canonicalize the text "behind your back".
>  > 
>  > That's not an issue.
>
> What do you mean by "that's not an issue?"  How can you know when I
> haven't named the application?

Because we are not talking about what arbitrary applications may do, but
what Emacs should do.  There may be other applications that tend to
garble byte streams, and there might even be some Elisp applications
that garble byte streams.  But that does not mean that the Emacs core
should feel nonchalant about garbling byte streams.

>  > Also you can load, edit and save a text file in collaborative
>  > environments, and the diffs/patches will be just in the edited
>  > areas (this will supposedly work better with Emacs-23 than
>  > Emacs-22).  Those are quite important features.
>
> Sure, and Emacs must provide coding systems that preserve them, and
> generally use those coding systems by default.  Did anybody say
> otherwise?

So what was your point supposed to be?

>  > > Users should get used to it, and we should document how to force
>  > > Emacs to error rather than do anything behind your back for those
>  > > who need binary faithfulness rather than text faithfulness.
>  > 
>  > Since binary faithfulness implies text faithfulness, there is no
>  > reason not to do the right thing instead of erroring out.
>
> "There is no reason"?  How arrogant of you!  Rather, "David Kastrup
> lacks the knowledge of the reasons."  Here are three examples:
>
> Binary faithfulness may imply breaking text programs.  For example,
> `forward-char' and `replace-string' will give surprising results in a
> buffer using Unicode internally that contains Unicode in NFD
> normalization (and these anomalies will be noticeable in all Western
> European languages excluding English).

So forward-char and replace-string should be made to work as expected on
non-normalized texts.  One could even normalize texts and use text
properties in order to restore the non-normalized form when
communicating externally.
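
Both halves of this exchange can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not Emacs code: the
function names (clusters, normalize_with_properties, buffer_text,
external_text) are hypothetical, the "text property" is modelled as a
plain tuple field, and clusters are normalized one at a time, which is
a simplification of full NFC normalization.

```python
import unicodedata

nfc = "caf\u00e9"                        # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' followed by U+0301

# The surprise: the decomposed form is one character longer, and a
# search for the precomposed 'é' finds nothing to replace.
surprise = (len(nfc), len(nfd), nfd.replace("\u00e9", "E") == nfd)

def clusters(text):
    """Split text into base characters with their trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def normalize_with_properties(text):
    """Return (normalized, original-or-None) cells, one per cluster."""
    cells = []
    for c in clusters(text):
        n = unicodedata.normalize("NFC", c)
        cells.append((n, c if n != c else None))
    return cells

def buffer_text(cells):
    """What editing commands would see: normalized text only."""
    return "".join(n for n, _ in cells)

def external_text(cells):
    """What gets written out: the recorded original form where one exists."""
    return "".join(o if o is not None else n for n, o in cells)

original = "cafe\u0301 na\u00efve"       # mixes decomposed and precomposed
cells = normalize_with_properties(original)
print(surprise, buffer_text(cells) == "caf\u00e9 na\u00efve",
      external_text(cells) == original)
```

Untouched clusters are written back in exactly the form in which they
arrived, so the byte stream is reproduced, while anything operating on
the normalized view counts and matches characters as a user would
expect.  How edited clusters should be written out is a policy question
the sketch leaves open.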

> Binary faithfulness may imply inefficiency.  For example, files need
> not be normalized, which would imply keeping a copy of the whole file
> and doing a Unicode diff to determine which parts of the file need to
> be saved from the buffer and which parts from the saved copy.

That sounds more like "binary faithfulness may inspire stupidity".  Of
course one needs to look for reasonable implementations.  Inefficiency
has not kept us from moving the Emacs-20.1 MULE model (where buffer and
string offsets were byte-oriented) to the 20.7 model (no idea where the
transition happened exactly) with character-based buffer and string
offsets.  Sometimes one has to balance sanity and efficiency.  And there
are ways for getting a reasonable amount of efficiency back.

> Binary faithfulness may be incompatible with other user demands, for
> example if a user introduces Latin-2 characters into a Latin-9 text.

Why do you think we switched to utf-8 internally and got rid of latin
unification?
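
The Latin-2 into Latin-9 case is easy to reproduce.  The snippet below
is a sketch in Python rather than anything Emacs does; the sample string
and the helper name encodable are made up for illustration.  It only
shows that the conflict belongs to the narrow external encodings, not to
a wide internal one.

```python
# 'œ' (U+0153) exists in Latin-9 but not Latin-2;
# 'ő' (U+0151) exists in Latin-2 but not Latin-9.
mixed = "c\u0153ur \u0151r"

def encodable(text, codec):
    """Return True if every character of text fits into codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

results = {c: encodable(mixed, c) for c in ("iso8859_15", "iso8859_2", "utf-8")}
print(results)
```

Neither 8-bit charset can hold the mixed text, while UTF-8 holds it
without any unification step, and a file that contains only Latin-9
characters still encodes back to Latin-9 unchanged.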

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
