emacs-devel

Re: utf8 and emacs text/string multibyte representation


From: David Kastrup
Subject: Re: utf8 and emacs text/string multibyte representation
Date: Sat, 01 Nov 2014 19:41:22 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)

"Stephen J. Turnbull" <address@hidden> writes:

> Eli Zaretskii writes:
>
>  > > Been discussing this elsewhere, and it's come to my attention that not
>  > > only do all unicode code-points not fit into UTF-16, but all unicode
>  > > characters don't fit into unicode code-points :-).  Presumably this is
>  > > why emacs expanded to 22bits?
>  > 
>  > Not sure what you mean here.  All Unicode characters do fit into the
>  > Unicode codepoint space.  Emacs extends that codepoint space beyond 22
>  > bits because it needs to support cultures which don't want unification
>  > yet.
>
> I suppose he means grapheme complexes, such as various accented
> characters that can be constructed from composing characters but do
> not have precomposed forms in Unicode.  As you say, that's not why
> Emacs extended the code space.
>
>  > > Did you consider leaving aref, char-code and code-char alone and writing
>  > > unicode functions on top of these, i.e. unicode-length!=length, as
>  > > opposed to making aref itself do this translation under the hood,
>  > > thereby violating the expectation of O(1) access, (which is certainly
>  > > offered in other kinds of arrays, though it is questionable whether real
>  > > users actually expect this for strings)?
>
> Actually, originally Emacs allowed you to treat text (buffers and
> strings) either as sequences of characters or arrays of bytes, and
> this was a real bug-breeder (and why XEmacs chose the pain of the
> incompatible separation of integer type from character type).
>
> I'm not sure if the feature is present in modern Emacs, but at the
> very least the usage is so rare today that I'm unaware of any.

string-as-unibyte and string-as-multibyte most certainly are available
for converting between the two representations.  But the commands that
operate on unibyte and multibyte strings are the same, and likewise for
buffers.  I have no idea whether this is a vector for creating
inconsistent multibyte content.  I could imagine it to be, but then so
could user-created CCL programs for code conversion.
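
To make the "sequence of characters" versus "array of bytes" distinction
concrete, here is a minimal sketch in Python.  It is only an analogy: it
is not Emacs Lisp and says nothing about how Emacs stores text
internally, and all variable names are made up for illustration.  It
shows that the two views have different lengths, and that cutting the
byte view at an arbitrary offset can yield content that is no longer a
valid multibyte sequence.

```python
# Analogy only: Python str (code points) vs. bytes (UTF-8 encoded form).
text = "na\u00efve \U0001F600"   # "naive" with a diaeresis, plus one non-BMP character

as_chars = list(text)                 # character view: one entry per code point
as_bytes = text.encode("utf-8")       # byte view: the UTF-8 representation
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units

char_count = len(as_chars)
byte_count = len(as_bytes)

# Indexing the byte view is O(1), but an index is not a character position:
# cutting after 3 bytes splits the two-byte encoding of U+00EF.
truncated = as_bytes[:3]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

# Converting the whole byte view back is lossless.
round_trip_ok = as_bytes.decode("utf-8") == text

print(char_count, byte_count, utf16_units, truncated_is_valid, round_trip_ok)
```

The character outside the Basic Multilingual Plane takes four bytes in
UTF-8 and two 16-bit units in UTF-16, which is why the three counts
differ.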

-- 
David Kastrup