Re: utf8 and emacs text/string multibyte representation

emacs-devel

From:	Eli Zaretskii
Subject:	Re: utf8 and emacs text/string multibyte representation
Date:	Wed, 29 Oct 2014 18:19:07 +0200

> From: Camm Maguire <address@hidden>
> Cc: address@hidden,  address@hidden
> Date: Wed, 29 Oct 2014 11:55:13 -0400
> 
> I thought there would be a little more on the upside, say some benefit
> from having the internal representation be the same as that used in many
> external representations, at least on linux

Yes, that too.  Emacs originally used a very different internal
encoding (ISO-2022 based), and the switch to a UTF-8 based one was due
to the above.  In general, having a Unicode basis works better when
you want to support various Unicode-defined features, like the UCA
(Unicode Collation Algorithm) etc.

> and perhaps some algorithm coalescing with straightforward byte-wise
> operations.

Not sure what you mean here, please elaborate.  In general, many
operations with UTF-8 strings can use the usual string library
functions, as you probably know very well.
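
To illustrate with a minimal sketch (in Python rather than Emacs's C, and
with a made-up file name): every byte of a multi-byte UTF-8 sequence has
its high bit set, so a plain byte-wise search for an ASCII character can
never match in the middle of another character.

```python
# Hypothetical path containing non-ASCII characters.
text = "src/naïve/файл.txt"
data = text.encode("utf-8")

# Bytes of multi-byte sequences are all >= 0x80, so the ASCII byte for
# "/" can only match a real slash.  No decoding is needed for the search.
last_slash = data.rfind(b"/")
basename = data[last_slash + 1:].decode("utf-8")

print(last_slash, basename)
```

Note that `last_slash` is a byte offset, not a character index; the two
differ as soon as a non-ASCII character precedes the match.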

> Does every string access in emacs proceed through the utf8 decoder?

If you need to look at the character, yes.  E.g., if you need some
property of the character, you need to index the appropriate table by
that character's codepoint.  But in most operations that is not
needed.  You just need to recognize several specific characters, like
the null character, the slash, etc., most of which are ASCII.
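
As a rough sketch of what "looking at the character" involves, the
following decodes one character starting at a given byte offset.  It is
Python, not the actual Emacs code; the function name `decode_at` is
hypothetical, it assumes well-formed input, and it handles only standard
UTF-8 sequences of up to 4 bytes.

```python
def decode_at(data, i):
    """Return (codepoint, byte_length) for the character starting at byte i.

    Assumes data is well-formed UTF-8 and i points at a lead byte.
    """
    b0 = data[i]
    if b0 < 0x80:                 # ASCII: the byte is the codepoint
        return b0, 1
    if b0 >> 5 == 0b110:          # 110xxxxx: 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("byte %d is not a lead byte" % i)
    # Each continuation byte (10xxxxxx) contributes 6 payload bits.
    for b in data[i + 1:i + n]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, n
```

The ASCII branch returns immediately, which is why testing for a slash or
a null character costs no more than a byte comparison.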

> >> A cached internal pointer storing the last referenced codepoint
> >> offset makes access essentially O(1).
> >
> > We indeed maintain a cache for byte-to-character and character-to-byte
> > conversions.
> 
> How big is this cache?

Its size is dynamic, and depends on how frequently the conversion is
needed at positions that are far apart.  The cache stores
byte-to-char correspondences at such widely separated positions, and
Emacs uses binary search in between them.
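
One way such a cache could work is sketched below.  This is an
illustration only, not the Emacs implementation: the class name
`PositionCache` and the `stride` threshold are invented for the example,
and this sketch uses binary search to pick the nearest cached anchor and
then scans linearly from it.

```python
import bisect


class PositionCache:
    """Hypothetical char-to-byte position cache over UTF-8 bytes."""

    def __init__(self, data, stride=64):
        self.data = data
        self.stride = stride      # record an anchor after scans this long
        self.chars = [0]          # sorted character positions of anchors
        self.bytes_ = [0]         # matching byte positions

    def char_to_byte(self, charpos):
        # Binary search for the closest anchor at or before charpos.
        k = bisect.bisect_right(self.chars, charpos) - 1
        c, b = self.chars[k], self.bytes_[k]
        start = c
        # Walk forward one character at a time from the anchor.
        while c < charpos:
            if b >= len(self.data):
                raise IndexError("character position out of range")
            b += 1
            # Skip continuation bytes (10xxxxxx) of the current character.
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
        # A long scan means this region had no nearby anchor: add one.
        if c - start >= self.stride:
            j = bisect.bisect_left(self.chars, c)
            self.chars.insert(j, c)
            self.bytes_.insert(j, b)
        return b
```

The number of anchors grows only when lookups land far from any known
position, which matches the "dynamic size" behaviour described above.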

> >> Yet setting string elements can trigger reallocations/memmove
> >> operations.
> >
> > Emacs, as every editor, needs to handle this efficiently anyway,
> > because editing operations rarely leave the buffer size unchanged.  So
> > Emacs uses a gap to minimize reallocations.
> >
> 
> But no gap in strings, right (i.e. just buffers)?

Right.
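
For readers unfamiliar with the technique, here is a minimal gap buffer
sketch.  It is written in Python for brevity and is not how Emacs's C
code is organized; the class `GapBuffer`, its method names and the
default gap size are all assumptions made for the example.  Positions
are byte positions.

```python
class GapBuffer:
    """Toy gap buffer: text is stored with a movable hole at the edit point."""

    def __init__(self, text=b"", gap=16):
        self.buf = bytearray(text) + bytearray(gap)
        self.gap_start = len(text)      # first byte of the gap
        self.gap_end = len(self.buf)    # first byte after the gap

    def _move_gap(self, pos):
        # Shift only the text between the gap and pos, not the whole buffer.
        if pos < self.gap_start:
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def insert(self, pos, data):
        self._move_gap(pos)
        if len(data) > self.gap_end - self.gap_start:
            # Gap too small: enlarge it.  This is the only reallocation.
            grow = len(data) + 16
            self.buf[self.gap_end:self.gap_end] = bytearray(grow)
            self.gap_end += grow
        self.buf[self.gap_start:self.gap_start + len(data)] = data
        self.gap_start += len(data)

    def delete(self, pos, n):
        # Deleting n bytes after pos just widens the gap over them.
        self._move_gap(pos)
        self.gap_end += n

    def contents(self):
        return bytes(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

Consecutive edits at the same position only fill or widen the gap, so
typing in one place moves no text at all.  A string has no such gap, so
replacing a character with one of a different byte length means shifting
or reallocating the bytes after it.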
