bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255

From:	Nelson H. F. Beebe
Subject:	bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
Date:	Tue, 9 Mar 2010 12:51:31 -0700 (MST)

When emacs-23 came out and I began to use it, I noticed problems in
some of my extensive locally-written emacs code.

I've been far too busy to try to track down why, and sometimes, the
problems were resolved simply by rerunning byte-compile-file.

This morning, I set out to track down the source of one of the
problems in a function that I use a lot, and eventually narrowed it to
the failure of functions like these:

    (string-equal (buffer-substring (point) (1+ (point))) "\377")
    (looking-at "\377")

In emacs-22 and earlier, if the character at point is octal 377
(decimal 255, hexadecimal 0xff), each of these expressions returns t.
In emacs-23, each returns nil.  Further testing shows identical
behavior for characters in the decimal range 128--255 (octal
\200--\377).
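
One way to see the discrepancy in isolation is to compare the string
literal "\377" with a string built from the character ?\377.  The
sketch below assumes Emacs 23 or later; the function name
my-compare-255-strings is hypothetical and is not part of the code
discussed in this report.  It says nothing about how the character
got into the buffer in the first place, which matters just as much.

```elisp
;; Illustration only, assuming Emacs 23 or later.
;; `my-compare-255-strings' is a hypothetical name, not from the report.
(defun my-compare-255-strings ()
  "Return facts about two ways of writing \"character 255\" as a string."
  (let ((from-escape "\377")          ; octal escape inside a string literal
        (from-char (string ?\377)))   ; string built from the character 255
    (list (multibyte-string-p from-escape) ; nil: unibyte, holds the byte 255
          (multibyte-string-p from-char)   ; t: holds the character U+00FF
          (string-bytes from-escape)       ; 1
          (string-bytes from-char)         ; 2: U+00FF needs two bytes
          (string-equal from-escape from-char)))) ; nil: not the same string

(my-compare-255-strings)
```

If the buffer holds the character U+00FF, a comparison against the
unibyte literal "\377" fails for the same reason the last element
above is nil.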

I suspect the reason is this comment in the NEWS file:

    The internal encoding used for buffers and strings is now
    Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
    for this).  This encoding is backward-compatible with Unicode's UTF-8
    encoding.  The internal encoding previously used by Emacs,
    `emacs-mule', is still available for reading and writing files.

The code in question uses the character ?\377 as a unique sentinel
that terminates the function's processing.  It needs to be a
nonprintable character that is not used in normal text files, and I
found that changing it to ?\177 (ASCII DELete) made the code work
properly.  That change is transparent to older emacs versions, so in
this case, it is harmless.  Nevertheless, since the technique of using
data sentinels is an ancient practice in many programming languages, I
suspect that my own code is not the only Emacs Lisp code to be
affected by the change.
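
For readers unfamiliar with the technique, here is a minimal sketch of
a sentinel test written against character codes rather than string
literals, using the ?\177 replacement mentioned above.  All of the
names (my-sentinel, my-at-sentinel-p, my-text-up-to-sentinel) are
hypothetical; this is not the actual code affected, only an
illustration under the assumption that the sentinel is ASCII DEL.

```elisp
;; Hypothetical names throughout; a sketch, not the code from the report.
(defconst my-sentinel ?\177
  "Sentinel character: ASCII DEL, which is a single byte in every encoding.")

(defun my-at-sentinel-p ()
  "Return non-nil if the character at point is `my-sentinel'."
  (eq (char-after) my-sentinel))

(defun my-text-up-to-sentinel ()
  "Return the text from point up to, but not including, the next sentinel.
Return nil if no sentinel follows point."
  (save-excursion
    (let ((start (point)))
      (when (search-forward (string my-sentinel) nil t)
        ;; Point is now just after the sentinel; step back over it.
        (buffer-substring-no-properties start (1- (point)))))))

;; Usage: collect the text that precedes the sentinel.
(with-temp-buffer
  (insert "payload" my-sentinel "rest")
  (goto-char (point-min))
  (my-text-up-to-sentinel))   ; returns "payload"
```

Comparing with char-after avoids building a one-character string at
all, so the unibyte versus multibyte question does not arise for the
test itself.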

The question for this list is this:

    If UTF-8 is used internally in the buffer text, then why are
    numeric representations of unprintable characters in search
    strings apparently not translated the same way?

In all of my Emacs Lisp source code files, the character set is plain
ASCII, which is a proper subset of UTF-8, requiring only a single byte
per character.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
