bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
From: Nelson H. F. Beebe
Subject: bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
Date: Tue, 9 Mar 2010 12:51:31 -0700 (MST)
When emacs-23 came out and I began to use it, I noticed problems in
some of my extensive locally-written emacs code.
I've been far too busy to try to track down why, and sometimes, the
problems were resolved simply by rerunning byte-compile-file.
This morning, I set out to track down the source of one of the
problems in a function that I use a lot, and eventually narrowed it to
the failure of functions like these:
(string-equal (buffer-substring (point) (1+ (point))) "\377")
(looking-at "\377")
In emacs-22 and earlier, if the character at point is octal 377
(decimal 255, hexadecimal 0xff), both expressions return t. In
emacs-23, both return nil. Further testing shows identical behavior
for all characters in the decimal range 128--255 (octal \200--\377).
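One way to probe the difference is to look at how the literal and the
buffer text are each represented. The following is only a sketch:
`my-describe-char-at-point' is a hypothetical helper name, not part of
Emacs, and the conversion with string-to-multibyte assumes emacs-23 or
later.

```elisp
;; Hypothetical diagnostic helper (not part of Emacs).
;; It reports the character code at point next to the representation
;; of the literal "\377", so the two can be compared by eye.
(defun my-describe-char-at-point ()
  "Return a plist describing the char at point and the literal \"\\377\"."
  (list
   ;; Character code at point, or nil at end of buffer.
   :char-code (char-after)
   ;; Non-nil when the current buffer holds multibyte text.
   :buffer-multibyte enable-multibyte-characters
   ;; Whether the reader produced a multibyte string for the literal.
   :literal-multibyte (multibyte-string-p "\377")
   ;; Code of the single element of the literal as written.
   :literal-code (aref "\377" 0)
   ;; Code of the same element after conversion to a multibyte string.
   :literal-as-multibyte-code (aref (string-to-multibyte "\377") 0)))
```

Evaluating (my-describe-char-at-point) with point on the suspect
character shows whether the code in the buffer and the codes derived
from the literal agree.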
I suspect the reason is this comment in the NEWS file:
The internal encoding used for buffers and strings is now
Unicode-based and called `utf-8-emacs' (`emacs-internal' is an alias
for this). This encoding is backward-compatible with Unicode's UTF-8
encoding. The internal encoding previously used by Emacs,
`emacs-mule', is still available for reading and writing files.
The code in question uses the character ?\377 as a unique sentinel
that terminates the function's processing. It needs to be a
nonprintable character that is not used in normal text files, and I
found that changing it to ?\177 (ASCII DELete) made the code work
properly. That change is transparent to older emacs versions, so in
this case, it is harmless. Nevertheless, since the technique of using
data sentinels is an ancient practice in many programming languages, I
suspect that my own code is not the only Emacs Lisp code to be
affected by the change.
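For illustration, the sentinel technique with the ASCII DEL workaround
can be sketched as follows. This is not my actual code: the names
`my-sentinel', `my-at-sentinel-p', and `my-count-chars-before-sentinel'
are hypothetical, and the test compares character codes with char-after
rather than comparing strings.

```elisp
;; Hypothetical names throughout; a sketch of the sentinel technique
;; using an ASCII sentinel instead of a character in 128..255.
(defconst my-sentinel ?\177
  "ASCII DEL, used here as an end-of-data marker.")

(defun my-at-sentinel-p ()
  "Return non-nil if the character at point is `my-sentinel'."
  ;; `char-after' returns nil at end of buffer, so this is nil there.
  (eq (char-after) my-sentinel))

(defun my-count-chars-before-sentinel ()
  "Count characters from point up to, but not including, the sentinel.
If no sentinel follows point, count up to the end of the buffer."
  (save-excursion
    (let ((n 0))
      (while (not (or (eobp) (my-at-sentinel-p)))
        (forward-char 1)
        (setq n (1+ n)))
      n)))
```

Because ?\177 is an ASCII character, the comparison does not depend on
how characters in 128..255 are represented internally.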
The question for this list is this:
If UTF-8 is used internally in the buffer text, then why are
numeric representations of unprintable characters in search
strings apparently not translated the same way?
In all of my Emacs Lisp source code files, the character set is plain
ASCII, which is a proper subset of UTF-8, requiring only a single byte
per character.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe@math.utah.edu -
- 155 S 1400 E RM 233 beebe@acm.org beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------