bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255


From: Eli Zaretskii
Subject: bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
Date: Thu, 07 Jul 2016 19:21:47 +0300

> From: npostavs@users.sourceforge.net
> Date: Wed, 06 Jul 2016 19:52:16 -0400
> Cc: "Nelson H. F. Beebe" <beebe@math.utah.edu>, 5700@debbugs.gnu.org
> 
> With Emacs 24/25, using "\u00FF" works:
> 
> (string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
> (looking-at "\u00FF")
> 
> Seems to be another instance of the unibyte vs multibyte string escape
> syntax thing:
> 
>        You can also use hexadecimal escape sequences (‘\xN’) and octal
>     escape sequences (‘\N’) in string constants.  *But beware:* If a
>     string constant contains hexadecimal or octal escape sequences, and
>     these escape sequences all specify unibyte characters (i.e., less
>     than 256), and there are no other literal non-ASCII characters or
>     Unicode-style escape sequences in the string, then Emacs
>     automatically assumes that it is a unibyte string.  That is to say,
>     it assumes that all non-ASCII characters occurring in the string are
>     8-bit raw bytes.
> 
> Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> > which seems acceptable, whereas under Emacs-23 we have:
> >
> [...]
> >   (multibyte-string-p "\377")   prints as    "\377"
> 
> In 23.4 it returns nil

Yes.

The other significant piece of the puzzle is described in this text
from the ELisp manual:

     For technical reasons, a unibyte and a multibyte string are ‘equal’
     if and only if they contain the same sequence of character codes
     and all these codes are either in the range 0 through 127 (ASCII)
     or 160 through 255 (‘eight-bit-graphic’).  However, when a unibyte
     string is converted to a multibyte string, all characters with
     codes in the range 160 through 255 are converted to characters with
     higher codes, whereas ASCII characters remain unchanged.  Thus, a
     unibyte string and its conversion to multibyte are only ‘equal’ if
     the string is all ASCII.  Character codes 160 through 255 are not
     entirely proper in multibyte text, even though they can occur.  As
     a consequence, the situation where a unibyte and a multibyte string
     are ‘equal’ without both being all ASCII is a technical oddity that
     very few Emacs Lisp programmers ever get confronted with.  *Note
     Text Representations::.

This was one of the significant changes in Emacs 23, and I think it is
the main factor for the changed behavior reported by Nelson.
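
To make the two manual excerpts concrete, here is a minimal sketch,
assuming Emacs 23 or later.  The function names are hypothetical and
exist only for this illustration.  The first function compares the
string read from an octal escape with the one read from a Unicode
escape; the second repeats the "\u00FF" check quoted above in a
temporary multibyte buffer.

```elisp
;; Hypothetical helper names, for illustration only.
(defun bug5700-string-facts ()
  "Return a list of facts about the octal and Unicode spellings of 255."
  (let ((octal "\377")        ; octal escape only: read as a unibyte string
        (unicode "\u00FF"))   ; Unicode escape: read as a multibyte string
    (list (multibyte-string-p octal)
          (multibyte-string-p unicode)
          (aref octal 0)      ; the raw 8-bit byte 255
          (aref unicode 0)    ; the character U+00FF, also code 255
          ;; Converting the unibyte string to multibyte maps the raw
          ;; byte to a character with a higher code.
          (aref (string-to-multibyte octal) 0))))

(defun bug5700-match-in-buffer ()
  "Return non-nil if the Unicode escape matches U+00FF in a buffer."
  (with-temp-buffer           ; temporary buffers are multibyte by default
    (insert "\u00FF")
    (goto-char (point-min))
    (and (looking-at "\u00FF")
         (string-equal (buffer-substring (point) (1+ (point)))
                       "\u00FF"))))
```

Evaluating (bug5700-string-facts) should give (nil t 255 255 4194303):
the octal form is unibyte, and its conversion to multibyte no longer
has code 255, so it is not the same character as U+00FF.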




