bug-gnu-emacs

bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]


From: Eli Zaretskii
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps [PATCH]
Date: Fri, 28 Jun 2019 16:03:54 +0300

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Fri, 28 Jun 2019 14:41:51 +0200
> Cc: 3687@debbugs.gnu.org
> 
> Let's assume the following semantics as desirable:
> 
> 1. All characters and raw bytes (up to regexp syntax) match themselves no 
> matter whether they are given as literals or in character alternatives.
> 2. All raw bytes C match themselves and nothing else no matter whether the 
> pattern or target string/buffer are unibyte or multibyte.
> 3. Ranges from ASCII to raw bytes work as expected and do not contain Unicode 
> characters above U+007F.
> 4. Ranges from non-ASCII Unicode characters to raw bytes make no sense and 
> are treated as empty.
> 
> Here is a patch.

Thanks.

However, I don't want to look at the patch before we discuss and agree
on the principles.  So please consider expanding your principles to
answer the following questions:

 1. What do you mean by "raw bytes"?  Is #xab a raw byte or the Unicode
    codepoint U+00AB?  IOW, how do we distinguish, in a regexp, between a
    raw byte and a character whose Unicode codepoint is that byte's
    value?  And how does one go about concocting a regexp that matches
    raw bytes in a unibyte or multibyte buffer or string?

 2. What is meant by "ranges from ASCII to raw bytes"?  Which
    characters are included in such ranges?

 3. If ranges from non-ASCII characters to raw bytes make no sense,
    how would one go about specifying a range that includes all the
    characters and raw bytes supported by Emacs?

When we discuss these issues, let's please be on the same page
regarding the handling of raw bytes in current Emacs.  Specifically:

  . Raw bytes are internally treated as "characters" whose Unicode
    codepoints are in the range [#x3fff00..#x3fffff].
  . The internal representation of raw bytes in buffers and strings
    uses 2-byte sequences that begin with #xc0 or #xc1.
  . Emacs jumps through hoops to never expose the above internals to
    the external world.  Thus, any encoding of a string with raw bytes
    will convert them to their single-byte representation, where they
    are indistinguishable from the characters which have the same
    codepoints, and many operations other than encoding also
    silently perform these conversions.
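
To make the three points above concrete, here is a small Python model of
the arithmetic involved.  It is only a sketch, not Emacs code: the
function names are hypothetical, the offset #x3fff00 is taken from the
range quoted above, and the exact bit layout of the 2-byte form is an
assumption chosen to be consistent with "begins with #xc0 or #xc1".
Latin-1 stands in for "some single-byte encoding" purely for
illustration.

```python
# Assumed offset between a raw byte and its internal "character" code.
RAW_BYTE_OFFSET = 0x3FFF00


def raw_byte_to_char(b):
    """Map a raw byte (0x80..0xFF) to its internal character code."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are modeled as raw bytes")
    return RAW_BYTE_OFFSET + b


def internal_bytes(b):
    """Model the 2-byte internal form of a raw byte.

    Assumption: the lead byte is 0xC0 or 0xC1 (carrying bit 6 of the
    raw byte) and the trail byte carries the low six bits.
    """
    raw_byte_to_char(b)  # reuse the range check
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])


def externalize(code):
    """Model what leaves Emacs when a character code is encoded.

    Raw-byte characters collapse to their single byte; other codes are
    encoded as Latin-1 here (so only codes up to 0xFF are handled).
    """
    if RAW_BYTE_OFFSET + 0x80 <= code <= RAW_BYTE_OFFSET + 0xFF:
        return bytes([code - RAW_BYTE_OFFSET])
    return chr(code).encode("latin-1")
```

In this model the raw byte #xab and the character U+00AB are distinct
internally (#x3fffab versus #xab), yet externalize returns the same
single byte for both, which is the ambiguity question 1 above is
asking about.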




