bug#12054: 24.1; regression? font-lock no-break-space with nil nobreak-c

bug-gnu-emacs

From:	Drew Adams
Subject:	bug#12054: 24.1; regression? font-lock no-break-space with nil nobreak-char-display
Date:	Sat, 3 Nov 2012 12:01:29 -0700

> > Just why is it that the regexp "[\240]+" does not match this char?
> > Why should a character-alternative expression care whether the
> > representation is unibyte or multibyte?  Isn't that a bug?
> 
> When \240 occurs in a unibyte string, Emacs recognizes it as an
> eight-bit raw byte.  When converting unibyte strings to 
> multibyte, Emacs does not "unify" eight-bit raw bytes with
> Unicode characters #x80-#xff; they get their own code points,
> in this case #x3fffa0.

I think I understand this (but I might be misunderstanding).  The \240 in the
4-char ASCII regexp string "\240" is interpreted (read?) as a raw byte, not as
the char I wanted.

That is, the literal string in my code is read as a string that contains only a
single raw byte of octal 240 in place of the 4 chars \240 (and instead of as a
string with the multibyte char no-break space).  Is that right?
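
One way to check that reading from the Lisp side is sketched below.  It assumes
Emacs 23 or later (the version is an assumption, not something stated in this
thread), and the variable names are hypothetical.  It only inspects what the
reader produces for the two kinds of escape.

```elisp
;; Sketch, assuming Emacs 23 or later.  The variable names are hypothetical.

;; The Lisp reader turns the octal escape \240 into one element, not 4 chars.
(defvar my-octal-string "\240")

(length my-octal-string)               ; 1: a single element
(multibyte-string-p my-octal-string)   ; nil: the string is unibyte
(aref my-octal-string 0)               ; #xa0 (160): the raw byte value

;; For comparison, a string written with a Unicode escape is multibyte.
(defvar my-nbsp-string "\u00a0")

(multibyte-string-p my-nbsp-string)    ; t
(aref my-nbsp-string 0)                ; #xa0 (160): the no-break space char
```

Both `aref' calls return 160, so the difference lies in whether the string is
unibyte or multibyte, not in the number itself.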

And putting that together with Eli's statement about insertion ("'insert' treats
strings such as "\nnn" as unibyte strings"), I understand that the buffer text
after I type `C-q 240' contains a unibyte raw byte, and not the multibyte char
no-break space.

But in that case I do not understand why `C-u C-x =' says that it _is_ the
Unicode no-break space char.  And I do not understand why Yidong's font-lock
correction also shows that it is a no-break space char.

So I'm confused about what is actually in the buffer.  From the doc and from
Eli's statement, I gather that there is a unibyte raw byte (octal 240) at that
position.  But `C-u C-x =' and font-lock seem to tell me that there is a
(multibyte) no-break space char there.

If there is in fact a multibyte char there and the literal "\240" in my
font-lock sexp results in a unibyte raw byte search, that would explain the
mismatch.

But I still wonder about this motivation for the treatment of \nnn in literal
strings in Lisp code:

> (One reason for doing this is to allow unibyte strings to
> be specified using string constants in Emacs Lisp source code.)

I can see how that can be useful.  But I can also see how it would be useful to
have some way of using octal syntax to match multibyte chars.  Isn't there some
reasonable way to allow for both?

E.g. can I specify a multibyte string somehow, starting with octal syntax?  Is
there a way, for example, to use octal syntax to provide octal codes 0302 and
0240, which together encode U+00A0 in UTF-8?  [See below.]

Is there, for example, (or could there be added) a function that one can apply
to the unibyte string for \240 that would convert it to a string that DTRT wrt
multibyte?

So I could do something like this (assuming the function is available for older
Emacs versions too), where `foo' is the function:

(font-lock-add-keywords nil `((,(foo "\240+") (0 'foo t))) 'APPEND)

From the doc, I was thinking that perhaps `string-to-multibyte' would do the
trick, i.e., (string-to-multibyte "\240+") would return "\u00a0+" or the literal
Unicode char in a multibyte string.  But it returns "\240+".

I can understand that the actual chars in that input string are all ASCII, so
that makes sense, I guess.  But I was thinking from Yidong's statement above
that such a literal string in Lisp code gets read as a unibyte, raw-byte string.

Since that doesn't seem to be the case here (?), is there a function that will
convert "\240" (4 chars) to a string with just that one "eight-bit raw byte"
char?  I tried `read', but that didn't help.
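
The printed result "\240+" can hide what the string actually holds.  A small
sketch, assuming Emacs 23 or later and using a hypothetical variable name,
that looks at the characters instead of the printed form:

```elisp
;; Sketch, assuming Emacs 23 or later.  `my-converted' is a hypothetical name.
(defvar my-converted (string-to-multibyte "\240+"))

(multibyte-string-p my-converted)   ; t: the result is a multibyte string
(length my-converted)               ; 2: the converted byte and "+"
(aref my-converted 0)               ; #x3fffa0: the eight-bit raw byte char
(aref my-converted 1)               ; ?+ (43)

;; The first char is therefore not the no-break space, U+00A0.
(= (aref my-converted 0) #xa0)      ; nil
```

If that holds, the literal was read as one raw byte, and the conversion kept
it as a raw byte (at its own code point) instead of unifying it with U+00A0.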

I hope I'm just missing something, and that there is a function (or combination
of functions) to which I can pass the 4-char ASCII string "\240" (or the 8-char
string "\302\240") and that will return the proper multibyte string containing
the Unicode no-break space char.

Ideal would be such a function that works also in older Emacs versions.

...

OK, digging some more, it seems that this will do the trick:

(decode-coding-string "\302\240" 'utf-8)

That allows use of only octal syntax - good.  But it still doesn't solve the
problem for older Emacs versions - they raise the error (coding-system-error
utf-8).
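
A sketch of how that result could be turned into a regexp, assuming an Emacs
in which the `utf-8' coding system exists.  The name `my-nbsp-regexp' is
hypothetical, and the test string is only an illustration.

```elisp
;; Sketch, assuming an Emacs where the `utf-8' coding system exists.
;; `my-nbsp-regexp' is a hypothetical name.
(defvar my-nbsp-regexp
  (concat (decode-coding-string "\302\240" 'utf-8) "+"))

(multibyte-string-p my-nbsp-regexp)              ; t
(aref my-nbsp-regexp 0)                          ; #xa0: U+00A0, no-break space

;; The regexp matches a run of no-break spaces in a multibyte string.
(string-match my-nbsp-regexp "a\u00a0\u00a0b")   ; 1: match starts after "a"
(match-end 0)                                    ; 3

;; It could then be handed to font-lock in the same way as the earlier form
;; (the face `foo' comes from that form and must be defined):
;; (font-lock-add-keywords nil `((,my-nbsp-regexp (0 'foo t))) 'APPEND)
```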

Is there a way to use only octal syntax with older Emacs versions, so the
font-locking code highlights such a Unicode char in a file/buffer?

Judging by my current confusion, I am sure that my statements above must be full
of misconceptions.  I will be glad to be shown my misunderstanding and a simple
solution.
