emacs-devel

Re: Fwd: Re: Inadequate documentation of silly characters on screen.


From: Alan Mackenzie
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 08:20:40 +0000
User-agent: Mutt/1.5.9i

Morning, Stefan!

On Wed, Nov 18, 2009 at 08:27:24PM -0500, Stefan Monnier wrote:

> The integer 241 is used to represent the char ?ñ, but it's also used for
> many other things, one of them being to represent the byte 241 (tho such
> a byte can also be represented as the integer 4194289).

> Now strings come in two flavors: multibyte (i.e. sequences of chars) and
> unibyte (i.e. sequences of bytes).  So when you do:

>    M-: (setq nl "\n")
>    M-: (aset nl 0 ?ñ)
>    M-: (insert nl)

> The `aset' part may do two different things depending on whether `nl' is
> unibyte or multibyte: it will either insert the char ?ñ or the byte 241.
> In the above code the "\n" is taken as a unibyte string, tho I'm not
> sure why we made this arbitrary choice.

The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
displayed - when I do M-: (aset nl 0 ?ñ), I get

   "2289 (#o4361, #x8f1)" (Emacs 22.3)
   "241 (#o361, #xf1)"    (Emacs 23.1)

displayed in the echo area.  So my `aset' invocation is trying to write a
multibyte ?ñ into a unibyte string, and the character gets truncated from
#x8f1 to #xf1 in the process.  Surely this behaviour in Emacs 23.1 is a
bug?  Shouldn't we fix it before the pretest?  How about interpreting "\n"
and friends as multibyte or unibyte according to the prevailing flavour?
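
To make the unibyte/multibyte distinction concrete, here is a minimal
sketch, assuming a reasonably recent Emacs.  The names `my-string-flavour'
and `my-newline-to-n-tilde' are hypothetical helpers, not existing
functions.  The sketch converts the string to multibyte with
`string-to-multibyte' before calling `aset', so the character is stored
as a character rather than as a byte.  The character is written as #xF1
(241), which is ?ñ in Emacs 23 and later.

```elisp
;; Sketch only: `my-string-flavour' and `my-newline-to-n-tilde' are
;; hypothetical helper names used for illustration.

(defun my-string-flavour (s)
  "Return the symbol `multibyte' or `unibyte' describing string S."
  (if (multibyte-string-p s) 'multibyte 'unibyte))

(defun my-newline-to-n-tilde ()
  "Return a one-character multibyte string holding character 241.
The string starts out as a multibyte copy of \"\\n\", so `aset'
stores a character, not a byte."
  (let ((nl (string-to-multibyte "\n"))) ; fresh multibyte copy
    (aset nl 0 #xF1)                     ; 241, i.e. ?ñ in Emacs 23+
    nl))

;; The bare ASCII literal is unibyte; the converted copy is not.
(list (my-string-flavour "\n")
      (my-string-flavour (my-newline-to-n-tilde))
      (aref (my-newline-to-n-tilde) 0))
```

Checking `multibyte-string-p' before an `aset' is one way to find out
which of the two behaviours a given string will get.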

> If you give us more context (i.e. more of the real code where the
> problem shows up), maybe we can tell you how to avoid it.

OK.  I have my own routine to display regexps.  As a first step, I
translate \n -> ñ (and \t, \r, \f similarly).  This is how:

    (defun translate-rnt (regexp)
      "REGEXP is a string.  Translate any \t \n \r and \f characters
    to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
    to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
    The original string is modified."
      (let (ch pos)
        (while (setq pos (string-match "[\t\n\r\f]" regexp))
          (setq ch (aref regexp pos))
          (aset regexp pos                        ; <===================
                (cond ((eq ch ?\t) ?Î)
                      ((eq ch ?\n) ?ñ)
                      ((eq ch ?\r) ?®)
                      (t           ?£))))
        regexp))
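
For comparison, a variant that avoids `aset' altogether could look like
the sketch below.  It assumes that returning a fresh string is
acceptable, whereas the routine above modifies its argument.  The name
`translate-rnt-copy' is hypothetical.  It relies on
`replace-regexp-in-string' accepting a function as the replacement, and
on `string' producing a multibyte string for non-ASCII characters.

```elisp
;; Sketch only: `translate-rnt-copy' is a hypothetical, non-destructive
;; counterpart to `translate-rnt'.
(defun translate-rnt-copy (regexp)
  "Return a copy of REGEXP with \\t, \\n, \\r and \\f made visible.
The mapping is the same as in `translate-rnt', but REGEXP itself
is left unmodified."
  (replace-regexp-in-string
   "[\t\n\r\f]"
   (lambda (match)
     ;; MATCH is the one-character matched text; build its replacement.
     (let ((ch (aref match 0)))
       (string (cond ((eq ch ?\t) ?Î)
                     ((eq ch ?\n) ?ñ)
                     ((eq ch ?\r) ?®)
                     (t           ?£)))))
   regexp
   t      ; FIXEDCASE: do not alter the case of the replacement
   t))    ; LITERAL: insert the replacement text verbatim
```

A call such as (translate-rnt-copy "a\nb") builds its result by
concatenation, so the question of what `aset' does to a unibyte string
does not arise.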



> Usually, I recommend to stay away from `aset' on strings for various
> reasons, and it seems that it also helps avoid those tricky issues (tho
> it doesn't protect you from them completely).

Again, surely this is a bug?  These tricky issues should be dealt with in
the Lisp interpreter in a way that Lisp hackers don't have to worry
about.  Why do we have both unibyte and multibyte?  Is there any reason
not to remove unibyte altogether (though obviously not for 23.2)?

What was the change between 22.3 and 23.1 that broke my code?  Would it,
perhaps, be a good idea to reconsider that change?

>         Stefan

-- 
Alan Mackenzie (Nuremberg, Germany).



