Re: eight-bit char handling in emacs-unicode

emacs-devel

From:	Simon Josefsson
Subject:	Re: eight-bit char handling in emacs-unicode
Date:	Sat, 15 Nov 2003 04:04:05 +0100
User-agent:	Gnus/5.1003 (Gnus v5.10.3) Emacs/21.3.50 (gnu/linux)

Kenichi Handa <address@hidden> writes:

> In article <address@hidden>, Simon Josefsson <address@hidden> writes:
>> rfc2104.el now works, thanks.  But does the fix really have to
>> explicitly mention charsets like iso-latin-1?  Is there no way to
>> handle binary octet strings in emacs-unicode?  Preferably in a
>> portable way, that works on old Emacs versions and on XEmacs.
>
>>>  This is a typical problem of emacs-unicode in which
>>>  characters 128..255 are valid Unicode characters, thus, for
>>>  instance, (concat '(?a ?\300)) returns a multibyte string of
>>>  `a' and `À'.  But in the current Emacs, it returns a unibyte
>>>  string.
>>> 
>>>  I suspect a similar fix is necessary in several other
>>>  places.
>
>> Having a way to deal with data that is a pure single byte, without
>> involving coding systems, seems like a rather important thing to me.
>
> I agree with you.  Currently, I can think of these methods:

Can you think of one that would work on Emacs 21?  A stable idiom for
dealing with octets would be useful; forcing third-party packages to
try several methods can easily lead to unreadable code.
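
To illustrate the concern, here is a minimal sketch of the kind of
feature test a third-party package ends up carrying.  The helper name
`my-octets-to-string' is hypothetical.  It assumes that
`unibyte-string' exists in the running Emacs (to my knowledge it is
only available in later releases, not in Emacs 21) and otherwise falls
back to plain `concat', which, as noted in this thread, returns a
unibyte string in the current non-Unicode Emacs.

```elisp
;; Hypothetical helper, not part of Emacs: build an octet string from
;; a list of integers in the range 0..255.
(defun my-octets-to-string (octets)
  "Return a unibyte string whose bytes are OCTETS."
  (if (fboundp 'unibyte-string)
      ;; Assumed to be available only in later Emacs releases.
      (apply 'unibyte-string octets)
    ;; Fallback for an Emacs where `concat' on 128..255 already
    ;; yields a unibyte string.
    (concat octets)))

;; Usage: two octets, #x61 and #xC0.
(my-octets-to-string (list ?a #o300))
```

Every package that needs octets would have to repeat a test like this
one, which is why a single documented idiom would be preferable.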

> (1) Perhaps the easiest way.
>
> Check `default-enable-multibyte-characters' or a newly
> introduced variable `byte-as-byte' to decide whether an
> integer 128..255 must be treated as a Latin-1 char or a
> byte.   So,
> (concat '(?a ?\300)) => "aÀ" (multibyte string)
> (let ((byte-as-byte t))
>   (concat '(?a ?\300))) => "a\300" (unibyte string)
>
> (2) Introduce a new function `eight-bit-char'.
>
> It converts an argument to ascii or eight-bit-char.
> (eight-bit-char ?a) => 97
> (eight-bit-char ?\300) => 4194240
> Then,
> (concat (list ?a (eight-bit-char ?\300))) => "a\300"

Both would work for me, although superficially both look like quick
hacks to me.
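
For comparison, the mapping proposed in (2) can be sketched with
functions that, as far as I know, exist in later Unicode-based Emacs
releases: `unibyte-char-to-multibyte' and `multibyte-char-to-unibyte'.
The name `my-eight-bit-char' is hypothetical, and the sketch assumes
an Emacs in which raw bytes 128..255 map to the character codes
#x3FFF80..#x3FFFFF, which matches the value 4194240 quoted above.

```elisp
;; Hypothetical stand-in for the proposed `eight-bit-char'.
;; Assumes a Unicode-based Emacs; it will not work on Emacs 21.
(defun my-eight-bit-char (byte)
  "Return BYTE unchanged if it is ASCII, else its eight-bit character."
  (unibyte-char-to-multibyte byte))

(my-eight-bit-char ?a)      ; => 97
(my-eight-bit-char #o300)   ; => 4194240, i.e. #x3FFFC0

;; The mapping is reversible, so the original octet can be recovered.
(multibyte-char-to-unibyte (my-eight-bit-char #o300))   ; => 192
```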

> (3) Make a series of new functions (I think it's not good)
>
> concat vs concat-unibyte
> string vs string-unibyte
> aset vs aset-unibyte

I agree it isn't good.

> (4) Most drastic way (the cleanest but requires lots of work)
>
> The basic problem is that we don't distinguish a character
> (code) and a number.  So, we introduce a character object
> (like XEmacs).  The function `character' converts a
> character code into the corresponding character object.  The
> Lisp reader always generates a character object for ?a,
> ?\300, etc.   So:
>  (concat '(?a ?\300)) => "aÀ"
>  (concat '(?a #o300)) => "a\300"
>  (concat (list ?a (character #o300))) => "aÀ"
>  (concat (list ?a #o300 (character #o300))) => "a\300À"
>
> Note: (character X) == (decode-char 'ucs X)

This would be nice.  Characters aren't numbers (unless within the
internal representation, but the internal representation should be
hidden), so separating the two types is useful.  So to be consistent
with that, I think your `character' function should be called
`ucs-character' or similar.
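
A minimal sketch of that naming, built on the equivalence with
`decode-char' noted above.  The name `my-ucs-character' is
hypothetical, and the sketch assumes a Unicode-based Emacs.  Note that
it does not provide the type separation of (4): without a distinct
character object, the result is still an ordinary integer.

```elisp
;; Hypothetical name; only a wrapper around `decode-char'.
(defun my-ucs-character (code)
  "Return the character for the Unicode code point CODE."
  (decode-char 'ucs code))

(let ((ch (my-ucs-character #o300)))
  (list (integerp ch)                          ; still a plain number
        (multibyte-string-p (string ch))))     ; string of `À' is multibyte
```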

>> It started now, but when I entered a summary buffer it crashed:
>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591
>> 1591                      char_ranges[n_char_ranges++] = c;
>> (gdb) bt
>> #0  0x081a3c81 in skip_chars (forwardp=1, string=160, lim=36) at syntax.c:1591
>
> I just tried gnus but I couldn't reproduce it.  So, I need
> more help.  Could you show me the results of the following?
>
> (gdb) p n_char_ranges
> (gdb) p c
> (gdb) p string
> (gdb) xstring
> (gdb) p *$

I'll try to get time to try emacs-unicode-2 more, but no promises.

Thanks.
