Re: lynx-dev changes for Japanese (was: dev.16 patch)
From: Hataguchi Takeshi
Subject: Re: lynx-dev changes for Japanese (was: dev.16 patch)
Date: Tue, 14 Dec 1999 00:04:38 +0900 (JST)
Thank you very much for your comment.
On Sun, 12 Dec 1999, Klaus Weide wrote:
> What I was specifically thinking of was the discovery of
> SUPPORT_MULTIBYTE_EDIT. (I was recently looking at LYStrings.c to
> get an idea how hard it would be to add better support for UTF-8 to
> the line-editor. Hadn't paid much attention to those #ifdef'd sections
> there before - there's so much of it, I tend to blend it out.)
> SUPPORT_MULTIBYTE_EDIT seems to be defined only in two of the
> Windows-specific makefiles, makefile.msc and makefile.bcb, and
> nowhere else. It is not mentioned in INSTALLATION, README.defines,
> not explained in the makefiles, not ifdef'd with any *_EX, and I
> couldn't find it mentioned in CHANGES, so I wonder where it's coming
> from.
It was in Hiroyuki's patch, so I think it was merged in 2-8-3dev.4.
It seems to have been originally written by Takeshi Sasaki:
http://katayama-www.cs.titech.ac.jp/~sasaki/lynx_jp/patches.html
(In Japanese)
> Anyway, it uses a hardwired IS_KANA macro that seems to be completely
> Shift_JIS specific. I think it should test the current display
> character set instead. Something like
>
> #define IS_KANA(c) (HTCJK==JAPANESE && current_char_set == SHIFT_JIS &&\
> 0xa0 <= c && c <= 0xdf)
I think you are right. And half width kana in EUC-JP should also
be handled.
# I defined the IS_SJIS_HWKANA and IS_EUC_HWKANA macros in HTCJK.h.
> Is it true that, once Japanese text is in the HText structure, it is
> always converted to the right d.c.s., i.e., either EUC-JP or Shift_JIS?
Yes, for almost all text, except the parts where Lynx fails to detect
the character set (I believe this is rare).
> But I don't understand who needs the SUPPORT_MULTIBYTE_EDIT code. It
> seems to me every CJK charset user should need it, is that not the case?
> If it's true, then there shouldn't even be a special macro
> SUPPORT_MULTIBYTE_EDIT. And it should of course not be Windows-specific.
I think so too.
Should it be ifdef'd with CJK_EX instead of SUPPORT_MULTIBYTE_EDIT?
> This leaves me wondering about the most basic functioning of line-editing
> with CJK display character sets. What happens when you delete, for
> example with the backspace key, one half of a multibyte character, without
> that special code? Does it work at all?
It doesn't work well.
We have to press the backspace key twice to delete a multibyte character.
After pressing it once, a different 1-byte character is shown.
> > There is a code for half width katakana, but we don't always have
> > fonts for it. So I think it's better Lynx converts half width katakana
> > into full width to display. If CJK_EX is defined,
> > Lynx actually does for almost all cases.
> > # Lynx didn't convert in the source mode, but my patch will improve it.
>
> Does that also apply to text/plain files?
Yes.
> You may also want to check the source mode variants, -preparsed and
> -prettysrc.
I tried. These options are a lot of fun! I found they work as I expected. :-)
> > It seems (WHEREIS search) highlighting works well if CJK_EX is defined,
> > but it doesn't if not defined because half width katakana can be
> > in the screen. I think Lynx should always convert half width katakana
> > into full width. Are there any side effects?
>
> Maybe with alignment of preformatted text, including text/plain?
Including text/plain.
> Should this be a separate (configuration?) option, rather than everything
> being covered by CJK_EX?
I'm not sure. I don't know whether defining CJK_EX causes bad effects
for non-CJK environments. If not, can CJK_EX be the default?
> Is it too hard to deal with half width katakana in all the necessary
> places, rather than "forbidding" it? I assume it would be a lot of
> work.
I don't know exactly. It may be a lot.
> But it may also be a lot of work to find all the places where conversion
> would need to take place. For example, a user may enter those characters
> in a form (or paste them in from the clipboard).
If a user can enter them in a form, that probably means he has
the fonts for them. I think it's better that the half width to full
width conversion doesn't happen in this case.
> > There may be other wrong effects with Lynx when a document includes
> > half width katakana. For example, which I found, Lynx fails to parse
> > TAGs in the below case.
> >
> > X<p> (Assume X is half width katakana)
> >
> > # Precisely speaking, half width katakana is one byte in Shift_JIS and
> > # is two byte in EUC-JP. Lynx fails only when it's written in Shift_JIS.
>
> But if there are half width characters in EUC-JP that are encoded as two
> bytes, the WHEREIS highlighting code should also fail. I don't understand
> how it can work; functions in LYStrings.c at least seem to assume that
> if (HTCJK != NOCJK) then an eightbit character is always the first of a
> pair (for example in LYmbcsstrlen and LYno_attr_mbcs_strstr).
If CJK_EX is defined, Lynx will convert half width kana to full
width kana in EUC-JP as well (2 bytes -> 2 bytes), and the WHEREIS
highlighting code will work properly.
> > +#if 0 /* This doesn't seemed to be valid code.
> > + * ref: http://www.isi.edu/in-notes/iana/assignments/character-sets
> > + */
> > #define IS_EUC_LOS(lo) ((0x21<=lo)&&(lo<=0x7E)) /* standard */
> > +#endif
>
> Could it be necessary for some of the other EUC (not -JP) codes?
> Or could it be an attempt to support (from the IANA list)
>
> Name: Extended_UNIX_Code_Fixed_Width_for_Japanese
> ...
> code set 3: JIS X0212-1990 (a double 7-bit byte set)
> restricted to A0-FF in
> the first byte
> and 21-7E in the second byte
I'm sorry, the comment isn't correct.
I only took care of EUC-JP code sets 0-2.
> > -#ifdef CJK_EX /* 1998/11/24 (Tue) 17:02:31 */
> > +#if 0 /* This should be a business of GridText */
>
> That's part of what I don't understand at all. :)
> (what should be whose business.)
I'm so sorry. This part converts half width kana to full width,
but there is an almost identical routine in HText_appendCharacter
in GridText.c. I think this conversion should be in
HText_appendCharacter, because the other conversion routines are
there and it's also called in source mode.
> > + if ((HTCJK==JAPANESE) && (context->state==S_in_kanji) &&
> > + !IS_JAPANESE_2BYTE(kanji_buf,(unsigned char)c)) {
> > +#if CJK_EX
> > + if (IS_SJIS_HWKANA(kanji_buf) && (last_kcode == SJIS)) {
> > + JISx0201TO0208_SJIS(kanji_buf, &sjis_hi, &sjis_lo);
> > + PUTC(sjis_hi);
> > + PUTC(sjis_lo);
> > + }
> > + else
> > + PUTC('=');
> > +#else
> > + PUTC('=');
> > +#endif
> > + context->state = S_text;
> > + }
>
> (This seems to be the place where the failure with
> > X<p> (Assume X is half width katakana)
> comes in, right?
Yes.
> > @@ -1744,6 +1761,7 @@
> > ** (see below). - FM
>
> The comment (of which this is last line) should also be changed,
> it says
> ** We could try to deal
> ** with it by holding each first byte and then checking
> ** byte pairs, but that doesn't seem worth the overhead
>
> so it doesn't apply any more...
Yes. But I can't understand why he said
"that doesn't seem worth the overhead", or what the overhead is.
> > */
> > context->state = S_text;
> > + PUTC(kanji_buf);
> > PUTC(c);
>
> You probably should also flush out the new kanji_buf in SGML_free
> (if it is a valid character). It could be the last character of
> a file. Of course that's rare, but it could even be valid HTML
> (</BODY></HTML> tags are not required).
I think flushing isn't needed. kanji_buf isn't used as a flag.
When a new document comes, kanji_buf is always set in the S_text block
before the parser enters the S_in_kanji block.
> > break;
> >
> > @@ -1772,7 +1790,7 @@
> > ** to having raw mode off with CJK. - FM
> > */
> > context->state = S_in_kanji;
> > - PUTC(c);
> > + kanji_buf = c;
> > break;
>
> This S_in_kanji handling is only done from S_text state.
> Would it need to be repeated for S_value, S_quoted, and S_dquoted
> to get attribute values right (for example, ALT text)? And maybe
> more of the states? Would it have to be duplicated in HTPlain.c
> for plain text and source view?
The range of the character code is shown below.
Shift_JIS
2byte characters
1st byte: 0x81-0x9F, 0xE0-0xFC
2nd byte: 0x40-0x7E, 0x80-0xFC
half width katakana (1byte)
0xA1-0xDF
EUC-JP
full width characters
1st byte: 0xA1-0xFE
2nd byte: 0xA1-0xFE
half width katakana
1st byte: 0x8E
2nd byte: 0xA1-0xDF
The problem is that some bytes of Shift_JIS characters are less than
0x80. But the range 0x40-0x7E doesn't include '"', '#', '&', '-', or
' ', which means we don't have to worry about parsing attribute values.
I'm not fully sure about this, so please tell me if I'm wrong.
I found the range 0x40-0x7E doesn't include '<' and '>' either.
Do I have to mention other code sets here?
> Hmm, so where does an explicit "charset" come in, if there is one?
> I.e., "text/html;charset=euc-jp" vs. "text/html;charset=shift_jis".
> It seems some things in SGML.c should depend on it, but I don't see
> it being considered at all.
It would be better if it were considered. But most Japanese pages don't
set a charset at all, so Lynx should handle such pages first.
Lynx is a somewhat worse guesser of Japanese encodings than
Netscape Navigator and Internet Explorer.
--
Takeshi Hataguchi
E-mail: address@hidden