Re: lynx-dev changes for Japanese (was: dev.16 patch)
From: Hataguchi Takeshi
Subject: Re: lynx-dev changes for Japanese (was: dev.16 patch)
Date: Tue, 14 Dec 1999 00:04:38 +0900 (JST)
Thank you very much for your comment.
On Sun, 12 Dec 1999, Klaus Weide wrote:
> What I was specifically thinking of was the discovery of
> SUPPORT_MULTIBYTE_EDIT. (I was recently looking at LYStrings.c to
> get an idea how hard it would be to add better support for UTF-8 to
> the line-editor. Hadn't paid much attention to those #ifdef'd sections
> there before - there's so much of it, I tend to blend it out.)
> SUPPORT_MULTIBYTE_EDIT seems to be defined only in two of the
> Windows-specific makefiles, makefile.msc and makefile.bcb, and
> nowhere else. It is not mentioned in INSTALLATION, README.defines,
> not explained in the makefiles, not ifdef'd with any *_EX, and I
> couldn't find it mentioned in CHANGES, so I wonder where it's coming
> from.
It was in Hiroyuki's patch, so I think it was merged in 2-8-3dev.4.
It seems to have been originally written by Takeshi Sasaki:
http://katayama-www.cs.titech.ac.jp/~sasaki/lynx_jp/patches.html
(In Japanese)
> Anyway, it uses a hardwired IS_KANA macro that seems to be completely
> Shift_JIS specific. I think it should test the current display
> character set instead. Something like
>
> #define IS_KANA(c) (HTCJK==JAPANESE && current_char_set == SHIFT_JIS &&\
> 0xa0 <= c && c <= 0xdf)
I think you are right. And half width kana in EUC-JP should also
be handled.
# I defined the IS_SJIS_HWKANA and IS_EUC_HWKANA macros in HTCJK.h.
> Is it true that, once Japanese text is in the HText structure, it is
> always converted to the right d.c.s., i.e., either EUC-JP or Shift_JIS?
Yes, for almost all text, except the parts where Lynx fails to detect
the character set (I believe this is rare).
> But I don't understand who needs the SUPPORT_MULTIBYTE_EDIT code. It
> seems to me every CJK charset user should need it, is that not the case?
> If it's true, then there shouldn't even be a special macro
> SUPPORT_MULTIBYTE_EDIT. And it should of course not be Windows-specific.
I think so too.
Should it be ifdef'd with CJK_EX instead of SUPPORT_MULTIBYTE_EDIT?
> This leaves me wondering about the most basic functioning of line-editing
> with CJK display character sets. What happens when you delete, for
> example with the backspace key, one half of a multibyte character, without
> that special code? Does it work at all?
It doesn't work well.
We have to press the backspace key twice to delete a multibyte character.
After pressing it once, a different 1-byte character is shown.
> > There is a code for half width katakana, but we don't always have
> > fonts for it. So I think it's better Lynx converts half width katakana
> > into full width to display. If CJK_EX is defined,
> > Lynx actually does for almost all cases.
> > # Lynx didn't convert in the source mode, but my patch will improve it.
>
> Does that also apply to text/plain files?
Yes.
> You may also want to check the source mode variants, -preparsed and
> -prettysrc.
I tried. These options are a lot of fun! I found they work as I expected. :-)
> > It seems (WHEREIS search) highlighting works well if CJK_EX is defined,
> > but it doesn't if not defined because half width katakana can be
> > in the screen. I think Lynx should always convert half width katakana
> > into full width. Are there any side effects?
>
> Maybe with alignment of preformatted text, including text/plain?
Including text/plain.
> Should this be a separate (configuration?) option, rather than everything
> being covered by CJK_EX?
I'm not sure. I don't know whether defining CJK_EX causes bad effects
for non-CJK environments. If not, can CJK_EX be the default?
> Is it too hard to deal with half width katakana in all the necessary
> places, rather than "forbidding" it? I assume it would be a lot of
> work.
I don't know exactly. It may be a lot.
> But it may also be a lot of work to find all the places where conversion
> would need to take place. For example, a user may enter those characters
> in a form (or paste them in from the clipboard).
If a user can enter them in a form, that probably means he has
the fonts for them. I think it's better that the half width to full
width conversion doesn't happen in this case.
> > There may be other wrong effects with Lynx when a document includes
> > half width katakana. For example, which I found, Lynx fails to parse
> > TAGs in the below case.
> >
> > X<p> (Assume X is half width katakana)
> >
> > # Precisely speaking, half width katakana is one byte in Shift_JIS and
> > # is two byte in EUC-JP. Lynx fails only when it's written in Shift_JIS.
>
> But if there are half width characters in EUC-JP that are encoded as two
> bytes, the WHEREIS highlighting code should also fail. I don't understand
> how it can work; functions in LYStrings.c at least seem to assume that
> if (HTCJK != NOCJK) then an eightbit character is always the first of a
> pair (for example in LYmbcsstrlen and LYno_attr_mbcs_strstr).
If CJK_EX is defined, Lynx will convert half width kana to full
width kana in EUC-JP as well (2 bytes -> 2 bytes), and the WHEREIS
highlighting code will work properly.
> > +#if 0 /* This doesn't seemed to be valid code.
> > + * ref: http://www.isi.edu/in-notes/iana/assignments/character-sets
> > + */
> > #define IS_EUC_LOS(lo) ((0x21<=lo)&&(lo<=0x7E)) /* standard */
> > +#endif
>
> Could it be necessary for some of the other EUC (not -JP) codes?
> Or could it be an attempt to support (from the IANA list)
>
> Name: Extended_UNIX_Code_Fixed_Width_for_Japanese
> ...
> code set 3: JIS X0212-1990 (a double 7-bit byte set)
> restricted to A0-FF in
> the first byte
> and 21-7E in the second byte
I'm sorry, the comment isn't correct.
I only took care of EUC-JP code sets 0-2.
> > -#ifdef CJK_EX /* 1998/11/24 (Tue) 17:02:31 */
> > +#if 0 /* This should be a business of GridText */
>
> That's part of what I don't understand at all. :)
> (what should be whose business.)
I'm so sorry. This part converts half width kana to full width,
but there is an almost identical routine in HText_appendCharacter
in GridText.c. I think this conversion should be in
HText_appendCharacter, because the other conversion routines are
there and it's also called in source mode.
> > + if ((HTCJK==JAPANESE) && (context->state==S_in_kanji) &&
> > + !IS_JAPANESE_2BYTE(kanji_buf,(unsigned char)c)) {
> > +#if CJK_EX
> > + if (IS_SJIS_HWKANA(kanji_buf) && (last_kcode == SJIS)) {
> > + JISx0201TO0208_SJIS(kanji_buf, &sjis_hi, &sjis_lo);
> > + PUTC(sjis_hi);
> > + PUTC(sjis_lo);
> > + }
> > + else
> > + PUTC('=');
> > +#else
> > + PUTC('=');
> > +#endif
> > + context->state = S_text;
> > + }
>
> (This seems to be the place where the failure with
> > X<p> (Assume X is half width katakana)
> comes in, right?
Yes.
> > @@ -1744,6 +1761,7 @@
> > ** (see below). - FM
>
> The comment (of which this is last line) should also be changed,
> it says
> ** We could try to deal
> ** with it by holding each first byte and then checking
> ** byte pairs, but that doesn't seem worth the overhead
>
> so it doesn't apply any more...
Yes. But I can't understand why he said
"that doesn't seem worth the overhead", or what the overhead is.
> > */
> > context->state = S_text;
> > + PUTC(kanji_buf);
> > PUTC(c);
>
> You probably should also flush out the new kanji_buf in SGML_free
> (if it is a valid character). It could be the last character of
> a file. Of course that's rare, but it could even be valid HTML
> (</BODY></HTML> tags are not required).
I think flushing isn't needed. kanji_buf isn't used as a flag.
When a new document comes, kanji_buf is always set in the S_text block
before the parser enters the S_in_kanji block.
> > break;
> >
> > @@ -1772,7 +1790,7 @@
> > ** to having raw mode off with CJK. - FM
> > */
> > context->state = S_in_kanji;
> > - PUTC(c);
> > + kanji_buf = c;
> > break;
>
> This S_in_kanji handling is only done from S_text state.
> Would it need to be repeated for S_value, S_quoted, and S_dquoted
> to get attribute values right (for example, ALT text)? And maybe
> more of the states? Would it have to be duplicated in HTPlain.c
> for plain text and source view?
The range of the character code is shown below.
Shift_JIS
2byte characters
1st byte: 0x81-0x9F, 0xE0-0xFC
2nd byte: 0x40-0x7E, 0x80-0xFC
half width katakana (1byte)
0xA1-0xDF
EUC-JP
full width characters
1st byte: 0xA1-0xFE
2nd byte: 0xA1-0xFE
half width katakana
1st byte: 0x8E
2nd byte: 0xA1-0xDF
The problem is that some bytes of Shift_JIS characters are less than
0x80. But the range 0x40-0x7E doesn't include '"', '#', '&', '-', or
' ', which means we don't have to worry about parsing attribute values.
I'm not fully sure about this, so please tell me if I'm wrong.
I found the range 0x40-0x7E doesn't include '<' and '>' either.
Do I have to mention other code sets here?
> Hmm, so where does an explicit "charset" come in, if there is one?
> I.e., "text/html;charset=euc-jp" vs. "text/html;charset=shift_jis".
> It seems some things in SGML.c should depend on it, but I don't see
> it being considered at all.
It would be better if it were considered. But most Japanese pages don't
set a charset at all, so Lynx should handle such pages first.
Lynx is a somewhat worse guesser of Japanese encodings than
Netscape Navigator and Internet Explorer.
--
Takeshi Hataguchi
E-mail: address@hidden