"Ispell misalignment" encountering stray non-ASCII word in UTF-8 text do

bug-gnu-emacs

From:	Jesse Glick
Subject:	"Ispell misalignment" encountering stray non-ASCII word in UTF-8 text doc
Date:	Tue, 18 Nov 2003 13:44:31 -0500
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031016

This bug report will be sent to the Free Software Foundation,
not to your local site managers!
Please write in English, because the Emacs maintainers do not have
translators to read other languages for them.

Your bug report will be posted to the bug-gnu-emacs@gnu.org mailing list,
and to the gnu.emacs.bug news group.

In GNU Emacs 21.3.1 (i386-redhat-linux-gnu, X toolkit, Xaw3d scroll bars)
 of 2003-06-05 on daffy.perf.redhat.com

configured using `configure --host=i386-redhat-linux --build=i386-redhat-linux --target=i386-redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --with-gcc --with-pop --with-sound'

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8
  default-enable-multibyte-characters: t

Please describe exactly what actions triggered the bug
and the precise symptoms of the bug:

0. System locale uses UTF-8:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

1. Create a text file somefile.txt with contents:

---%<--- (non-ASCII chars marked with <<...>>)
This is spelled correctly.
This is correct but has uses a word from <<\v{c}>>e<<\v{s}>>tina (Czech).
The word was encoded in UTF-8.
Back to English again.
---%<---

(Most easily: set the input method to "czech", then press 4 and 3, respectively, for the two non-ASCII chars.)

Specifically, using Latin Extended-A characters in UTF-8 encoding, hexl-find-file shows:


---%<---
00000000: 5468 6973 2069 7320 7370 656c 6c65 6420  This is spelled
00000010: 636f 7272 6563 746c 792e 0a54 6869 7320  correctly..This
00000020: 6973 2063 6f72 7265 6374 2062 7574 2068  is correct but h
00000030: 6173 2075 7365 7320 6120 776f 7264 2066  as uses a word f
00000040: 726f 6d20 c48d 65c5 a174 696e 6120 2843  rom ..e..tina (C
00000050: 7a65 6368 292e 0a54 6865 2077 6f72 6420  zech)..The word
00000060: 7761 7320 656e 636f 6465 6420 696e 2055  was encoded in U
00000070: 5446 2d38 2e0a 4261 636b 2074 6f20 456e  TF-8..Back to En
00000080: 676c 6973 6820 6167 6169 6e2e 0a         glish again..
---%<---
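
As a cross-check of the dump above, here is a minimal Python sketch (not part of the original report) that rebuilds the file contents and confirms the byte sequences and offsets. The variable names are illustrative only.

```python
# The two non-ASCII characters from step 1, written as escapes.
C_CARON = "\u010d"  # LATIN SMALL LETTER C WITH CARON
S_CARON = "\u0161"  # LATIN SMALL LETTER S WITH CARON

text = (
    "This is spelled correctly.\n"
    f"This is correct but has uses a word from {C_CARON}e{S_CARON}tina (Czech).\n"
    "The word was encoded in UTF-8.\n"
    "Back to English again.\n"
)
data = text.encode("utf-8")

# Each of the two characters occupies two bytes in UTF-8.
c_bytes = C_CARON.encode("utf-8")
s_bytes = S_CARON.encode("utf-8")
print(c_bytes.hex(), s_bytes.hex())   # c48d c5a1, as in the dump

# Character count and byte count differ by the two extra bytes.
print(len(text), len(data))           # 139 141

# Before the foreign word the offsets agree; after it they diverge.
print(hex(data.index(c_bytes)))       # 0x44, matching row 00000040
char_off = text.index("tina")
byte_off = data.index(b"tina")
print(char_off, byte_off)             # 71 73
```

Any component that counts bytes while another counts characters will therefore disagree about positions from the middle of the second line onward.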

2. aspell is installed:

$ rpm -qa | fgrep spell
gnome-spell-0.5-1.ximian.3
aspell-0.33.7-1
pspell-0.12.2-3

ispell-program-name's value is "aspell"

3. Run M-x ispell-buffer and select somefile.txt

4. The spell checking aborts on the foreign word:

---%<---
Spell checking somefile.txt using default dictionary...
Spell-checking done

ispell-process-line: Ispell misalignment: word `Bh' point 71; probably incompatible versions

---%<---

Clearly I do not expect Aspell to have the foreign word in its
dictionary. It would be fine if it did not even recognize the word as
UTF-8 and instead treated it as ISO-8859-1 (the word would be
garbled, but that would not matter). The problem is that spell
checking is aborted with a cryptic, and probably wrong, error message,
with no apparent way to skip the problematic part of the document and
continue with the otherwise English ASCII text, which is what I
actually want checked.
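
To make the requested behavior concrete, here is a minimal Python sketch of skipping words that contain non-ASCII characters and checking only the rest. This is a hypothetical illustration, not something ispell.el or aspell is known to do; `ascii_words` and `WORD_RE` are made-up names.

```python
import re

# Runs of letters (any script); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")

def ascii_words(text):
    """Yield (offset, word) for every word made only of ASCII letters.

    Words containing any non-ASCII character are skipped entirely, so a
    checker fed from this generator never sees the foreign word.
    """
    for match in WORD_RE.finditer(text):
        word = match.group()
        if word.isascii():
            yield match.start(), word

sample = "from \u010de\u0161tina (Czech)."
print(list(ascii_words(sample)))
```

With this kind of filter the foreign word would simply go unchecked, while the surrounding English words would still be passed on with their character offsets.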

ispell.el has:

---%<---
(if (/= (+ word-len (point))
        (progn
          ;; NB: Search can fail with Mule character sets that don't
          ;;  display properly.  Ignore the error in this case?
          (search-forward (car poss) (+ word-len (point)) t)
          (point)))
    ;; This occurs due to filter pipe problems
    (error (concat "Ispell misalignment: word "
                   "`%s' point %d; probably incompatible versions")
           (car poss) (marker-position word-start)))
---%<---
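
The check in that excerpt can be paraphrased in a few lines of Python. This is a simplified sketch for illustration, not a translation of ispell.el: it ignores case folding, uses 0-based offsets instead of Emacs's 1-based point, and `check_alignment` is a made-up name.

```python
def check_alignment(buffer_text, word_start, reported_word):
    """Raise if reported_word does not appear exactly at word_start.

    Mirrors the idea of the quoted check: the word reported by the
    spell checker must be found in the buffer at the expected position,
    ending exactly len(reported_word) characters later.
    """
    end = word_start + len(reported_word)
    if buffer_text[word_start:end] != reported_word:
        raise RuntimeError(
            f"Ispell misalignment: word `{reported_word}' point {word_start}; "
            "probably incompatible versions"
        )

line = "from \u010de\u0161tina (Czech)."

# The word as it appears in the buffer: the check passes silently.
check_alignment(line, 5, "\u010de\u0161tina")

# A word that is not in the buffer at that position, such as the `Bh'
# from the error message above: the check fails.
try:
    check_alignment(line, 5, "Bh")
except RuntimeError as err:
    print(err)
```

Seen this way, the error only says that the word handed back by the spell checker could not be found at the expected buffer position; the "incompatible versions" wording is a guess built into the message.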

Command-line output from aspell, by the way, is reasonable and handles
the error gracefully; it is only the Emacs integration that fails:

---%<--- (run from M-x shell, non-ASCII parts clipped)
$ aspell -c /tmp/somefile.txt
This is correct but has uses a word from <<*GARBLED*>>tina (Czech).

1) ea       6) e
2) ear      7) EGA
3) A        8) EPA
4) E        9) ERA
5) a        0) ETA
i) Ignore   I) Ignore all
r) Replace  R) Replace all
a) Add      x) Exit

? i

This is correct but has uses a word from <<GARBLED>>*tina* (Czech).

1) Tina     6) tins
2) Tiena    7) Rina
3) tuna     8) Ina
4) tin      9) Dina
5) Trina    0) Tana
i) Ignore   I) Ignore all
r) Replace  R) Replace all
a) Add      x) Exit

?
The word was encoded in *UTF*-8.

1) UT       6) TUG
2) UHF      7) IT
3) UTA      8) UFO
4) UTE      9) UFA
5) TUFT     0) IF
i) Ignore   I) Ignore all
r) Replace  R) Replace all
a) Add      x) Exit

? i

---%<---

Thanks for your time and a great editor,
-Jesse

--
Jesse Glick <mailto:jesse.glick@sun.com> x22801
NetBeans, Open APIs  <http://www.netbeans.org/>
