bug-gnu-emacs

bug#7781: 23.2.91; ispell problem with hunspell and UTF-8 file


From: Agustin Martin
Subject: bug#7781: 23.2.91; ispell problem with hunspell and UTF-8 file
Date: Fri, 7 Jan 2011 14:14:03 +0100

2011/1/4 Reuben Thomas <rrt@sc3d.org>:
> With the following text, and using emacs -Q, I get the errors you can
> see in the messages log below when using hunspell to spell-check a UTF-8
> buffer with some extended characters in it.
>
> I did test this with emacs -Q, but the current session, in which I
> reproduced the problem and am now composing this bug report, was not
> started with -Q (this is so submitting the bug report works properly!).
>
> I am running a freshly bzr-pulled build of the emacs-23 branch.

Hi, Reuben,

I can also reproduce this with emacs23.2. After splitting the original
lines, I could narrow the problems down to two lines,

-- Cut here -- 8< ----- minimal.txt: utf-8
of out-of-copyright works. The Kindle may be a loss leader, but at £109
it’s still not cheap. Feedbooks, rather than integrating easily into
-- Cut here -- 8< ----- End of minimal.txt

In the first line, the currency symbol seems to cause conversion errors
when iso-8859-1 is used, although hunspell should simply have ignored
it. I get tons of

UTF-8 encoding error. Missing continuation byte in 0. character position:

for that line when using

$ cat minimal.txt | hunspell -d en_US -a -i iso-8859-1
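
As a rough illustration of the mismatch (a sketch only, and an
assumption about the cause rather than anything taken from hunspell's
code): the file stores the pound sign as two UTF-8 bytes, so a reader
told to expect iso-8859-1 sees two characters instead of one.

```python
# The pound sign as it is stored in the UTF-8 file minimal.txt.
pound_utf8 = "\u00a3".encode("utf-8")

# The same bytes read as iso-8859-1: each byte becomes its own character.
as_latin1 = pound_utf8.decode("iso-8859-1")

print(pound_utf8)      # the two raw bytes
print(len(as_latin1))  # number of characters seen under iso-8859-1
```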

In the second line, the unusual apostrophe seems to confuse hunspell
when utf8 is used. Comparing what aspell and hunspell give for similar
text, I get

$ cat minimal.txt | aspell --encoding=utf-8 -d en_US -a
& Feedbooks 6 22: Feed books, Feed-books, Feedback's, Feedbags, ...

$ cat minimal.txt | hunspell -d en_US -i utf-8 -a
& Feedbooks 8 24: Feed books, Feed-books, Feedback, Feedbags, ...

Do not worry about the first number; it is the number of suggestions.
However, the position given in the second number differs. It seems that
hunspell is not counting that apostrophe as a single (multibyte) char,
but as three components.
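
A quick way to check that reading (a minimal sketch, assuming the
offsets are counted from the start of the line and that the apostrophe
is U+2019, which takes three bytes in UTF-8) is to compare the
character offset of the word with its byte offset:

```python
# Second line of minimal.txt; \u2019 is the typographic apostrophe.
line = "it\u2019s still not cheap. Feedbooks, rather than integrating easily into"
word = "Feedbooks"

# Offset counted in characters (the apostrophe counts once).
char_offset = line.index(word)

# Offset counted in UTF-8 bytes (the apostrophe counts three times).
byte_offset = line.encode("utf-8").index(word.encode("utf-8"))

print(char_offset, byte_offset)  # 22 24
```

The two values match the positions reported by aspell and hunspell
above, which is consistent with hunspell counting bytes rather than
characters here.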

This looks to me like a hunspell bug. I found no reference to this
problem on the hunspell sf site, but noticed that Hunspell 1.2.14 was
released yesterday. I need to check whether it has any related news.

-- 
Agustin




