aspell-user

From: Kevin Atkinson
Subject: Re: [Aspell-user] What letters belongs to a word? (issue with Turkish "ı")
Date: Fri, 9 Sep 2011 19:46:58 -0600 (MDT)
User-agent: Alpine 2.00 (BSF 1167 2008-08-23)

On Fri, 9 Sep 2011, Daniel wrote:

Agustin Martin <agustin.martin <at> hispalinux.es> writes:

On Wed, Sep 07, 2011 at 09:47:01AM +0000, Daniel wrote:
When I run my aspell (through emacs), the Turkish letter "ı" is not
considered to be part of words; I'm given suggestions for "Bostanc",
when I actually wrote "Bostancı". I am given no suggestion for
"İstanbulı", since "ı" actually _is_ recognized as part of the
word... How can I tell aspell to include "ı" in words?

Which aspell and emacs version are you using?

This is emacs 23.3.1 and aspell 0.60.6.1.

To clarify, it is the Turkish "lowercase dotless i" that Aspell doesn't
recognize as part of a word. This happens when I run Aspell on my text
using an English dictionary (from the aspell-en package), both from
within Emacs and with Aspell alone. I do pass --encoding=UTF-8, but it
doesn't seem necessary (it detects my locale, right?).

But when I try with a Turkish dictionary, it does work. Then the
dotless-i is indeed part of the word. Probably because tr.dat has
"charset iso8859-9", while en.dat has "charset iso8859-1". I hadn't
looked in the dat-files before. But this is a bit silly; I am using UTF-8!
Is this an artifact of the old non-UTF-8 Aspell, so that it still needs
to tie a (narrow) character set to the dictionary it is spelling with?

More or less. Please see http://aspell.net/man-html/Notes-on-8_002dbit-Characters.html. That being said, the problem you are facing is not just that the dictionary is 8-bit, but also that I convert the document to the 8-bit encoding before I tokenize it. The latter is something I plan to fix eventually.

If you really want Aspell to recognize Turkish words when using the English dictionary, then you can try the attached special character set. Unzip the contents into the directory printed by `aspell config data-dir`, then change "charset iso8859-1" to "charset iso8859-1-u" in en.dat.
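
To illustrate why converting to the 8-bit encoding before tokenizing loses the dotless i, here is a rough Python sketch. It is not Aspell's code: the helper name tokenize_via_8bit is made up, and replacing unrepresentable characters with "?" is only an assumption chosen to make the effect visible.

```python
import re

def tokenize_via_8bit(text, charset):
    """Convert text to an 8-bit charset, then split it into words.

    Characters the charset cannot represent become '?'. That substitution
    is an assumption for illustration; Aspell's conversion may differ.
    """
    lossy = text.encode(charset, errors="replace").decode(charset)
    # Treat a "word" as a run of letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", lossy)

# ISO-8859-1 has no dotless i, so the word is cut short.
print(tokenize_via_8bit("Bostancı", "iso8859-1"))  # ['Bostanc']
# ISO-8859-9 does have it, so the word survives intact.
print(tokenize_via_8bit("Bostancı", "iso8859-9"))  # ['Bostancı']
```

With the English charset the tokenizer only ever sees "Bostanc", which matches the truncated word reported earlier in the thread.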

However, even if Aspell did recognize the word correctly, it would be unlikely to do what you want when using the English dictionary, because special rules are needed to handle the Turkish ı when changing case.
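
As a small illustration of the case problem, the Python sketch below compares the default (language-neutral) Unicode case mapping with a Turkish-aware one. The helpers tr_upper and tr_lower are hypothetical names written for this example; they are not part of Aspell or of any library.

```python
# Turkish pairs: I <-> ı (dotless) and İ <-> i (dotted).
_TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
_TR_LOWER = str.maketrans({"İ": "i", "I": "ı"})

def tr_upper(s):
    """Uppercase using the Turkish i/ı rules (hypothetical helper)."""
    return s.translate(_TR_UPPER).upper()

def tr_lower(s):
    """Lowercase using the Turkish i/ı rules (hypothetical helper)."""
    return s.translate(_TR_LOWER).lower()

# Default mapping: ı uppercases to I, but I lowercases to i,
# so a round trip silently changes the letter.
print("ı".upper())          # I
print("ı".upper().lower())  # i

# Turkish-aware mapping keeps the round trip stable.
print(tr_upper("bostancı"))            # BOSTANCI
print(tr_lower(tr_upper("bostancı")))  # bostancı
```

An English dictionary is processed with the default rules, so a capitalized or all-caps Turkish word would not map back to the form that contains ı.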

Attachment: iso-8859-1-u.zip
Description: Zip archive

