Re: [Aspell-user] Configuring spell check in mult language documents
From: Mahesh T. Pai
Subject: Re: [Aspell-user] Configuring spell check in mult language documents
Date: Sat, 9 Jul 2011 00:19:20 +0530
User-agent: Mutt/1.5.21 (2010-09-15)
Carlo Traverso said on Fri, Jul 08, 2011 at 08:24:54PM +0200:
> aspell list -l lang1 | aspell list -l lang2
That would take the words out of their context, no?
> I did not check the hindi dictionaries, but probably hindi accepts
> both latin and hindi characters as word components (this is how
> ancient greek, grc, does). The solution of your problem could be to
> define a variant of hindi that only accepts hindi characters.
AFAICT, no. Especially if you are putting that in the linguistic
sense.
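
As an illustration of the "one script per language" idea discussed above, here is a minimal Python sketch that splits mixed text into same-script runs while recording each run's offset, so the words keep a link to their place in the document. It does not use aspell at all; the helper names `script_of` and `script_runs` are made up for this example, and taking the first word of the Unicode character name as the script label is a rough assumption, not a proper script lookup.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's Unicode name.

    Letters and combining marks (e.g. Devanagari vowel signs) get a label
    such as "LATIN" or "DEVANAGARI"; everything else is "OTHER".
    """
    is_mark = unicodedata.category(ch).startswith("M")
    if not (ch.isalpha() or is_mark):
        return "OTHER"
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(text: str):
    """Yield (script, start_offset, run) for each run of same-script letters."""
    pos = 0
    for script, group in groupby(text, key=script_of):
        run = "".join(group)
        if script != "OTHER":
            yield script, pos, run
        pos += len(run)


# Hypothetical sample: an English word, a Devanagari word, and a number.
sample = "budget \u092c\u091c\u091f 2011"
for script, offset, run in script_runs(sample):
    print(script, offset, run)
```

Each run could then be handed to the checker for the matching language, and the offset says where in the original text the word came from.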
Hindi (and most Indic languages) use characters outside the 8-bit
range: their code points fit in 16 bits, and UTF-8 encodes each one
as a multi-byte sequence.
I suspect that the difficulties mentioned by Kevin have more to do
with aspell being "internally 8 bit", as Kevin put it some months back.
Probably, the difficulty is in distinguishing between a few bytes of
8-bit characters followed by a few bytes of 16-bit characters. Of
course, I am no expert or even a programmer, and I may be way off the
mark.
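
To make the byte-width point concrete, here is a small Python sketch of how many bytes UTF-8 uses for a Latin letter compared with a Devanagari or Malayalam letter. The sample characters and the helper name `utf8_width` are chosen only for this illustration; nothing here reflects how aspell handles text internally.

```python
# Sample characters, chosen for illustration only.
samples = {
    "Latin a": "a",              # U+0061
    "Devanagari ka": "\u0915",   # U+0915 DEVANAGARI LETTER KA
    "Malayalam ma": "\u0d2e",    # U+0D2E MALAYALAM LETTER MA
}


def utf8_width(ch: str) -> int:
    """Return the number of bytes UTF-8 needs to encode one character."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {utf8_width(ch)} byte(s)")
```

The Latin letter takes one byte and the two Indic letters take three bytes each, so a mixed English and Hindi line is a stream of one-byte and three-byte sequences that a byte-oriented program has to tell apart.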
If you want a look at the kind of documents we have in mind, have a
look at
http://finance.kerala.gov.in/
index.php?option=com_docman&task=doc_download&gid=3047&Itemid=34
(watch out for the broken line; the URL is split to avoid problems in
mailers)
That is a PDF file, with both English and Malayalam script. We use
plenty of documents like that. The PDF itself is unlikely to use
UTF-8, so do not use it as an example for anything except the visual
representation of the text.
--
Mahesh T. Pai ||
DICTIONARY, n. A malevolent literary device for cramping the
growth of a language and making it hard and inelastic.