[Aspell-user] Re: Feedback on our approach to Arabic
From: Mohammed Sameer
Subject: [Aspell-user] Re: Feedback on our approach to Arabic
Date: Fri, 10 Mar 2006 10:56:41 +0200
User-agent: Mutt/1.5.11+cvs20060126
Sorry for the delay,
I tried generating a sample wordlist and it worked fine. I don't really know
why we assumed that Aspell wouldn't work with Arabic, so that M. Elzubeir
started the Duali project and I forked Duali and coded Baghdad.
Well, I have a working spell checker implementation that uses the Duali data
set, which is originally the Buckwalter data set.
I can say that the set is not really accurate: it identified some misspelled
words as correct and failed to identify some correct words. I can accept it
not identifying all the correct words, but I can't accept it saying that
misspelled words are correct.
We have a lot of old words in Arabic that are not really used any more and,
being a native Arabic speaker, I don't think it's a good idea to list them in
our wordlist. If you use such words then you don't need a spell checker,
because your language background is definitely solid enough :-)
I don't like the Buckwalter data set because it contains some incorrect words
(of course it might be a problem in my implementation, but it might be a
problem with the data set itself) and because no one has really looked
through it and removed the old words.
My idea was to generate a reasonably authentic data set, but I don't have
enough *modern* Arabic text, and even if I did, who would check it for
errors? I'm a coder, not a linguist, and although I'm a native Arabic
speaker, my Arabic is not really that good and I don't have much time. All
the people out there complaining about the lack of an Arabic spell checker
didn't help with that part, and I can say that I'm stuck.
I'm willing to maintain the list, of course, but I'm really unable to
generate the initial one.
I can't tell you not to use the Buckwalter data set, since I don't have a
replacement for you even though I don't like it, and I know that I should
either do something or STFU.
Best regards,
On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
>
> Hi, Mohammed et al. Gokalp Yapici and I are also working on getting Arabic
> for Aspell. I thought we could share our plans to see if anybody wants to
> offer us helpful feedback.
> For character-set data, we started with the Farsi implementation in Aspell,
> which uses utf-8 as the word-list encoding and Windows Arabic as the
> internal encoding.
> For a word list, our plan is to use the data from Buckwalter's Arabic
> morphological analyzer -- the same data used in the Duali attempt at Arabic
> spell checking. This data has a complex specification of the structure of
> an Arabic word, which we'll need to translate into the simpler format
> required by Aspell.
> In Buckwalter's format, each stem, prefix, or suffix is a member of a stem,
> prefix, or suffix class. Three auxiliary files specify which prefix
> classes can connect to which stem classes; which stem classes can connect
> to which suffix classes; and which prefix classes are compatible with which
> suffix classes.
> If it weren't for that last file, this would be an easy problem: it would
> just be a matter of translating code names. Instead, we'll write perl
> scripts to recognize the easy translations (when no prefix/suffix
> combination is allowed, or all combinations are allowed), and do the easy
> thing. For the harder combinations (where some of the prefixes go to some
> of the suffixes) we'll expand out the prefixes or the suffixes (whichever
> there are fewer of), combining them with the stems as new "stem" entries.
> There are a total of 170 affix (suffix and prefix) classes to start with.
> We'll probably more than run out of Aspell class codes (they're limited to
> 255) with the new classes we're creating. If that's very severe, I'll see
> if we can't get Aspell updated to allow more suffix classes. Otherwise,
> we'll just explicitly expand out the combinations which lead to the fewest
> new entries in the stem list.
> What are some of the issues we haven't thought of? Any feedback is
> welcome!
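
The plan quoted above can be sketched roughly as follows. This is only an
illustration in Python (the thread itself mentions Perl scripts), and it makes
several assumptions: the three compatibility files are modelled as plain sets
of class pairs, all class names and the Latin placeholder strings are invented
toy data, and "fewer" is measured by number of classes. None of it is taken
from the actual Buckwalter files or from Aspell's affix format.

```python
from itertools import product

# Toy data invented for this sketch; none of it comes from the Buckwalter files.
PREFIXES = {"P0": [""], "P1": ["al"]}                  # prefix class -> strings
SUFFIXES = {"S0": [""], "S1": ["at"], "S2": ["un"]}    # suffix class -> strings
STEMS = [("kitab", "N")]                               # (stem, stem class)

# The three compatibility tables, modelled as sets of allowed class pairs.
PREFIX_STEM = {("P0", "N"), ("P1", "N")}
STEM_SUFFIX = {("N", "S0"), ("N", "S1"), ("N", "S2")}
PREFIX_SUFFIX = {("P0", "S0"), ("P0", "S1"), ("P0", "S2"),
                 ("P1", "S0"), ("P1", "S1")}           # ("P1", "S2") is missing


def sides(stem_class):
    """Prefix classes, suffix classes and allowed pairs for one stem class."""
    p = sorted(pc for pc, sc in PREFIX_STEM if sc == stem_class)
    s = sorted(xc for sc, xc in STEM_SUFFIX if sc == stem_class)
    allowed = {(pc, xc) for pc, xc in product(p, s) if (pc, xc) in PREFIX_SUFFIX}
    return p, s, allowed


def valid_words():
    """Every prefix+stem+suffix string that passes all three tables."""
    words = set()
    for stem, stem_class in STEMS:
        _, _, allowed = sides(stem_class)
        for pc, xc in allowed:
            for pre, suf in product(PREFIXES[pc], SUFFIXES[xc]):
                words.add(pre + stem + suf)
    return words


def classify(stem_class):
    """'none' and 'all' are the easy cases; 'partial' needs expansion."""
    p, s, allowed = sides(stem_class)
    if not allowed:
        return "none"
    return "all" if len(allowed) == len(p) * len(s) else "partial"


def expand_smaller_side(stem, stem_class):
    """Fold the side with fewer classes into the stem; keep the other as flags."""
    p, s, allowed = sides(stem_class)
    if len(p) <= len(s):
        return [(pre + stem, [xc for xc in s if (pc, xc) in allowed])
                for pc in p for pre in PREFIXES[pc]]
    return [(stem + suf, [pc for pc in p if (pc, xc) in allowed])
            for xc in s for suf in SUFFIXES[xc]]
```

With this toy data, classify("N") reports "partial" because one prefix/suffix
pair is disallowed, and expand_smaller_side("kitab", "N") folds the two prefix
classes into two new "stem" entries, each keeping only the suffix classes it
may still take.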
--
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature