[Aspell-user] Re: Feedback on our approach to Arabic
From: Mohammed Sameer
Subject: [Aspell-user] Re: Feedback on our approach to Arabic
Date: Sun, 12 Mar 2006 13:11:15 +0200
User-agent: Mutt/1.5.11+cvs20060126
On Sat, Mar 11, 2006 at 07:32:17PM -0800, Ethan Bradford wrote:
>
> I don't see having archaic words as a particular problem. It only reduces
> quality when a user misspells into one. Besides, some people might use
> them, and even if they know how to spell them, we don't want to bother them
> with spelling suggestions!
Yes, some people might use them, but if you do use them, I guess you don't
really need a spell checker.
I won't favor 10% of the people at the expense of the other 90%!
It doesn't make sense to me.
> I don't see that we have a lot of options besides the Buckwalter data at
> this moment. I think Arabic is too inflected to build a spell-checker from
> a straight word list.
True, I wouldn't consider a word list adequate unless it contains at least
1,000,000 words. That's _my_ personal estimate. I don't like the Buckwalter
approach because I think it contains incorrect words. That's bad for a spell
checker, and since I know how the Arabic community works, I'll tell you that
no one will ever review it, and I guess (not sure) that no one will ever
report an incorrect word.
If I were the one doing it, I wouldn't use the Buckwalter data when I know
that it contains errors. It's not the kind of thing I'd like to maintain. But
as I said before, I'll STFU since I have nothing to offer.
One of my friends is trying to get me modern Arabic files and spell check
them using, err, M$ Word (no flames please). It's just taking him a lot of
time.
Best regards,
> Speaking of testing, does anybody on this list have good advice on testing
> a new dictionary? Just the obvious?
Or perhaps how to start creating a dictionary?
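For what it's worth, one rough way to test a dictionary is to measure the two
failure modes separately: correct words it rejects, and misspelled words it
accepts. Below is a minimal Python sketch of that idea. The `evaluate`
function, the `accepts` callback and the transliterated sample words are all
hypothetical placeholders; a real test would call the actual spell checker
and use hand-verified Arabic word lists.

```python
def evaluate(accepts, correct_words, misspelled_words):
    """Measure both failure modes of a spell checker.

    accepts          -- callable returning True if the checker accepts a word
    correct_words    -- words known to be spelled correctly
    misspelled_words -- words known to be misspelled
    """
    false_rejects = [w for w in correct_words if not accepts(w)]
    false_accepts = [w for w in misspelled_words if accepts(w)]
    return {
        "false_rejects": false_rejects,
        "false_accepts": false_accepts,
        # max(..., 1) avoids dividing by zero on an empty test list
        "false_reject_rate": len(false_rejects) / max(len(correct_words), 1),
        "false_accept_rate": len(false_accepts) / max(len(misspelled_words), 1),
    }


# Toy stand-in for a real checker: a plain set lookup. The set deliberately
# contains one bad entry ("kitba") to mimic an error in the word list.
toy_dictionary = {"kitab", "qalam", "kitba"}

report = evaluate(
    accepts=lambda word: word in toy_dictionary,
    correct_words=["kitab", "qalam", "madrasa", "bab"],
    misspelled_words=["kitba", "qalm"],
)
print(report)
```

The false accepts are the ones I care about most: every entry in that list
is a wrong word sitting in the data set.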
>
> On 3/10/06, Mohammed Sameer <address@hidden> wrote:
>
> Sorry for the delay,
> I've tried generating a sample wordlist and it was fine. I don't really
> know why we assumed that aspell won't work with Arabic, and M. Elzubeir
> started the Duali project and I forked Duali and coded Baghdad.
> Well, I have a working spell checker implementation that uses the Duali
> data set, which is originally the Buckwalter data set.
> I can say that the set is not really accurate. It was identifying some
> misspelled words as correct and it was failing to identify correct words.
> While I can accept it not identifying all the correct words, I can't
> accept it saying that some misspelled words are correct.
> We have a lot of old words in Arabic that are not really used, and being a
> native Arabic speaker, I don't think that it's a good idea to list them in
> our wordlist.
> If you use such words then you don't need a spell checker, because your
> language background is definitely solid enough :-)
> I don't like the Buckwalter data set because it contains some incorrect
> words (of course it might be a problem in my implementation, but it might
> be a problem with the data set itself) and because no one has really had a
> look at it and removed old words.
> My idea was to generate a somewhat authentic data set, but I don't have
> enough *modern* Arabic text, and even if I did, who is going to check it
> for errors? I'm a coder, not a linguist, and although I'm a native Arabic
> speaker, my language is not really that good and I don't really have much
> time. All the people out there complaining about the lack of an Arabic
> spell checker didn't help with that part, and I can say that I'm stuck.
> I'm willing to maintain the list of course, but I'm really unable to
> generate the initial one.
> I can't tell you not to use the Buckwalter data set, as I don't have a
> replacement for you even if I don't like it, and I know that I should
> either do something or STFU.
> Best regards,
> On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
> >
> > Hi, Mohammed et al. Gokalp Yapici and I are also working on getting
> > Arabic for Aspell. I thought we could share our plans to see if anybody
> > wants to offer us helpful feedback.
> > For character-set data, we started with the Farsi implementation in
> > Aspell, which uses utf-8 as the word-list encoding and Windows Arabic as
> > the internal encoding.
> > For a word list, our plan is to use the data from Buckwalter's Arabic
> > morphological analyzer -- the same data used in the Duali attempt at
> > Arabic spell checking. This data has a complex specification of the
> > structure of an Arabic word, which we'll need to translate into the
> > simpler format required by Aspell.
> > In Buckwalter's format, each stem, prefix, or suffix is a member of a
> > stem, prefix, or suffix class. Three auxiliary files specify which
> > prefix classes can connect to which stem classes; which stem classes can
> > connect to which suffix classes; and which prefix classes are compatible
> > with which suffix classes.
> > If it weren't for that last file, this would be an easy problem: it
> > would just be a matter of translating code names. Instead, we'll write
> > Perl scripts to recognize the easy translations (when no prefix/suffix
> > combination is allowed, or all combinations are allowed), and do the
> > easy thing. For the harder combinations (where some of the prefixes go
> > to some of the suffixes) we'll expand out the prefixes or the suffixes
> > (whichever there are fewer of), combining them with the stems as new
> > "stem" entries.
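The strategy described above can be sketched roughly as follows. This is only
an illustration in Python (not the Perl scripts mentioned), and the class
names, transliterated affixes and function names are made up; Buckwalter's
real tables and Aspell's affix format are not reproduced here. It shows the
prefix side only; expanding suffixes instead would be symmetric.

```python
from itertools import product


def classify(prefix_classes, suffix_classes, compatible):
    """Say whether 'all', 'none' or 'some' prefix/suffix class pairs are allowed.

    compatible -- set of (prefix_class, suffix_class) pairs, i.e. the third
                  compatibility file restricted to one stem class
    """
    pairs = list(product(prefix_classes, suffix_classes))
    allowed = sum(1 for pair in pairs if pair in compatible)
    if allowed == len(pairs):
        return "all"    # easy case: every prefix may combine with every suffix
    if allowed == 0:
        return "none"   # easy case: prefixes and suffixes never combine
    return "some"       # hard case: needs explicit expansion


def expand_prefixes(stem, prefix_classes, suffix_classes, prefixes, compatible):
    """Fold every prefix into the stem, producing new "stem" entries.

    Each new entry keeps only the suffix classes that its prefix class is
    compatible with, so the prefix/suffix restriction is no longer needed.
    """
    entries = []
    for pclass in sorted(prefix_classes):
        ok_suffixes = sorted(
            s for s in suffix_classes if (pclass, s) in compatible
        )
        for prefix in prefixes[pclass]:
            entries.append((prefix + stem, ok_suffixes))
    return entries


# Hypothetical toy data: two prefix classes, two suffix classes.
prefixes = {"P1": ["al"], "P2": ["wa", "fa"]}
compatible = {("P1", "S1"), ("P2", "S1"), ("P2", "S2")}

case = classify({"P1", "P2"}, {"S1", "S2"}, compatible)
new_stems = expand_prefixes("kitab", {"P1", "P2"}, {"S1", "S2"},
                            prefixes, compatible)
print(case)
for word, suffix_classes in new_stems:
    print(word, suffix_classes)
```

The cost is visible even in the toy data: one stem becomes three entries, so
the expansion only pays off on whichever side has fewer affixes.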
> > There are a total of 170 affix (suffix and prefix) classes to start
> > with. We'll probably run well past the limit on Aspell class codes
> > (255) with the new classes we're creating. If that's very severe, I'll
> > see if we can't get Aspell updated to allow more suffix classes.
> > Otherwise, we'll just explicitly expand out the combinations which lead
> > to the fewest new entries in the stem list.
> > What are some of the issues we haven't thought of? Any feedback is
> > welcome!
> --
> GNU/Linux registered user #224950
> Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
> Life powered by Debian, Homepage: www.foolab.org
> --
> Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
> Read http://www.gnu.org/philosophy/no-word-attachments.html
> Preferable attachments: .PDF, .HTML, .TXT
> Thanx for adding this text to Your signature
--
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature