aspell-user

[Aspell-user] Re: special chars getting counted as a word


From: Aaron Miller
Subject: [Aspell-user] Re: special chars getting counted as a word
Date: Wed, 15 Aug 2007 23:08:46 -0700

I think my last email went to the wrong place. If not, I apologize.
Here it is...



On 8/15/07, Aaron Miller <address@hidden> wrote:
> I agree, that would be the ideal solution. The problem I'm facing with
> that, however, is working out the exact parsing algorithm aspell uses.
> So far I've just been trying to "reverse engineer" it by trial and
> error. I thought I had it pretty much figured out until this popped up.
>
> Is there any information out there that gives the exact algorithm (the
> name of the source file would do), or even better, a regular
> expression to use? I tried looking through the docs but didn't find
> anything except for some info on 8 bit chars. It said that it will
> convert UTF-8 chars to 8 bit chars. Maybe this is why it was counted
> as a word?
>
> Well I will keep plugging away at it. Any suggestions will be greatly
> appreciated.
>
> Thanks!
> ...aaron
>
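The UTF-8 guess above can be checked with a short sketch. It assumes the long dash was stored as UTF-8 and that the text was then read one byte per character (Latin-1 is used here as a stand-in 8-bit encoding). The regex is only an illustrative "runs of letters" tokenizer, not aspell's actual algorithm, and words_as_8bit is a made-up helper name.

```python
import re

def words_as_8bit(text):
    """Split text into letter runs after reading its UTF-8 bytes as Latin-1."""
    raw = text.encode("utf-8")       # a long dash becomes three bytes (e2 80 9x)
    seen = raw.decode("latin-1")     # one character per byte, as an 8-bit reader would see it
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", seen)

# Assumed sample line: the dash is written as an escape (U+2014, EM DASH).
sample = "scope \u2014 geniral usage"
print(words_as_8bit(sample))
```

Under these assumptions the first byte of the dash (0xE2) reads as the letter "â", so the line yields four tokens rather than three, which would match the four result lines aspell printed. An en dash (U+2013) starts with the same byte and behaves the same way.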
> On 8/14/07, address@hidden <address@hidden> wrote:
> > Hello,
> >
> > I see in your sample url following:
> > scope â€" geniral usage
> >
> > So for me aspell works (almost) perfectly: it counts
> > scope
> >  â€"
> > usage
> >  as correct
> > and geniral as an error
> >
> > In my opinion your algorithm should consider
> >  â€" as a word, and that would fix the problem.
> >
> > -eleonora
> >
> >
> > > Hello,
> > >
> > > I am playing around with aspell as a server side spell checker for a
> > > flash application. It works beautifully (and fast as hell too!), but I
> > > did notice one little oddity that I haven't been able to find an
> > > explanation for in the docs.
> > >
> > > The problem happens when there is a special character in the text. I
> > > am not sure which special characters cause my word-counting algorithm
> > > to fail, but here is an example of one that broke it (one of those
> > > long dashes, in some text copied from a wiki).
> > >
> > > http://labs.splashlabs.com/spellcheck/1186978249
> > >
> > > When I pipe the above file through aspell (en_US), I get back the result:
> > >
> > > aspell -a < 1186978249
> > > @(#) International Ispell Version 3.1.20 (but really Aspell 0.50.5)
> > > *
> > > *
> > > & geniral 5 10: general, genital, genial, generally, generals
> > > *
> > >
> > > So it appears to count the lone character as a word. In my own
> > > program, I have to count words to find the start and end char indexes
> > > of the incorrect word. Since my algorithm does not count it as a word,
> > > my word count ends up off.
> > >
> > > Are there any options I can pass to prevent it from being counted? Or
> > > is there a way to find out exactly what is counted as a word so I can
> > > match my own regex to it?
> > >
> > > Thanks for any advice!
> > > ...aaron
> >
>
>
> --
> Aaron Miller
> Chief Technology Officer
> Splash Labs, LLC.
> address@hidden  |  206-328-5485
> http://www.splashlabs.com
>
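One way to sidestep word counting altogether is to read the position straight from the "& geniral 5 10:" line in the output quoted above. In the ispell-style pipe format the fields are the original word, the number of suggestions, and the offset of the word in the input line. The sketch below assumes that offset is a 0-based byte offset, which fits the sample if the dash was three UTF-8 bytes but is not confirmed anywhere in this thread. The function name and the sample line are made up for illustration.

```python
import re

# "& geniral 5 10: general, genital, ..." -> word, suggestion count, offset, suggestions
MISS = re.compile(r"^& (\S+) (\d+) (\d+): (.*)$")
# "# word offset" is the ispell-style line for a miss with no suggestions
NO_SUGGESTIONS = re.compile(r"^# (\S+) (\d+)$")

def misspellings(pipe_output, line_bytes):
    """Yield (word, char_start, char_end, suggestions) for one checked line.

    Assumption: the reported offset is a 0-based byte offset into line_bytes.
    """
    for row in pipe_output.splitlines():
        hit = MISS.match(row)
        bare = NO_SUGGESTIONS.match(row)
        if hit:
            word, offset = hit.group(1), int(hit.group(3))
            suggestions = hit.group(4).split(", ")
        elif bare:
            word, offset, suggestions = bare.group(1), int(bare.group(2)), []
        else:
            continue  # banner line, "*" for correct words, blank lines
        # Turn the byte offset into a character index in the decoded text.
        start = len(line_bytes[:offset].decode("utf-8"))
        yield word, start, start + len(word), suggestions

output = """@(#) International Ispell Version 3.1.20 (but really Aspell 0.50.5)
*
*
& geniral 5 10: general, genital, genial, generally, generals
*
"""
# Assumed input line, with the dash written as an escape (U+2014).
line = "scope \u2014 geniral usage".encode("utf-8")
for found in misspellings(output, line):
    print(found)
```

With that input, the reported offset 10 maps to character index 8, so the slice from 8 to 15 of the decoded text is "geniral". The end index relies on the reported word being plain ASCII; a word containing non-ASCII characters would need the same byte-to-character conversion applied to its length.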





