Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens


From: Gary Setter
Subject: Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens
Date: Mon, 15 Aug 2005 08:11:59 -0500

----- Original Message ----- 
From: "Michael Bunk" <address@hidden>
To: "Gary Setter" <address@hidden>
Cc: <address@hidden>
Sent: Sunday, August 14, 2005 10:06 PM
Subject: Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens


> Hi Gary,
>
> thanks for your reply, though you didn't make it too easy for me :)
>
> > Why are you using such an early version? I had issues with that
> > aspect of aspell some time ago. I thought they were resolved.
>
> I have tried 0.60.3 now, but the behaviour is the same.
>
> > You can see what I did in this project:
> > http://sourceforge.net/projects/descdatadiary/
> > It is a windows project. It does not try to do a Unix install,
> > but you may find what I did in speller_impl.cpp interesting.
>
> I have seen that you modified
> aspell-0.60.1-win32/modules/speller/default/speller_impl.cpp
> by implementing 3 new functions:
>
> int aspell_speller_word_seperator_length(speller, char *)
> It returns the number of bytes till the next word character, using
> the aspell internal function !lang_->is_alpha().
>
> int aspell_speller_word_length(speller, char *)
> It returns the number of bytes till the next non-word character,
> using lang_->is_alpha() as well.
>
> aspell_speller_add_lower_to_personal()
> This adds a lowercased version of the given string to the personal
> word list. I guess you implemented this for capitalized words at
> sentence starts?
>
> The problem I see with this approach is that you modified aspell
> internal functions. But since I want to use aspell as a library,
> such modifications are ruled out.
>
> While looking through the code I found that aspell implements a
> Tokenizer class, which seems to be designed to do the same. It is
> not exported, but it is used by the DocumentChecker class. Maybe I
> should try to use that?
>
> But its documentation in aspell.h is confusing (besides being
> misspelled :):
>
> /* process a string
>  * The string passed in should only be split on white space
>  * characters.  Furthermore, between calles to reset, each string
>  * should be passed in exactly once and in the order they appeared
>  * in the document.  Passing in stings out of order, skipping
>  * strings or passing them in more than once may lead to undefined
>  * results. */
> void aspell_document_checker_process(struct AspellDocumentChecker * ths,
>                                      const char * str, int size)
>
> Does it mean I have to split my string to be checked at white space
> before passing in the pieces to this function? Or does it mean that
> this function usually only splits at white space?
>
> Kindest regards,
>  Michael

Hello Michael,

I'm sorry that you had difficulty with my code. Maybe we can both
learn from this. The new functions that you mentioned
(aspell_speller_word_seperator_length et al.) are for different
problems, and I'm not sure whether I will submit them to the aspell
project.
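
Just to illustrate what those two length functions do (this is not
the code from speller_impl.cpp; plain isalpha() stands in here for
aspell's charset-aware lang_->is_alpha(), so the sketch is
ASCII-only):

#include <ctype.h>

/* Illustration, not aspell code: count bytes up to the next word
 * character (separator length) or up to the next non-word character
 * (word length), with isalpha() as a stand-in classifier. */
static int word_separator_length(const char *s)
{
    int n = 0;
    while (s[n] != '\0' && !isalpha((unsigned char)s[n]))
        n++;
    return n;
}

static int word_length(const char *s)
{
    int n = 0;
    while (s[n] != '\0' && isalpha((unsigned char)s[n]))
        n++;
    return n;
}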

I haven't been using the DocumentChecker class; maybe I should be.
What I do want is a word tokenizer that is aware of the character
set and uses the Language class's 'classification' functions
(e.g. is_alpha).
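
For reference, the public C API drives the document checker along
these lines (only a sketch, with error handling trimmed and an
already-created AspellSpeller assumed; the whole line is passed in
and the checker does the word splitting itself; if I remember right,
the example-c.c program shipped with aspell does essentially this):

#include <aspell.h>
#include <stdio.h>
#include <string.h>

/* Sketch: report misspellings in one line of text using the public
 * document-checker interface.  'speller' is assumed to exist already. */
static void check_line(AspellSpeller * speller, const char * line)
{
    AspellCanHaveError * err = new_aspell_document_checker(speller);
    if (aspell_error_number(err) != 0) {
        fprintf(stderr, "%s\n", aspell_error_message(err));
        return;
    }
    AspellDocumentChecker * checker = to_aspell_document_checker(err);

    aspell_document_checker_process(checker, line, (int) strlen(line));

    AspellToken token;
    while ((token = aspell_document_checker_next_misspelling(checker)).len != 0) {
        /* token.offset and token.len locate the word within 'line' */
        printf("misspelled: %.*s\n", (int) token.len, line + token.offset);
    }

    delete_aspell_document_checker(checker);
}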

What I suspect is that there are two stages to tokenizing. First,
something needs to break the text up into words so that
abbreviations are checked as a single word, as Kevin pointed out.
Second, within the check functions in speller_impl.cpp there is some
further breakup going on so that hyphenated words such as "web-site"
are accepted. It was that second level where I was having difficulty
some time ago and which I thought had been fixed.
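
At the speller level, that second stage can be exercised like this
(only a sketch; whether the hyphenated form is accepted by
aspell_speller_check() and by aspell_speller_add_to_personal() is
exactly the behaviour in question):

#include <aspell.h>
#include <stdio.h>
#include <string.h>

/* Sketch: check a hyphenated word and try to add it to the personal
 * list.  'speller' is assumed to exist already; a failed add is
 * reported via aspell_speller_error_message(). */
static void try_hyphenated(AspellSpeller * speller, const char * word)
{
    int ok = aspell_speller_check(speller, word, (int) strlen(word));
    printf("check(\"%s\") -> %d\n", word, ok);

    aspell_speller_add_to_personal(speller, word, (int) strlen(word));
    if (aspell_speller_error(speller) != 0)
        fprintf(stderr, "add_to_personal failed: %s\n",
                aspell_speller_error_message(speller));
    else
        aspell_speller_save_all_word_lists(speller);
}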

Kevin pointed you toward some examples. Are your problems solved
now? Do you agree that there are two levels of tokenization going
on, and if so, at which level are you having difficulty?

Best regards,
Gary




