Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens


From: Michael Bunk
Subject: Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens
Date: Thu, 25 Aug 2005 10:40:46 +0530
User-agent: KMail/1.5.4

Hello Gary,

> I'm sorry that you had difficulty with my code. Maybe we can both
> learn from this. The new functions that you mentioned
> (aspell_speller_word_seperator_length et. el.) are for different
> problems. Also, I'm not sure if I will submit them to the aspell
> project.

It is not that I had problems with the code. But without `diff' I 
would not have been able to spot what you changed, and even with 
diff I may have missed what you wanted to point me to. Which 
different problems were they for? I think you used them to 
identify word starts and endings the way aspell expects them, 
honouring the aspell config option "special".

> I haven't been using the DocumentChecker class, maybe I should
> be. What I do want is a word tokenizer that is aware of the
> character set and uses the Language classes 'classification'
> functions (e.g. is_alpha)

I think that is exactly what aspell's classes Tokenizer and 
TokenizerBasic (derived from Tokenizer) are doing. They are used by 
the DocumentChecker class. TokenizerBasic::advance() skips whitespace 
and then uses is_word(), is_begin() and is_end() to eat the next token.
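
To illustrate the idea, here is a minimal, self-contained sketch of
such an advance loop. It is not aspell's actual code: the CharRules
struct, the Token struct, the tokenize() helper and the is_middle()
check are assumptions made up for this example, and the rules are
hard-coded to letters plus a mid-word apostrophe.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the character classification that aspell
// derives from the language data file; not aspell's real code.
struct CharRules {
    // Plain word characters: here simply ASCII letters.
    bool is_word(char c) const {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    }
    // Models "special ' -*-": the apostrophe may not begin or end a
    // word, but it may appear in the middle of one.
    bool is_begin(char) const { return false; }
    bool is_end(char) const { return false; }
    bool is_middle(char c) const { return c == '\''; }
};

struct Token {
    std::size_t offset;
    std::size_t len;
};

// Skip everything that cannot start a word, then consume one token.
// Returns false once the input is exhausted.
bool advance(const std::string& text, std::size_t& pos,
             const CharRules& r, Token& out) {
    const std::size_t n = text.size();
    while (pos < n && !r.is_word(text[pos]) &&
           !(r.is_begin(text[pos]) && pos + 1 < n &&
             r.is_word(text[pos + 1]))) {
        ++pos;
    }
    if (pos >= n) return false;

    const std::size_t start = pos;
    if (r.is_begin(text[pos])) ++pos;  // optional leading special char
    while (pos < n) {
        if (r.is_word(text[pos])) {
            ++pos;
        } else if (r.is_middle(text[pos]) && pos + 1 < n &&
                   r.is_word(text[pos + 1])) {
            pos += 2;  // special char plus the letter that follows it
        } else {
            break;
        }
    }
    if (pos < n && r.is_end(text[pos])) ++pos;  // optional trailing char
    out = {start, pos - start};
    return true;
}

// Convenience wrapper: collect all tokens of a line as strings.
std::vector<std::string> tokenize(const std::string& text) {
    CharRules rules;
    std::vector<std::string> words;
    std::size_t pos = 0;
    Token t;
    while (advance(text, pos, rules, t)) {
        words.push_back(text.substr(t.offset, t.len));
    }
    return words;
}
```

With these rules, "don't" stays one token, while a hyphen is treated
like whitespace and ends the current token.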

> What I suspect is that there are two stages to tokenizing. First,
> something needs to break up words so that abbreviations are
> checked as one word, as Kevin pointed out. Within the check
> functions in speller_impl.cpp there is some breakup going on so
> that hyphenated words are accepted (web-site). It was that second
> level where I was having difficulty some time ago and which I
> thought was fixed.

In the English dictionary the only special character allowed is the 
apostrophe in the middle of a word, see en.dat in 
aspell6-en-6.0-0.tar.bz2:
special ' -*-
So words containing hyphens are tokenized into two pieces which are 
looked up independently in the word list. You can reproduce this 
behaviour by creating a text file with a hyphenated word where one 
part does not appear in the word list, e.g. "uimpossiblepart-word". 
Aspell will complain about the first part, not the whole hyphenated 
thing.
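
The effect can be sketched without aspell at all. The following is
only a toy model under stated assumptions: kWordList is a made-up
three-word dictionary, and unknown_pieces() is a hypothetical helper
that splits on everything except letters and a mid-word apostrophe
(my reading of the "-*-" flags as begin, middle, end) before looking
up each piece.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy word list standing in for the real dictionary (assumption).
static const std::set<std::string> kWordList = {"web", "site", "word"};

static bool is_letter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Split the text into pieces and return those pieces that are
// missing from the toy word list. A hyphen separates pieces; an
// apostrophe between two letters stays inside a piece.
std::vector<std::string> unknown_pieces(const std::string& text) {
    std::vector<std::string> unknown;
    std::string piece;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        // A virtual trailing space flushes the last piece.
        const char c = i < text.size() ? text[i] : ' ';
        const bool mid_apostrophe =
            c == '\'' && !piece.empty() &&
            i + 1 < text.size() && is_letter(text[i + 1]);
        if (is_letter(c) || mid_apostrophe) {
            piece += c;
        } else if (!piece.empty()) {
            if (kWordList.count(piece) == 0) unknown.push_back(piece);
            piece.clear();
        }
    }
    return unknown;
}
```

For "uimpossiblepart-word" this reports only "uimpossiblepart", and
for "web-site" it reports nothing, since both halves are in the toy
list.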

> Kevin pointed you toward some examples. Are your problems now
> solved? Do you agree that there are two levels of tokenization
> going on, and if so at which level are you having difficulty?

I have changed my code to use the DocumentChecker, as shown in the 
example code. Since the parts of my hyphenated words are found in 
the word list, I am no longer trying to add hyphenated words to my 
personal word list. That makes the spell check work. So even though 
I still cannot add hyphenated words to my word list without 
modifying the "special" config option, I guess I wanted something 
I did not need.

I don't think there are two levels of tokenization going on. Maybe 
the term is confusing. I remember having read in the docs that 
abbreviations are looked up first without a dot and then with it. But 
I didn't look at that code (since I don't want to spend time looking 
for it), so I don't really know. So I also can't tell you at which 
level I had problems... :)

Sending kindest regards and thanks for your help,
 Michael





