Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphen
From: Michael Bunk
Subject: Re: [Aspell-user] aspell_speller_add_to_personal() doesn't accept hyphens
Date: Thu, 25 Aug 2005 10:40:46 +0530
User-agent: KMail/1.5.4
Hello Gary,
> I'm sorry that you had difficulty with my code. Maybe we can both
> learn from this. The new functions that you mentioned
> (aspell_speller_word_seperator_length et al.) are for different
> problems. Also, I'm not sure if I will submit them to the aspell
> project.
It is not that I had problems with the code. But without `diff' I
would not have been able to spot what you changed, and even with
diff I may have missed what you wanted to point me to. Which
different problems were they for? I think you used them to
identify word starts and endings the way aspell expects them,
honouring the aspell config option "special".
> I haven't been using the DocumentChecker class, maybe I should
> be. What I do want is a word tokenizer that is aware of the
> character set and uses the Language classes 'classification'
> functions (e.g. is_alpha)
I think that is exactly what aspell's classes Tokenizer and
TokenizerBasic (derived from Tokenizer) are doing. They are used by
the DocumentChecker class. TokenizerBasic::advance() skips space and
then uses is_word(), is_begin() and is_end() to eat the next token.
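
To make the description above concrete, here is a minimal sketch of
that style of tokenizing. It is not aspell's actual code: the
classification functions are hypothetical stand-ins for the Language
class functions, and is_middle() is an assumed helper for characters
allowed inside a word. The sketch hard-codes the English setting in
which only the apostrophe is special and only in the middle of a word.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Language classification functions.
// Assumption: the apostrophe is the only special character, and it is
// allowed in the middle of a word but not at its beginning or end.
static bool is_word(unsigned char c)   { return std::isalpha(c) != 0; }
static bool is_begin(unsigned char)    { return false; }
static bool is_middle(unsigned char c) { return c == '\''; }
static bool is_end(unsigned char)      { return false; }

// True if a token can start at position i of text.
static bool starts_token(const std::string& text, std::size_t i) {
    if (is_word(text[i])) return true;
    return is_begin(text[i]) && i + 1 < text.size() && is_word(text[i + 1]);
}

// Skip everything that cannot start a word, then eat one token at a
// time, in the spirit of the advance() loop described above.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        while (i < n && !starts_token(text, i)) ++i;
        if (i >= n) break;
        const std::size_t start = i;
        if (is_begin(text[i])) ++i;          // optional leading special
        while (i < n) {
            if (is_word(text[i])) {
                ++i;
            } else if (is_middle(text[i]) && i + 1 < n && is_word(text[i + 1])) {
                ++i;                         // special char between letters
            } else {
                break;
            }
        }
        if (i < n && is_end(text[i])) ++i;   // optional trailing special
        tokens.push_back(text.substr(start, i - start));
    }
    return tokens;
}
```

With these assumed rules, "don't" stays one token while "web-site"
comes out as the two tokens "web" and "site", because the hyphen is
not classified as a word character.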
> What I suspect is that there are two stages to tokenizing. First,
> something needs to break up words so that abbreviations are
> checked as one word, as Kevin pointed out. Within the check
> functions in speller_impl.cpp there is some breakup going on so
> that hyphenated words are accepted (web-site). It was that second
> level where I was having difficulty some time ago and which I
> thought was fixed.
In the English dictionary the only special character allowed is the
apostrophe in the middle of a word, see en.dat in
aspell6-en-6.0-0.tar.bz2:
special ' -*-
So words containing hyphens are tokenized into two pieces which are
looked up independently in the word list. You can reproduce this
behaviour by creating a text file with a hyphenated word where one
part does not appear in the word list, e.g. "uimpossiblepart-word".
Aspell will complain about the first part only, not about the whole
hyphenated thing.
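
As a rough illustration of that independent lookup (a sketch only,
not how aspell implements it): the function name unknown_parts and the
use of a std::set as the word list are assumptions made for this
example.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns the hyphen-separated parts of `word` that are missing from
// `word_list`. Each part is looked up on its own, so a hyphenated word
// is accepted as soon as all of its parts are known.
std::vector<std::string> unknown_parts(const std::string& word,
                                       const std::set<std::string>& word_list) {
    std::vector<std::string> missing;
    std::size_t start = 0;
    while (start <= word.size()) {
        std::size_t end = word.find('-', start);
        if (end == std::string::npos) end = word.size();
        const std::string part = word.substr(start, end - start);
        // Empty parts (leading, trailing or doubled hyphens) are ignored.
        if (!part.empty() && word_list.count(part) == 0) {
            missing.push_back(part);
        }
        start = end + 1;
    }
    return missing;
}
```

Given a word list containing "web", "site" and "word", the call
unknown_parts("web-site", list) returns nothing, while
unknown_parts("uimpossiblepart-word", list) returns only
"uimpossiblepart", which matches the complaint described above.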
> Kevin pointed you toward some examples. Are your problems now
> solved? Do you agree that there are two levels of tokenization
> going on, and if so at which level are you having difficulty?
I have changed my code to use the DocumentChecker, as shown in the
example code. Since the parts of my hyphenated words are found in
the word list, I'm no longer trying to add hyphenated words to my
personal word list. That makes the spell check work. So even though
I still cannot add hyphenated words to my word list without
modifying the "special" config option, I guess I wanted something I
did not need.
I don't think there are two levels of tokenization going on. Maybe
the term is confusing. I remember having read in the docs that
abbreviations are looked up first without a dot and then with it. But
I didn't look at that code (since I don't want to spend time looking
for it), so I don't really know. So I also can't tell you at which
level I had problems... :)
Sending kindest regards and thanks for your help,
Michael