From: Ciarán Ó Duibhín
Subject: [aspell-devel] Tokenization of words containing hyphens
Date: Fri, 21 Jun 2013 12:31:48 +0100
This is the third and last part (change
#3) of my consideration of apostrophes and hyphens in aspell.
Languages may have words containing an internal
hyphen, but with the components not being themselves words of the language (a
possible English example is hotch-potch). In such languages it is
well to allow a word-internal hyphen in *.dat and put such "compounds" in the
dictionary. No new code is required for this. However, with the
change in status of the hyphen, all hyphenated compounds not explicitly included
in the dictionary will now be rejected, even if their components are all in the
dictionary. To avoid this, new code is needed, for languages supporting
internal hyphen, to examine a rejected word, and if it contains an internal
hyphen, to check the components separately. If all the components are
accepted, so is the compound. The hyphen itself will not be included in
the separate components on either side of it.
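As an illustration only, the fallback described above might be sketched as follows. This is not aspell code: the Dictionary type and the function accept_token are hypothetical stand-ins for the speller's real lookup, and the sketch assumes the language's *.dat already permits a word-internal hyphen.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the speller's dictionary lookup (not the aspell API).
using Dictionary = std::unordered_set<std::string>;

// Accept a token if the dictionary contains it as a whole (e.g. "hotch-potch"),
// or, failing that, if every hyphen-separated component is in the dictionary.
// The hyphen itself is not included in the components on either side of it.
bool accept_token(const Dictionary& dict, const std::string& token) {
    if (dict.count(token) != 0) return true;               // whole-token match
    if (token.find('-') == std::string::npos) return false; // nothing to split

    std::size_t start = 0;
    while (true) {
        const std::size_t hyphen = token.find('-', start);
        const std::string part =
            (hyphen == std::string::npos) ? token.substr(start)
                                          : token.substr(start, hyphen - start);
        // An empty component means a leading, trailing or doubled hyphen.
        if (part.empty() || dict.count(part) == 0) return false;
        if (hyphen == std::string::npos) return true;       // last component accepted
        start = hyphen + 1;
    }
}
```

With a dictionary holding spell, check and hotch-potch, this accepts spell-check and hotch-potch but rejects hotch-check, since hotch is not a word on its own.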
There is something else we can do, when a hyphen is
found in a token: we can check whether the component before AND INCLUDING the
hyphen might be a known prefix; or whether the component after AND INCLUDING the
hyphen might be a known suffix. Thus the dictionary could be allowed to
include prefixes (including a final hyphen) and suffixes (including an initial
hyphen), and we can modify *.dat to allow this. Code must be added to
support matching of prefixes and suffixes, to be activated if *.dat allows
initial/terminal hyphen, and when a rejected token contains an internal
hyphen.
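A rough sketch of the prefix and suffix matching just described, again using a hypothetical Dictionary type and function name rather than anything from aspell. It assumes the dictionary may hold entries such as anti- (prefix with final hyphen) and -like (suffix with initial hyphen), and it leaves out the *.dat flags that would switch each test on or off.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the speller's dictionary lookup (not the aspell API).
using Dictionary = std::unordered_set<std::string>;

// Accept a token as a whole, or split it at each internal hyphen and try:
//   1. text up to AND INCLUDING the hyphen as a prefix, remainder as a word;
//   2. text before the hyphen as a word, remainder as a word;
//   3. text before the hyphen as a word, hyphen plus remainder as a suffix.
bool accept_with_affixes(const Dictionary& dict, const std::string& token) {
    if (dict.count(token) != 0) return true;
    // Only internal hyphens are split points: skip the first and last characters.
    for (std::size_t i = 1; i + 1 < token.size(); ++i) {
        if (token[i] != '-') continue;
        const std::string rest = token.substr(i + 1);
        if (dict.count(token.substr(0, i + 1)) != 0 &&      // prefix, e.g. "anti-"
            accept_with_affixes(dict, rest))
            return true;
        if (dict.count(token.substr(0, i)) != 0) {          // plain word
            if (accept_with_affixes(dict, rest)) return true;
            if (accept_with_affixes(dict, token.substr(i))) // suffix, e.g. "-like"
                return true;
        }
    }
    return false;
}
```

Given the entries anti-, -like, war, spell and check, this accepts anti-war (prefix plus word), war-like (word plus suffix) and spell-check (two plain words), and rejects anti-xyz.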
The extra code for processing a token containing an
internal hyphen, after the token has been rejected as a whole, is positioned in
modules/speller/default/speller_impl.cpp, in procedure SpellerImpl::check at
around line 190. The new code is placed before the checking for two words
run together without a space, though this may not be the best place for
it. NOTE that I don't understand the purpose of parameters 3 to 6 of
procedure check, or the corresponding parameters of procedure check2, and
probably have not used them correctly. But the concept is shown to
work.
Here is the additional code:
    unsigned i = 0;
    while (*(word+i) != 0) {
      if ((i > 0) && (i < word_end-word-1) && (*(word+i) == '-')) {
        if (lang_->special('-').end) {
          /* test up to hyphen as prefix, test remainder recursively as word */
          char t = *(word+i+1);
          *(word+i+1) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i+1) = t;
            if (check(word+i+1, word_end, try_uppercase, run_together_limit, ci, gi))
              return true;
          }
          else
            *(word+i+1) = t;
        }
        if (lang_->special('-').middle) {
          /* test up to hyphen as word, test remainder recursively as word, then as suffix */
          *(word+i) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i) = '-';
            if (check(word+i+1, word_end, try_uppercase, run_together_limit, ci, gi))
              return true;
            else {
              if (lang_->special('-').begin) {
                if (check(word+i, word_end, try_uppercase, run_together_limit, ci, gi))
                  return true;
              }
            }
          }
          else
            *(word+i) = '-';
        }
      }
      ++i;
    }
For this code to work as intended, change #2 is
also necessary. Consider the token spell-check. We must
test to see if the dictionary contains a prefix spell- or a suffix
-check or plain words spell and check. We would
expect to find no such prefix or suffix, but to find the two plain words.
But unless change #2 is made, the token spell- will be accepted as
matching the dictionary form spell and the process will be ended
prematurely, albeit with the right result in this case.
As before, my experiments have been conducted using
the Hatier port of aspell for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2
. The changes suggested in these three messages have been made to this
source and compiled using VC++ 2005. On the evidence so far, the changes
appear to be working as intended, thereby solving the three problems I reported
to aspell-user on 19 May 2013, and allowing aspell to treat the tokenization of
apostrophes and hyphens in a similar way to the MS Word spell-checker. As
far as I can see, no existing functionality is adversely affected by these
changes.
Ciarán Ó Duibhín