From: Ciarán Ó Duibhín
Subject: [aspell-devel] Checking of word-marginal specials
Date: Thu, 20 Jun 2013 18:08:40 +0100
This is the second part (change #2) of my
consideration of apostrophes and hyphens in aspell. The first part (change
#1) was "Tokenization of word-initial specials" dated June 14
2013.
Currently, when *.dat marks the apostrophe as valid word-initially, the
dictionary form well validates the token 'well (in addition to the token
well). Likewise, when *.dat marks the apostrophe as valid word-finally, the
dictionary form well also validates the token well'. However, neither 'well
nor well' should ever be validated by the form well; each should be approved
only if that exact form is present in the dictionary.
There are two cases. In the first, when the apostrophe is encountered in a
token in a position, initial or final, where it IS NOT valid in *.dat (and
note that this applies to en.dat), it is immediately dropped from the token,
and only the token without the apostrophe is checked against the dictionary.
(Before change #1, even a valid initial apostrophe was dropped from the
token, but not a valid final apostrophe.) So if "trying the token without the
special" is done with the intention of accepting a token of English which has
picked up a neighbouring quotation mark, that situation never arises at this
stage, and removing the retry will have no effect on it.
In the second case, when the apostrophe is encountered in a token in a
position, initial or final, where it IS valid in *.dat, the token should be
accepted only if the dictionary contains the word including the apostrophe.
The current practice of accepting the token merely because the corresponding
form without the apostrophe is in the dictionary amounts to accepting an
invalid word, possibly one resulting from a mistaken use of the apostrophe
(ASCII hex 27) as a quotation mark. (Remember that languages which accept
valid word-marginal apostrophes in *.dat do not use ASCII hex 27 as a
quotation mark.)
The code for "trying the token with and without any
initial or final special" is found in procedure SensitiveCompare in
modules/speller/default/language.cpp at around line 428. The suggested
change #2 is to remove the code which, when the token begins or ends with a
valid special, and has failed to match the dictionary, compares the token minus
the special to the dictionary. (Note again that a token will never be
found to begin or end with an INVALID special, as that special will have been
dropped during tokenization.) Specifically, I suggest removal of the four
separate lines which use the special() function. Having no previous
experience of C++ programming, I cannot say that everything has been done
which ought to be done, but the concept has been tried and shown to work. I
do not at present see any reason to make the change conditional, i.e. I
cannot see any situation where the present behaviour is preferable.
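The difference between the present behaviour and suggested change #2 can be sketched as follows. This is a simplified model under stated assumptions, not the code of SensitiveCompare: Dictionary, accepts_current and accepts_proposed are hypothetical names, case handling is ignored, only the apostrophe is treated as a special, and the retry strips both ends at once, which the real function may do differently.

```cpp
#include <set>
#include <string>

// Hypothetical dictionary type for the illustration.
using Dictionary = std::set<std::string>;

// Model of the present behaviour as described in the text: try the exact
// token, then retry with any marginal apostrophe removed.
bool accepts_current(const Dictionary& dict, const std::string& token) {
    if (dict.count(token) > 0) return true;
    std::string core = token;
    if (!core.empty() && core.front() == '\'') core.erase(0, 1);
    if (!core.empty() && core.back() == '\'') core.pop_back();
    // Only a token that actually had a marginal apostrophe gets the retry.
    return core != token && dict.count(core) > 0;
}

// Model of suggested change #2: the retry is removed, so only the exact
// token (apostrophe included) can match.
bool accepts_proposed(const Dictionary& dict, const std::string& token) {
    return dict.count(token) > 0;
}
```

In this model, a dictionary containing well and 'twas accepts 'well and well' under the present behaviour but rejects both under the proposed one, while well and 'twas are accepted either way.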
This suggestion will enable a language like
Italian, for example, to have a new it.dat in which word-final apostrophe is
allowed, and non-words like anch may be replaced in the dictionary by anch'.
Even for English, a new en.dat allowing marginal apostrophes and a new
dictionary (with, for example, 'twas and 'twill in place of twas and twill, and
adding 'tis and 'twould) could produce an improvement, but only with English
texts in which an encoding distinction has been made between apostrophe and
quotation mark. The main beneficiaries of the suggestion will be among
languages other than English.
As before, my experiments have been conducted using the Hatier port of aspell
for Windows (http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2).
Third and final part to follow.
Ciarán Ó Duibhín