[aspell-devel] Tokenization of word-initial specials

In the aspell source file modules/tokenizer/basic.cpp, the following block of code occurs at around the 18th line of the procedure TokenizerBasic::advance():

    if (is_begin(*cur) && is_word(cur[1]))
    {
      cur_pos += cur->width;
      ++cur;
    }

This code applies to the case where the relevant *.dat file declares a non-letter to be a valid word-initial symbol, and a token beginning with this symbol is found in text being checked.

As the code stands, the valid word-initial symbol is not stored with the extracted token (unlike a valid non-letter in the middle or at the end of a token). I would suggest the inclusion of a statement

    word.append(*cur);

as an additional first statement within the braces.

The non-retention of the initial symbol in the token produces the following situation: given a dictionary which contains the form 'twas but not the form twas, the token 'twas in text is refused, and the suggested replacement is 'twas (the very form that was refused).
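As a rough illustration of the difference, here is a toy tokenizer. It is not aspell's code: the names tokenize, is_word_char and is_begin_char, and the character classes they encode, are assumptions made for this sketch, and symbols in the middle or at the end of a word are deliberately ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the character classes a language's *.dat file would
// declare. These rules are assumptions for illustration only.
static bool is_word_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
static bool is_begin_char(char c) { return c == '\''; }

// Split text into tokens. With keep_initial == true, a valid word-initial
// symbol followed by a letter is stored in the token (the proposed
// behaviour); with keep_initial == false it is skipped (the behaviour
// described above).
std::vector<std::string> tokenize(const std::string& text, bool keep_initial) {
    std::vector<std::string> tokens;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (i < n) {
        std::string word;
        if (is_begin_char(text[i]) && i + 1 < n && is_word_char(text[i + 1])) {
            if (keep_initial) word.push_back(text[i]);  // the suggested append
            ++i;
        }
        while (i < n && is_word_char(text[i])) word.push_back(text[i++]);
        if (!word.empty()) tokens.push_back(word);
        else ++i;  // skip a character that cannot start a token
    }
    return tokens;
}
```

With keep_initial set to false, the first token of the text "'twas brillig" comes out as twas; with it set to true, the token is 'twas, which is the form that a dictionary containing 'twas can match.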

Inclusion of the suggested statement repairs this behaviour, and in doing so it makes aspell conformant to its stated behaviour in http://aspell.net/man-html/Words-With-Symbols-in-Them.html:
"The case where the symbol can appear at the beginning or end of the word is more difficult to deal with...
Aspell currently handles this case by first trying to spell check the word with the symbol and if that fails, try it without."
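The documented two-step lookup can be sketched as follows. This is only a reading of the quoted passage, not aspell's implementation: check_token is a hypothetical helper, the dictionary is modelled as a plain std::set, and the apostrophe is assumed to be the only special symbol.

```cpp
#include <set>
#include <string>

// Hypothetical helper: accept the token if the dictionary holds it with its
// marginal symbol, and otherwise retry with a leading and/or trailing
// symbol removed.
bool check_token(const std::set<std::string>& dict,
                 const std::string& token,
                 char special = '\'') {
    if (dict.count(token) > 0) return true;  // first try: with the symbol

    std::string stripped = token;
    if (!stripped.empty() && stripped.front() == special) stripped.erase(0, 1);
    if (!stripped.empty() && stripped.back() == special) stripped.pop_back();

    // second try: without the symbol, only if something was actually removed
    return stripped != token && dict.count(stripped) > 0;
}
```

Under this sketch, a dictionary holding only 'twas accepts the token 'twas but refuses the token twas. The first try can therefore succeed only if the tokenizer hands over the token with its initial symbol still attached.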

The effect of the proposed change on English should be an improvement, though not a significant one: English examples are few and unimportant ('tis, 'twas, 'twill, 'twould). However, in many languages word-initial (and word-final) apostrophes are common, and moreover ASCII hex 27 is not used as a quotation mark. In such languages an apostrophe is not a discardable punctuation mark but part of the spelling of the word; removing it produces a different word (or, more usually, a non-word). Nevertheless, this is what aspell normally does: the dictionary holds the residue of the word without any marginal apostrophe, and (per the above quotation) the token less its marginal apostrophe is checked against the dictionary. This strategy may be serviceable, if ugly, for some languages, but the texts I wish to check contain so many words of this type that it would be necessary to admit legions of non-words into the dictionary, and the whole operation breaks down.

The better way to proceed is to add the words to the dictionary complete with their marginal apostrophes, and to check the tokens complete with their marginal apostrophes against the dictionary. For this checking to work in the case of word-initial apostrophes, the suggested change to aspell is a necessary first step. At least two further steps will be needed also, before aspell will be able to reproduce the ability of the MS Word spell-checker to handle these situations satisfactorily.

I considered making a bug report on this, but I thought it needed more explanation than a bug report would normally contain. Also, it would be necessary to be sure that there are no circumstances in which the present behaviour is preferable; if there are, any change should be made to depend on a configuration option.

This proposal concerns valid word-initial symbols in general, including in particular ASCII hex 27, and is independent of any consideration of the Unicode apostrophe.

My experiments have been conducted using the Hatier port of aspell for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .

Ciarán Ó Duibhín

From:	Ciarán Ó Duibhín
Subject:	[aspell-devel] Tokenization of word-initial specials
Date:	Fri, 14 Jun 2013 12:28:52 +0100