[aspell] Updated Aspell's International Plans

aspell-user

From:	Kevin Atkinson
Subject:	[aspell] Updated Aspell's International Plans
Date:	Tue, 14 Nov 2000 22:28:07 -0500 (EST)
I have just updated my "Aspell's International Plans" page, available at
http://aspell.sourceforge.net/international/

Here is the text of the page:

Aspell's International Plans

Last Updated November 14, 2000

--------------------------------------------------------------------------
  Current Plans | Notes | Current Language Data | Back to Aspell's Home   
--------------------------------------------------------------------------

Current Plans and Status

Here are my current plans for international support in Aspell and the
current status as of Aspell .32.6. Your feedback is more than appreciated.
Please send feedback to the Aspell mailing list (address@hidden) or
directly to me (kevina at users sourceforge net), as I am sure I am still
overlooking things.
   
  * Internally store languages that can fit within an existing 8-bit
    character set as such. The character set should be chosen from one of
    the standard character sets in common use which include the ISO8859
    Series, VISCII, and the KOI8 Series character maps. If none of the
    existing character sets is adequate, a new one can be created.
   
  * Support for languages which do not fit within an 8-bit character set
    will come later. The reason for this is that almost all languages
    which do not fit within an 8-bit character set cannot be spell
    checked in the traditional fashion. When I expand Aspell to support
    spell checking these languages I will also expand Aspell to work with
    wide characters. However, for right now this brings an extra level of
    complexity that I don't want to deal with.
   
  * Provide translation maps to convert the various charsets, HTML, TeX,
    Nroff, and ASCII representations to and from the Unicode character
    set. When simple translation maps are insufficient or inefficient,
    such as for UTF-8 or SCSU, use some simple C++ functions instead. In
    some cases a combined approach might be most appropriate, such as with
    HTML. These translation maps and/or functions will allow conversion
    from any input format to any output format. Three good sources for
    these maps are the various key maps for Yudit, the official mappings
    from the Unicode Consortium, and Roman Czyborra's textual reference
    tables scattered about the ISO8859, VISCII, and Cyrillic pages.
   
    These translations are not supported in the current version of Aspell.
    However, they will be as soon as I am done rewriting the filter
    interfaces. This should happen soon but not right away. I am
    currently aiming to get this done by the end of January 2001.
   
  * Information about the characters--such as which are consonants,
    which are vowels, which are symbols, and upper to lower case
    conversion--is currently stored as part of the Unicode character map
    table. The official data from the Unicode Consortium provides most of
    this information.
   
  * This leaves precious little information needed for the actual
    language support file. The language support file includes information
    such as the language name, the charset to use, special non-letter
    characters which can be part of the word such as ' or -, and
    run-together behavior.
   
    Future versions of Aspell will have better support for specifying how
    words can be included in run-together words, as well as affix
    knowledge, which will ultimately lead to affix compression much like
    that in Ispell. However, unless someone contributes code to support
    affix compression, support for this will not come until after
    everything else is done.
   
  * Additional information which can be included in the language support
    file is information to convert a word to its rough soundslike
    equivalent. If this information is not included then the default
    algorithm of simply stripping all the vowels will be used for the
    rough soundslike equivalent. To see how this information is used
    please see the How Aspell Works chapter of the Aspell manual. Also
    see Section 5.3: Phonetic Code for more information on the format of
    the translation array Aspell uses to convert words to their
    soundslike equivalents.
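
As a rough illustration of the default fallback mentioned in the last
item, here is a minimal Python sketch of "strip all the vowels". It is not
Aspell's code: the name rough_soundslike, the vowel set, and the
lower-casing step are assumptions made for the example, and a real language
support file would define its own rules.

```python
# Hypothetical sketch of the "strip all the vowels" fallback.
# VOWELS is an assumption for an English-like alphabet.
VOWELS = frozenset("aeiou")


def rough_soundslike(word: str) -> str:
    """Return the word lower-cased with every vowel removed."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


for w in ("dictionary", "dictionery", "Aspell"):
    print(w, "->", rough_soundslike(w))
```

Under these assumptions "dictionary" and the misspelling "dictionery" both
reduce to the same rough code, which is what lets a lookup by soundslike
find the correct word.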
   
--------------------------------------------------------------------------

Notes on 8-bit Characters

There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity. While many parts of my code can fairly easily be converted to
some sort of wide character, as my code is clean, other parts cannot be.

One of the reasons is that in many, many places I use a direct lookup to
find out various information about characters. With 8-bit characters this
is very feasible because there are only 256 of them. With 16-bit wide
characters this will waste a LOT of space. With 32-bit characters this is
just plain impossible. Converting the lookup tables to some other form,
while certainly possible, will degrade performance significantly.
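
To put rough numbers on this, the following Python sketch computes the size
of a direct-lookup table with one entry per possible character value, and
builds a toy 8-bit table. The one-byte entry size, the name table_bytes,
and the is_vowel table are assumptions for illustration, not Aspell's
actual data structures.

```python
# Size of a direct-lookup table indexed by character value.
# ENTRY_BYTES = 1 is an assumption (one byte of flags per character).
ENTRY_BYTES = 1


def table_bytes(char_bits: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Bytes needed for one table covering every value of a character type."""
    return (2 ** char_bits) * entry_bytes


# A toy 8-bit table: is_vowel[byte] answers in a single index operation.
is_vowel = bytearray(256)
for b in b"aeiouAEIOU":
    is_vowel[b] = 1

for bits in (8, 16, 32):
    print(f"{bits}-bit characters: {table_bytes(bits):,} bytes per table")
print(bool(is_vowel[ord("e")]), bool(is_vowel[ord("z")]))
```

With these assumptions a single table costs 256 bytes at 8 bits, 64 KiB at
16 bits, and 4 GiB at 32 bits, and that cost is paid again for every
separate lookup table.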

Furthermore, some of my algorithms rely on words consisting of only a
small number of distinct characters (often around 30 when case and accents
are not considered). When a word can contain any Unicode character this
number becomes several thousand, if not more. In order for these
algorithms to still be used some sort of limit will need to be placed on
the characters a word can contain. If I impose that limit, I might as well
use some sort of 8-bit character set, which will automatically place the
limit on what the characters can be.

There is also the issue of how I should store the word lists in memory. As
a string of 32-bit wide characters? That uses up 4 times more memory than
8-bit characters would, and for languages that can fit within an 8-bit
character set that is, in my view, a gross waste of memory. So maybe I
should store them in some variable width format such as UTF-8.
Unfortunately, way, way too many of my algorithms will simply not work
with variable width characters without significant modification, which
will very likely degrade performance. So the solution is to work with the
characters as 32-bit wide characters and then convert them to a shorter
representation when storing them in the lookup tables. Now that can lead
to inefficiency. I could also use 16-bit wide characters; however, that
may not be good enough to hold all of future versions of Unicode, and it
has the same problems.
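
The storage trade-off can be seen by encoding one word three ways. This is
a small Python sketch using the standard codecs; the sample word is my own
choice (a Polish word whose letters all exist in ISO-8859-2), not something
taken from an Aspell word list.

```python
# Compare the storage cost of one word in three encodings.
word = "żółć"  # 4 letters, all present in ISO-8859-2

sizes = {
    "iso-8859-2": len(word.encode("iso-8859-2")),  # fixed width, 1 byte/char
    "utf-8": len(word.encode("utf-8")),            # variable width
    "utf-32-le": len(word.encode("utf-32-le")),    # fixed width, 4 bytes/char
}

for name, n in sizes.items():
    print(f"{name:10s} {n:2d} bytes")

# UTF-8 is variable width: byte offset no longer equals character index.
print(len("a".encode("utf-8")), len("ż".encode("utf-8")))
```

The 8-bit encoding needs 4 bytes, UTF-8 needs 8, and UTF-32 needs 16, which
is the "4 times more memory" mentioned above. The last line shows why
algorithms that index directly into a string break under UTF-8: different
characters occupy different numbers of bytes.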

In response to the space wasted by storing word lists in some sort of wide
format, someone asked:
   
    Since hard drives are cheaper and cheaper, you could store the
    dictionary in a usable (uncompressed) form and use it directly with
    memory mapping. Then the efficiency would directly depend on the disk
    caching method, and only the used part of the dictionaries would
    really be loaded into memory. You would no longer have to load plain
    dictionaries into main memory, you'll just want to compute some
    indexes (or something like that) after mapping.

However, the fact of the matter is that most of the dictionary will be
read into memory anyway if the memory is available. If it is not available
then there would be a good deal of disk swaps. Making characters 32 bits
wide will increase the chance of more disk swaps. So the bottom line is
that it will be cheaper to convert the characters from something like
UTF-8 into some sort of wide character. I could also use some sort of
disk-based lookup table such as the Berkeley Database. However, this will
DEFINITELY degrade performance.

The bottom line is that keeping Aspell 8-bit internally is a very well
thought out decision that is not likely to change any time soon. Feel free
to challenge me on it, but don't expect me to change my mind unless you
can bring up some point that I have not thought of before and quite
possibly a patch to cleanly convert Aspell to Unicode internally without a
serious performance loss OR serious memory usage increase.

Notes on Affix Compression

Due to the large amount of affixation in many non-English languages, the
most natural way to store a word list in a compact fashion is with affix
compression. Affix compression basically stores the root word and then a
list of prefixes and suffixes that the word can take. Affix compression
will save space for almost any language, and general affix knowledge will
also help to improve Aspell's suggestion intelligence.
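
The idea of storing a root plus the affixes it can take can be sketched as
follows. This is a toy Python illustration only: the flag letters, the
suffix table, and the expand function are invented for the example and do
not follow Ispell's affix file format, which also attaches conditions to
each rule.

```python
# Toy affix table: flag -> suffixes it allows (invented for illustration).
SUFFIXES = {"S": ["s"], "D": ["ed"], "G": ["ing"]}

# Compressed list: each root carries the flags of the suffixes it accepts.
COMPRESSED = {"walk": "SDG", "talk": "SDG", "book": "S"}


def expand(compressed, suffixes):
    """Expand every root into the full list of words it stands for."""
    words = []
    for root, flags in compressed.items():
        words.append(root)
        for flag in flags:
            for suffix in suffixes[flag]:
                words.append(root + suffix)
    return sorted(words)


expanded = expand(COMPRESSED, SUFFIXES)
print(len(COMPRESSED), "roots ->", len(expanded), "words")
print(expanded)
```

Here 3 stored roots stand for 10 words. The ratio of expanded words to
stored roots is the kind of figure quoted for the word lists in this
section.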

Affix compression will save a lot of space for languages with a lot of
affixation; however, it is not vital. The reason is that without affix
compression all you have to do is list all of the possible combinations.
I realize that this wastes space; however, it is doable.
For example the word list that comes with Aspell has
 70,598 words
After running it through the munchlist script it has
 30,953 words
Which leads to a ratio of
 2.3

Now a Polish word list has these numbers (expanded words, base words,
ratio):
 1,041,430
 146,626
 7.1

Which means that for the Polish language affix compression saves about 3.1
times more space than it would for the English dictionary. Not that big of
a deal.

Also, notice that this dictionary is mighty large, especially considering
that the largest English word list I have has these numbers:
 120,361
 73,358
 1.6

So my question is: do you really need that large of a dictionary? The
original poster who sent me these figures agrees that a lot of those
146,626 base words are not needed. So, let's say that we reduce the base
list down to 35,000; then the numbers are...
 248,000
 35,000
 7.1

Which has about 3.5 times more words than Aspell's English dictionary.
True, this is large and wasteful; however, it is certainly manageable.
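
For anyone who wants to check the arithmetic, the ratios quoted above can
be reproduced with a few lines of Python. The word counts are the ones
given in the text; the variable and function names are just for this
example.

```python
def ratio(expanded: int, base: int) -> float:
    """Expanded word count divided by base (affix-compressed) word count."""
    return expanded / base


english = ratio(70_598, 30_953)          # word list shipped with Aspell
polish = ratio(1_041_430, 146_626)       # Polish word list
large_english = ratio(120_361, 73_358)   # largest English list

print(f"English:        {english:.1f}")
print(f"Polish:         {polish:.1f}")
print(f"Large English:  {large_english:.1f}")
print(f"Polish/English: {polish / english:.1f}")

# Reduced Polish list: 35,000 base words at the rounded Polish ratio of 7.1.
reduced_polish = round(35_000 * 7.1)
print(f"Reduced Polish: {reduced_polish:,} words, "
      f"{reduced_polish / 70_598:.1f}x the English list")
```

The product 35,000 x 7.1 comes to 248,500, which the text rounds to
248,000. Either figure gives about 3.5 times the size of the English list.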

So once again, affix compression will save space; however, the expanded
word lists are manageable without affix compression, especially since
Aspell now mmaps the word lists in.

Also, affix compression will not likely work well with Aspell's current
suggestion strategy of finding all words with a similar soundslike. This
is because I will still need to store the soundslike data and each
soundslike will still need to point to all words with that soundslike.
Nevertheless it is still doable, it is just not likely to save enough
space to make it worth it. Please see my email "Compiled Word List
Compression Notes" posted to the Aspell mailing list for some more light
on the subject as well as alternate compression strategies.
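
A small Python sketch may make the problem concrete. It is a
simplification under stated assumptions: the soundslike function is a
vowel-stripping stand-in, the index is a plain dictionary, and lookup is by
exact soundslike match, whereas Aspell looks for similar soundslikes. None
of the names below come from Aspell.

```python
from collections import defaultdict


def soundslike(word):
    # Stand-in code (assumed for illustration): strip the vowels.
    return "".join(c for c in word.lower() if c not in "aeiou")


def build_index(words):
    """Map each soundslike code to every word that produces it."""
    index = defaultdict(list)
    for w in words:
        index[soundslike(w)].append(w)
    return index


def suggest(index, misspelled):
    """Return the words sharing the misspelling's soundslike code."""
    return index.get(soundslike(misspelled), [])


# Every expanded form needs its own entry, even if only "walk" were stored.
words = ["walk", "walks", "walked", "walking", "wild", "weld"]
index = build_index(words)
print(dict(index))
print(suggest(index, "wald"))
```

Even if the word list itself stored only the root "walk" plus affix flags,
the index still needs an entry for each of "walks", "walked", and
"walking", so the soundslike data does not shrink along with the word list.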

So, in order to get the full benefit of affix compression the soundslike
data would not be used, which will result in suggestions that are not much
better than Ispell's. For languages with phonetic spelling this is not
likely to be a problem; however, for languages like English which don't
have phonetic spelling of words this is likely to be unacceptable.

I do plan on eventually implementing affix compression; it is just being
put on hold until the rest of the international code is done. If you
really care about affix compression then implement it yourself. But first
talk to me so I can make sure you are doing it in a manner that is
consistent with the rest of the Aspell library. Please see the emails
titled "Affix Compression" and "More Affix Compression Notes" posted to
the Aspell mailing list for some notes from the author of Ispell on some
of the issues involved.

--------------------------------------------------------------------------

Current Language Data

I am currently trying to build a database of language information. You can
help by sending me an email at kevina at users sourceforge net with the
following information.
   
 1. What 8-bit character sets are typically used for your language and any
    shortcomings those character sets may have.
 2. What additional characters can appear in a word in your language and
    where they can appear, such as the "-" in French or the "'" in English.
 3. The level of affixation your language has.
 4. How phonetic the spelling of the language is.
 5. Additional considerations that should be taken into account for spell
    checking your language, such as allowing run-together words.
 6. Links to word lists for your language and a critique of how good you
    think they are.

Current Data

If your language is already here please review your language and let me
know of any errors or missing information based on the above criteria.

+----------+-----------------------+------------+--------+----------+----------+
|          |    Character Maps     |  Special   |Phonetic|Affixation|Allow Run |
|          |                       | Characters |Spelling|  Level   | together |
|          |                       |            |        |          |  Words   |
+----------+-----------+-----------+------+-----+        |          +---+------+
| Language |  Default  |  Others   |Border|Other|        |          |   |Middle|
|          |           |           |      |Word |        |          |   |Chars |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|  Danish  |ISO-8859-1 |ISO-8859-5 | -    |     |No      |High      |Yes| s    |
|          |           |ISO-8859-10|      |     |        |          |   |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
| English  |ISO-8859-1 |ASCII      |      | '   |No      |2.3       |No |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|Esperanto |ISO-8859-3 |           | -    | '   |Yes     |High      |Yes|      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|  German  |ISO-8859-1 |ISO-8859-2 |      |     |?       |High      |No |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|Portuguese|ISO-8859-1 |           | -    |     |?       |High      |No |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
| Russian  |KOI8-R     |ISO-8859-5 | -    |     |Mostly  |High      |Yes|      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|Slovenian |ISO-8859-2 |           | -    |     |No      |Low       |Yes|      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
| Spanish  |ISO-8859-1 |           |      |     |Mostly  |High      |No |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+
|  Welsh   |ISO-8859-14|           |      | '   |Mostly  |High      |No |      |
+----------+-----------+-----------+------+-----+--------+----------+---+------+


--- 
Kevin Atkinson
kevina at users sourceforge net
http://metalab.unc.edu/kevina/