aspell-user
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[aspell] Compund words


From: Kevin Atkinson
Subject: [aspell] Compund words
Date: Fri, 16 Jun 2000 08:49:24 -0400 (EDT)

On Thu, 15 Jun 2000, Geoff Kuenning wrote:

> The other issue with German compounds is the formation of really long
> words, like the famous Donaudampfschiffgesellschaftmeister (please
> forgive me if I misspelled it!).  Ispell doesn't currently support
> these, for reasons of performance having to do with how I originally
> implemented compound words (they are a special case of the "missing
> space" code).  The basic problem is that when you're working with a
> word like "gesellschaftmeister", ispell doesn't know whether it should
> break it up as "gesellschaft meister" or "gesell schaft meister" (is
> "schaft" a legal word?  I forget.)  It wouldn't be hard to add code to
> try all such possibilities, but I fear that it would run fairly
> slowly.  The basic algorithm would be something like this:
> 
>       for i = 1 to length of word
>           pull off first i characters from the front into "temp"
>           if "temp" is a legal word
>               pull off remaining characters into "rest"
>               call yourself recursively with "rest"
>               if "rest" can be broken into words, consider the whole
>                 thing a legal compound
> 
> This algorithm will definitely work, but it will be slow in situations
> where the word has a lot of prefixes that are legal words.

Instead of just guessing how about finding out.  It can be done in linear
time with the attached prefix.cc which takes a sorted word list in stdin.

When run on a German word list I get this:

 8      übersättigender übersättigende übersättigend übersättigen
        übersättige übersät übers über übe
 8      übersättigendes übersättigende übersättigend übersättigen
        übersättige übersät übers über übe
 9      weitererzählendem weitererzählende weitererzählend weitererzählen
        weitererzähle weiterer weitere weiter weite weit
 9      weitererzählenden weitererzählende weitererzählend weitererzählen
        weitererzähle weiterer weitere weiter weite weit
 9      weitererzählender weitererzählende weitererzählend weitererzählen
        weitererzähle weiterer weitere weiter weite weit
 9      weitererzählendes weitererzählende weitererzählend weitererzählen
        weitererzähle weiterer weitere weiter weite weit
 9      zurückhaltenderem zurückhaltendere zurückhaltender zurückhaltende
        zurückhaltend zurückhalten zurückhalte zurück zur zu
 9      zurückhaltenderen zurückhaltendere zurückhaltender zurückhaltende
        zurückhaltend zurückhalten zurückhalte zurück zur zu
 9      zurückhaltenderer zurückhaltendere zurückhaltender zurückhaltende
        zurückhaltend zurückhalten zurückhalte zurück zur zu
 9      zurückhaltenderes zurückhaltendere zurückhaltender zurückhaltende
        zurückhaltend zurückhalten zurückhalte zurück zur zu

OK so it looks like for some words it is pretty bad here is what the total
distribution looks like.

./a.out < german.wl | cut -b1-2 | sort | uniq -c

  31204  0
  59328  1
  82950  2
  55998  3
  38768  4
  17881  5
   6985  6
   1678  7
    104  8
      8  9

So it still looks pretty bad however 9 prefixes mean that roughly
  size of compound word * 9 
possibilities are being tried (assuming that there are no other 
prefixes in the components of the compound that are not at the
beginning, which may not be the case but read on)
so with a word of size 50 there are 450 possibilities being
tried.  That may seam like a lot but when coming up with words that are 1
edit distance (as Ispell does when coming up with near misses) apart
there are roughly
  2 (n-1) + n + n E + (n+1) E
combinations being tried where n is the size of the word and E is the size
of the alphabet which will be at least 26.

Thus the number of combinations being tried when coming up with
suggestions is more than the number tried when attempting to see if a
compound word can be broken.  Now considering that such compounds are not
that common (am I correct) the running time for spell checking a document
is still perfectly acceptable.

Now if other prefixes are found in some of the inner words the number of
combinations will go up even more however there would have to be a lot of
prefixes to make the running time unacceptable, especially considering that
such wards would be very rare.

However, in order to really tell how many different combinations are being
tried one will have to try it out on an extremely large number if words and
just count the number of combinations tried.  If you are interested in
helping me count these please let me know.

Now for those of you are interested the worst case running time is:
  T(s < l, l) = 0
  T(s    , l) = sum T(i,l) where i = l to l-s
where s is the length of the word and l is the number of possible
combinations when a limit of n is set on the length of individual
components.

The worst case is when every single possible prefix is in the dictionary.

Here is a chart of fir s = 1 .. 50 and l = 1 .. 5

s\l 1              2              3              4              5
 1  0              0              0              0              0
 2  1              0              0              0              0              
 3  3              0              0              0              0              
 4  7              1              0              0              0              
 5  15             2              0              0              0              
 6  31             4              1              0              0              
 7  63             7              2              0              0              
 8  127            12             3              1              0              
 9  255            20             5              2              0              
10  511            33             8              3              1              
11  1023           54             12             4              2              
12  2047           88             18             6              3              
13  4095           143            27             9              4              
14  8191           232            40             13             5              
15  16383          376            59             18             7              
16  32767          609            87             25             10             
17  65535          986            128            35             14             
18  131071         1596           188            49             19             
19  262143         2583           276            68             25             
20  524287         4180           405            94             33             
21  1048575        6764           594            130            44             
22  2097151        10945          871            180            59             
23  4194303        17710          1277           249            79             
24  8388607        28656          1872           344            105            
25  16777215       46367          2744           475            139            
26  33554431       75024          4022           656            184            
27  67108863       121392         5895           906            244            
28  134217727      196417         8640           1251           324            
29  268435455      317810         12663          1727           430            
30  536870911      514228         18559          2384           570            
31  1.07374182e+09 832039         27200          3291           755            
32  2.14748365e+09 1346268        39864          4543           1000           
33  4.2949673e+09  2178308        58424          6271           1325           
34  8.58993459e+09 3524577        85625          8656           1756           
35  1.71798692e+10 5702886        125490         11948          2327           
36  3.43597384e+10 9227464        183915         16492          3083           
37  6.87194767e+10 14930351       269541         22764          4084           
38  1.37438953e+11 24157816       395032         31421          5410           
39  2.74877907e+11 39088168       578948         43370          7167           
40  5.49755814e+11 63245985       848490         59863          9495           
41  1.09951163e+12 102334154      1243523        82628          12579          
42  2.19902326e+12 165580140      1822472        114050         16664          
43  4.39804651e+12 267914295      2670963        157421         22075          
44  8.79609302e+12 433494436      3914487        217285         29243          
45  1.7592186e+13  701408732      5736960        299914         38739          
46  3.51843721e+13 1.13490317e+09 8407924        413965         51319          
47  7.03687442e+13 1.8363119e+09  12322412       571387         67984          
48  1.40737488e+14 2.97121507e+09 18059373       788673         90060          
49  2.81474977e+14 4.80752698e+09 26467298       1088588        119304         
50  5.62949953e+14 7.77874205e+09 38789711       1502554        158044 

The source that created this is also attached.  It turns out that it is
also equal to:

T(s,l) = t(s-l+1, l) - 1
         where t(-l..l) = 1
               t(n)     = t(n-1) + t(n-l)

Which should have an explicit formula.

I am not quote sure why this is .  I discovered this pattern by searching
for known sequence at Sloane's On-Line Encyclopedia of Integer Sequences
(http://www.research.att.com/~njas/sequences/).

If someone has some spare time this will be an interesting thing to
prove.

--- 
Kevin Atkinson
address@hidden
http://metalab.unc.edu/kevina/








Attachment: prefix.cc
Description: Text document

Attachment: split.cc
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]