From: Maxim Nikulin
Subject: A soundslike problem with combined English+Russian dictionary
Date: Tue, 22 Jun 2021 23:56:25 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

Hi,

I am aware that multi-lingual dictionaries are unsupported by Aspell, but I think that in some particular cases it is still possible to combine a couple of dictionaries and get a result of reasonable quality. I have almost achieved what I expected for merged English and Russian word lists, and I am quite satisfied even with the current result. Maybe I just have not yet discovered the detrimental effects of the missing affix table for English or of the combined special characters ("-" and "'").

I was hooked by the description of the metaphone algorithm, which should improve suggested corrections for misspelled words. Since I am not a native English speaker, I do not mind having such a feature if it helps me recall a word. For Russian, plain edit distance should be enough, so I tried to use a copy of en_phonet.dat with an added line (and an exact copy as well)

    remove_accents 0

that is referenced in the .dat file

    soundslike rue_phonet

To my surprise, with such a configuration the whole English alphabet is suggested as a replacement for a misspelled Russian word. In the following example the word "funetik" is taken from the manual to check that phonetic rules are taken into account (another example, taff -> tough, does not work with the default suggestion mode)

    echo "funetik програма" | aspell -d ./rue.rws -a
    @(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
    & funetik 26 0: fanatic, funk, fungi, Fuentes, functor, frenetic, genetic, kinetic, finite, fount, fungoid, funky, lunatic, phonetic, fountain, funked, Fundy, fined, founts, funded, font, fund, frantic, funkier, fount's, Fuentes's
    & програма 100 8: программа, программ, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, X, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, x, z, AA, AI, AR, Ar, Au, BA, BB, BO, Ba, Be, Bi, CA, CO, Ca, Ce, Ch, Ci, Co, Cu, DA, DD, DE, DI, Di, Du, Dy, ER, EU, Er, Eu, FY, Fe, GA, GE, GI, GU, Ga, Ge, HI, Ha, He, Ho, IA, IE, Ia, Io, Ir, Jo, KO, KY

That is why I have

    soundslike generic

in my current configuration, and it gives more reasonable variants for the Russian test word:

    & програма 13 8: программа, программ, программе, программу, программы, программах, программам, программка, параграмма, программою, проиграна, параграмм, погрома

Have I done something wrong? Is it expected behavior that English phonetic rules have such a detrimental effect on suggestions for Russian words? I am unsure whether the observed result is a bug. (Actually the question is: "How many bugs have I run into?", with zero as a possible answer.)

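One possible reading of the output above can be sketched as a toy model. It is not Aspell's soundslike or scoring code. It assumes, without verification against the Aspell sources, that a phonetic table written for Latin letters produces an empty (or nearly empty) key for a Cyrillic word, so that candidates are ranked mostly by the length of their own key. The names `toy_soundslike`, `ASCII_ONLY_RULES` and `rank` are hypothetical.

```python
# Toy model only: NOT Aspell's implementation. It illustrates the assumption
# that characters without a phonetic rule contribute nothing to the key.

def toy_soundslike(word, rules):
    """Map each character through `rules`; characters without a rule are dropped."""
    return "".join(rules.get(ch, "") for ch in word.lower())


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical rule table: ASCII letters map to themselves (uppercased),
# Cyrillic letters have no rule at all.
ASCII_ONLY_RULES = {c: c.upper() for c in "abcdefghijklmnopqrstuvwxyz"}


def rank(misspelled, candidates, rules):
    """Order candidates by the distance between soundslike keys (stable sort)."""
    key = toy_soundslike(misspelled, rules)
    return sorted(candidates,
                  key=lambda w: edit_distance(key, toy_soundslike(w, rules)))


if __name__ == "__main__":
    words = ["program", "AA", "программа", "Be", "A", "b"]
    print(repr(toy_soundslike("програма", ASCII_ONLY_RULES)))
    print(rank("програма", words, ASCII_ONLY_RULES))
```

In this model the Russian candidate comes first, then one-letter words, then two-letter words, which resembles the shape of the list above. Whether this is what actually happens inside Aspell remains an open question.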
More details of my configuration.

The goal is to see misspelled words in mixed-language documents with my notes. Suggested corrections are appreciated as well. This has worked in Vim for years:

    set spelllang=en,ru spell

and I would like to have a comparable feature in Emacs

    M-x flyspell-mode RET
    M-x ispell-change-dictionary RET rue

without special configuration of a custom dictionary in Emacs. Side note: I am certainly against the idea, which I have seen once, of binding the ispell dictionary to the input method.

There is a feature request for support of multi-lingual dictionaries

https://github.com/GNUAspell/aspell/issues/448

(and a number of similar threads in the archive of this mailing list). People are still trying to combine dictionaries:

https://unix.stackexchange.com/questions/341714/use-multi-language-dictionary-with-aspell
https://wiki.archlinux.org/title/User:Georgek

There is no section in the manual that clarifies the possible problems of this approach.

I hope that in my particular case of English and Russian it can be done in a somewhat more accurate way.

- I rarely use letters with accents, so the alphabets are disjoint sets of characters. US-ASCII is a subset of the KOI8-R encoding.
- The cost of discarding affix data for Russian is ~30M of disk space (and almost certainly RAM as well). I am unsure whether I lose something by ignoring the affix table for English.
- The combined "special" is a kind of compromise; it should be per-language. However, I do not have an example of imperfect behavior yet.
- As I said above, I would prefer phonetic rules for English, but I have to use generic ones.

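The encoding claims in the first item can be checked with a short sketch. It uses Python's standard `koi8_r` codec; the helper names are hypothetical, and the check covers only the 33-letter Russian alphabet and plain ASCII, not every character a real word list may contain.

```python
# Sanity check of the encoding assumptions using Python's built-in codecs.

RUSSIAN_LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RUSSIAN = RUSSIAN_LOWER + RUSSIAN_LOWER.upper()


def is_ascii_superset(codec):
    """True if every 7-bit byte decodes to the same character as in US-ASCII."""
    ascii_bytes = bytes(range(128))
    return ascii_bytes.decode(codec) == ascii_bytes.decode("ascii")


def uses_only_high_bytes(text, codec):
    """True if no character of `text` is encoded as a 7-bit (ASCII) byte."""
    return all(byte >= 0x80 for byte in text.encode(codec))


def is_encodable(text, codec):
    """True if `text` can be represented in `codec` at all."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print("ASCII subset of KOI8-R:", is_ascii_superset("koi8_r"))
    print("Russian letters outside ASCII:", uses_only_high_bytes(RUSSIAN, "koi8_r"))
    print("'café' fits into KOI8-R:", is_encodable("café", "koi8_r"))
```

The last check illustrates the "modulo accents" caveat: accented Latin letters from ISO8859-1 have no KOI8-R representation, so such words would need to be dropped or stripped before building the combined list.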
--->8--- rue.dat begin --->8---
# Combined dictionary for English and Russian languages
#
# An attempt to create a dictionary suitable for spell checking
# of mixed-language texts.
#
# Something distinct from just "ru" and "en". Do not use a name longer
# than 3 characters otherwise it will not appear in "aspell dump dicts"
# thus will be ignored by other applications. Numbers, e.g. "ru2"
# make language identifier invalid as well.
name rue
# ISO8859-1 used for "en" dictionaries is a subset of KOI8-R
# modulo accents.
# Russian dictionary from system package on Ubuntu uses namely KOI8-R.
charset koi8-r
# Combine values from "ru" and "en"
special - -*- ' -*-
# With
#
# soundslike rue
#
# and a copy of en_phonet.dat aspell suggests
# e.g. "phonetic" for "funetik" input.
# Unfortunately it ruins scoring of corrections for Russian.
# Even with "remove_accents 0" inside "rue_phonet.dat", abundant
# single- and two-letters variants appear as alternatives.
# However a couple of top rated suggestions are still reasonable.
# Segfault may happen on attempt to generate master dictionary
# when "rue_phonet.dat" is missed in the current directory.
# As a compromise, prefer better quality of correction variants
# for Russian.
soundslike generic
# Affix compression is not enabled for "en" system dictionaries.
# At the same time it allows to save enough space for "ru" dictionary.
# Size of compressed dictionary is 3Mb, expanded one consumes 30Mb
# of disk space.
#
# aspell --lang=ru --encoding=koi8-r dump master \
# | aspell --lang=ru --encoding=koi8-r expand \
# | aspell --lang=ru create master ./ru.rws
#
# aspell --lang=ru --encoding=koi8-r dump master \
# | aspell --lang=ru --encoding=koi8-r expand \
# | tr ' ' '\n' \
# | aspell --lang=ru create master ./ru-expand.rws
#
affix-compress true
# Actually it is ignored and "rue_affix.dat"
# (copy or symlink is required).
affix ru
# Noticed differences:
#
# echo "programm funetic" | aspell --lang en -a
# & programm 5 0: program, programs, programmer, programmed, program's
# & funetic 14 9: fanatic, frenetic, genetic, kinetic, lunatic, phonetic, frantic, fungi, Fuentes, antic, functor, fanatics, fungoid, fanatic's
# # ------------------------------------------------------------^^^^^^^^
#
# echo "programm funetic" | aspell --lang rue -a
# & programm 6 0: program, programs, programmed, programmer, program's, pogrom
# #---------------------------------------------------------------------^^^^^^
# & funetic 5 9: fanatic, genetic, kinetic, lunatic, Fuentes
#
# Absence of "phonetic" caused by "soundslike generic". "Pogrom" presents
# in the original "en" word list.
---8<--- rue.dat end ---8<---

--->8--- rue.multi begin --->8---
# Combined dictionary for English and Russian languages
#
# It is not possible to just add ru.multi and en.multi
# because of languages
# inside the dictionaries differ. Unsure if it is safe to generate
# dictionary for English language using modified ru.dat
# with "special ' -*-".
# Let's generate dictionaries with "rue" as a language identifier.
#
# System-wide .rws files are created on Ubuntu in postinst scripts by
# /usr/sbin/update-dictcommon-aspell and /usr/sbin/aspell-autobuildhash
# utilities. Source word lists are provided
# in /usr/share/aspell directory.
# Example of command to unpack:
#
# zcat /usr/share/aspell/en-wo_accents-only.cwl.gz | precat
#
# E.g. en_US dictionary is combination of en-common
# (shared with e.g. en_GB)
# and en-wo_accents-only. Unsure if I need this degree of word list
# granularity, so let's try a naive approach to create word lists.
#
# "rue_affix.dat" is required despite "affix ru" line in rue.dat
#
# ln -s /usr/lib/aspell/ru_affix.dat rue_affix.dat
#
# aspell --lang=ru --encoding=koi8-r dump master \
# | aspell --lang=rue create master ./rue-ru.rws
#
# Despite warnings like
#
# # Warning: Removing inapplicable affix 'H' from word Адель.
#
# expanded word list is the same as the original one.
add rue-ru.rws
# Specify encoding to avoid UTF-8 if some accents
# will appear accidentally.
#
# aspell --lang=en_US --encoding=iso8859-1 dump master \
# | aspell --lang=rue create master ./rue-en_US.rws
add rue-en_US.rws
---8<--- rue.multi end ---8<---

Commands to generate word lists are in the last comments in rue.multi. Finally I can run

    aspell --lang rue -a

Does such a configuration have apparent problems? Is it possible to use en_phonet.dat instead of "generic" for soundslike?
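
The "Noticed differences" comparison in rue.dat can be repeated mechanically for other test words. The sketch below parses the `& word count offset: suggestion, ...` lines printed in `-a` (Ispell-compatible pipe) mode and lists suggestions that one configuration offers and the other does not. The function names are hypothetical, and the two input lines are copied from the outputs quoted in this message rather than produced by running Aspell.

```python
# Compare suggestion lists taken from two `aspell -a` transcripts.

def parse_suggestions(line):
    """Parse one '& word count offset: s1, s2, ...' line into (word, suggestions)."""
    head, sep, tail = line.strip().partition(": ")
    fields = head.split()
    if not sep or len(fields) != 4 or fields[0] != "&":
        raise ValueError("not a suggestion line: %r" % line)
    return fields[1], tail.split(", ")


def only_in_first(first_line, second_line):
    """Suggestions present in the first line but absent from the second."""
    word_a, first = parse_suggestions(first_line)
    word_b, second = parse_suggestions(second_line)
    if word_a != word_b:
        raise ValueError("lines describe different words")
    return [s for s in first if s not in second]


# Lines quoted from the comparison of "--lang en" and "--lang rue".
EN_LINE = ("& funetic 14 9: fanatic, frenetic, genetic, kinetic, lunatic, "
           "phonetic, frantic, fungi, Fuentes, antic, functor, fanatics, "
           "fungoid, fanatic's")
RUE_LINE = "& funetic 5 9: fanatic, genetic, kinetic, lunatic, Fuentes"

if __name__ == "__main__":
    print("lost with 'soundslike generic':", only_in_first(EN_LINE, RUE_LINE))
    print("gained:", only_in_first(RUE_LINE, EN_LINE))
```

Order is preserved, so the output also shows how highly ranked the lost suggestions were in the "en" configuration.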