bug#17742: Acknowledgement (Support for enchant?)


From: Eli Zaretskii
Subject: bug#17742: Acknowledgement (Support for enchant?)
Date: Tue, 20 Dec 2016 17:40:12 +0200

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Mon, 19 Dec 2016 21:47:42 +0000
> Cc: 17742@debbugs.gnu.org
> 
>     neither GNU Aspell nor hunspell offer any way to get this information 
> (about character classes of dictionaries) via their APIs.
> 
>     They provide this information in the dictionaries, and we glean it
>     from there.  See ispell-parse-hunspell-affix-file and
>     ispell-aspell-find-dictionary.
> 
> The dictionaries are not part of the API (even where the format is
> documented, the location may not be fixed), so it's not a good idea to rely 
> on them.

If there's no better way, then I see no problem in relying on the
dictionaries, and de-facto the results are satisfactory.

> Having discovered that Aspell does not provide this information (I checked
> again, and ispell-aspell-find-dictionary does not find this information in
> the dictionaries, except for limited information about otherchars; for
> casechars and not-casechars it defaults to [:alpha:]), I shall investigate
> with the hunspell maintainers.

Aspell provides some of that, and there's no reason to ignore what it
does provide.

> Currently, using casechars = [[:graph:]], if I put point over part of the
> string " (XP) ", and run M-x ispell-word, it says "(XP) is correct". That's 
> good enough for me!

Whether it's good enough depends on the dictionary and on what "(XP)"
means.  It could be that "(XP)", including the parentheses, is a word
the dictionary recognizes, something akin to "(C)", i.e. copyright
sign.  And it could be that the correct word is "XP", with the
parentheses acting as punctuation.  And there could be additional
alternatives.  Only the dictionary "knows" what is the right
alternative, and ispell.el should abide by the dictionary's rules, or
else it will not do what the user wants.  E.g., "XP" might not be in
the dictionary (as in fact I find when I try that with Hunspell), while
"(XP)" is.  So CASECHARS should be set up according to what the
dictionary expects, or you will have false positives and false
negatives.
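
To make the point concrete, here is a minimal Python sketch of how the
choice of CASECHARS decides which string gets sent to the speller.  It
is not ispell.el's actual code: the helper name `word_at` is made up for
illustration, and the ASCII range `[!-~]` is only a stand-in for
[[:graph:]], which Python's `re` module does not support.

```python
import re

def word_at(text, pos, casechars):
    """Return the maximal run of characters matching `casechars` around `pos`.

    `casechars` is a regexp matching exactly one word-constituent character.
    (Hypothetical helper; a simplification of what a speller front end does.)
    """
    is_word_char = re.compile(casechars).fullmatch
    if not is_word_char(text[pos]):
        return ""
    start = pos
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    end = pos + 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[start:end]

line = "see (XP) here"
pos = line.index("X")

alpha_word = word_at(line, pos, r"[A-Za-z]")  # letters only
graph_word = word_at(line, pos, r"[!-~]")     # ASCII stand-in for [[:graph:]]

print(repr(alpha_word), repr(graph_word))
```

With letters only, the word at point is "XP"; with the [[:graph:]]
stand-in it is "(XP)".  Which of the two is right depends on what the
dictionary contains.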

> Note that merely using the characters declared in the dictionary may not be 
> enough: I have words like SC³D (I spell my company that way) in my personal 
> word lists. Other users might be more imaginative, and for example have 
> sequences of emoji. The list of characters in the dictionary is only a 
> minimum.
 
That's why personal word lists go together with dictionaries: they both
must use the same affix rules, so if you change to another dictionary
for the same language, your personal word list should also change, or
else you will get false negatives.

>     So we do need this information.  If Enchant doesn't provide it, we
>     could still use the same technique as with Aspell and Hunspell,
>     provided that we can figure out which back end(s) is/are used by
>     Enchant.  Is that doable?
> 
> Yes, that can be done, but it's fragile; that's why I'm trying to avoid it.

I don't see why it would be fragile with Enchant when it isn't with
its back-ends.  And avoiding even fragile methods is worse than using
them, when there's no better way of gleaning the same information, and
the information is important (as it is in this case).

>     Ispell.el also supports spell-checking by words, in which case the
>     above is not useful, because we need to figure out what is a word.
>
> See above. It's not clear to me that we need a very precise idea of what
> constitutes a word.

I think you are drawing overly radical conclusions from trying that
with a single word and a single dictionary.  Which string was sent to
the speller in this case, and is that the string you expected to be sent?

>     Moreover, even when we send entire lines to the speller, we want to
>     skip lines that include only non-word characters.
>
> Why?

To avoid false positives and false negatives, as explained above.

>     Hunspell is the most modern and sophisticated speller, we certainly
>     don't want to degrade it.
> 
> No chance of that, this patch is only about Enchant.

First, Enchant could be using Hunspell as its engine, right?

And second, AFAIU this discussion started by you proposing to get rid
of CASECHARS etc., for all spellers, not just for Enchant, something
that will definitely cause degradation.

>       Also, Aspell uses the dictionaries at least
>     for some of this info, see the function I pointed to above.
>
> Only for otherchars, not casechars/not-casechars.

Partial information is better than no information, IMO.

>     Bottom line, this information cannot be thrown away or ignored.  It is
>     important for correctly interfacing with a dictionary and for doing
>     TRT as the users expect.  Any modern speller program would benefit
>     from it, and therefore we should strive to provide such information to
>     ispell.el whenever we possibly can.
>
> It is not a question of throwing away or ignoring information: the
> information is simply not available through documented channels (at least for 
> Enchant). Yes, one can find the underlying engine and then use that 
> information to (try to) find the dictionaries, but one is then making a 
> number of brittle assumptions. And it's not clear that the information is 
> actually necessary to have.

It sounds like the important part of our disagreement is in the last
sentence.  If so, I hope I've succeeded in changing your mind.  Failing
that, all I can suggest is to study the spelling rules of a modern
speller, such as Hunspell, and see how this information is used there.

> It would be helpful if you could show a situation in which using [:graph:]
> for enchant dictionaries actually misbehaves in some way.

I tried to explain that above: you will get false positives or false
negatives and/or irrelevant or missing corrections from the speller.
For example, if you send "foo.bar", and the speller doesn't support '.'
as a word-constituent character, you will get separate suggestions for
"foo" and "bar", and won't get "foobar".
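
The effect can be sketched in a few lines of Python.  This is only an
illustration under simplifying assumptions, not how any real speller
tokenizes: the function name `split_words` is hypothetical, and
`casechars` and `otherchars` are passed as the contents of a regexp
character class, with `otherchars` allowed only between runs of
`casechars`.

```python
import re

def split_words(line, casechars, otherchars=""):
    """Split `line` into the words a speller would check.

    A word is a run of `casechars`, optionally joined by single
    `otherchars` characters that may appear only inside a word.
    Both arguments are character-class contents, e.g. "A-Za-z" or ".".
    """
    core = f"[{casechars}]+"
    if otherchars:
        pattern = f"{core}(?:[{otherchars}]{core})*"
    else:
        pattern = core
    return re.findall(pattern, line)

without_dot = split_words("foo.bar", "A-Za-z")     # '.' separates words
with_dot = split_words("foo.bar", "A-Za-z", ".")   # '.' allowed inside words

print(without_dot, with_dot)
```

In the first case the speller sees "foo" and "bar" as two independent
words; in the second it sees the single string "foo.bar" and can offer
corrections for it as a whole.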

I also don't understand why you want to remove this information: it is
already there, it is no harder to get with Enchant than without it, and
the code which supports it already exists.




