
[Freecats-Dev] Matching/indexing strategies (cont.)


From: Kemper DOC (Nerim)
Subject: [Freecats-Dev] Matching/indexing strategies (cont.)
Date: Tue, 8 Jul 2003 11:37:24 +0200

Hi all,


My short answer to Tim and Stanislav:

> >  -- Trujillo also presents a convincing case for "Stoplists",
> > i.e. a short list of words to be ignored for each language
> > treated. (He proposes "a, and, an, by, from, of, or, the,
> > that, to, with" for English.) This does contradict a General
> > Guideline(TM) that was mentioned at the meeting a
> > couple of weeks ago, namely that whatever we develop
> > should be as language-independent as possible. However,
> > consider the example given:
> >   Input: "Delete all the files in the folder."
> >   Match1: "Put all the cartridges in the safe."
> >   Match2: "Delete folder files."
> > Match2 is clearly better to the human eye, but it has only
> > three words in common with the Input, whereas Match1
> > has four. Match1 is also closer in overall length to the
> > input. Without eliminating "always irrelevant" words,
> > I'm not sure how we could get round this.
>
> This is language independent. The list will be simply empty for
> some languages. And matching for English would improve quite
> a lot.

Well, this must be an exception to the "nothing language-dependent" General
Guideline(TM).

I rather like the idea. Building up a list of "tool words" per language is
feasible.

As we don't want to become too language-dependent, I suggest lowering the
weight of each such tool word (for instance by half) but still indexing
them (or any similar solution), so that we decrease their relative
importance while still allowing the end-user to play with them.
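
To make the half-weight idea concrete, here is a minimal sketch in Python. It is
only an illustration, not a proposed design: the names STOPWORDS, tokenize,
weighted_overlap and the stop_weight parameter are all hypothetical, the score
is a plain count of shared words, and I have assumed that "nd" in the quoted
list stands for "and".

```python
import re
from collections import Counter

# Stop list quoted in this thread for English (assuming "nd" means "and").
# Languages without an entry simply get an empty list.
STOPWORDS = {
    "en": {"a", "and", "an", "by", "from", "of", "or", "the", "that", "to", "with"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude simplification)."""
    return re.findall(r"\w+", text.lower())


def weighted_overlap(a, b, lang="en", stop_weight=0.5):
    """Count the words shared by a and b, weighting tool words by stop_weight."""
    stop = STOPWORDS.get(lang, set())
    shared = Counter(tokenize(a)) & Counter(tokenize(b))  # per-word minimum counts
    return sum(n * (stop_weight if word in stop else 1.0)
               for word, n in shared.items())


source = "Delete all the files in the folder."
match1 = "Put all the cartridges in the safe."
match2 = "Delete folder files."

for w in (1.0, 0.5, 0.0):
    print(w, weighted_overlap(source, match1, stop_weight=w),
          weighted_overlap(source, match2, stop_weight=w))
```

With this particular scoring, a weight of 1.0 reproduces the 4-versus-3 count
from the example, a weight of 0.5 only brings the two matches level, and a
weight of 0.0 (a strict stop list) puts Match2 ahead. So the weight would
probably have to be a tunable parameter rather than a fixed "half".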


> - he also proposes several (clearly language specific) treatments
> for generating canonical forms of words (i.e. recognizing that
> "read/reads/reading" are all basically the same thing, as are
> "be/being/is/am/are/was/were"), etc.

This might be much trickier to implement.
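
To show where the difficulty lies, here is a deliberately naive sketch in
Python for English only. It is not Trujillo's method: the IRREGULAR table, the
SUFFIXES tuple and the canonical function are hypothetical names, and the
suffix rule is a crude stand-in for real morphological analysis.

```python
# Irregular forms need an explicit lookup table (here only the "be" example).
IRREGULAR = {
    "being": "be", "is": "be", "am": "be",
    "are": "be", "was": "be", "were": "be",
}

# Regular endings to strip, tried in this order.
SUFFIXES = ("ing", "s")


def canonical(word):
    """Return a rough canonical form of an English word."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Keep at least three letters so that short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w


print([canonical(w) for w in ("read", "reads", "reading")])
print([canonical(w) for w in ("be", "is", "were")])
print(canonical("making"))  # gives "mak", not "make": the rule is too crude
```

Even for English, the simplest case, two suffix rules already produce wrong
forms, and each irregular verb needs its own table entry.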

Among the Western languages I know or have vaguely heard of, I can (in a
quick and dirty way) rank several by increasing complexity of conjugation
("conjugaison" in French), from the simplest to the most complicated:
English
French, Italian, probably Spanish
German
Finnish, Hungarian

So, as a general suggestion, I would appreciate it if, each time a team
member suggests such a feature, he/she tried to evaluate it against a few
natural languages in the "moderately" and "awfully" complicated categories.
I believe Janet Ormrod, from ENSTB, might be most helpful on this.

At the very least, if we are able to express indexing-related processing as
a set of rules and parameter values, then each time we want to add a new
language we'll be able to reuse an existing set of rules (and parameter
values) or create a new one. For instance:
Set 1   English
Set 2   French, Spanish, Italian, Portuguese ("latin" languages)
Set 3   German
... etc.
Set 47  Chinese - for release 13.47.124, hopefully ;-)
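
The grouping above could be sketched roughly as follows in Python. This is
only an assumption about how such rule sets might be stored: RuleSet,
RULE_SETS, LANGUAGE_TO_SET and rules_for are hypothetical names, the language
codes are my own shorthand, and the stop lists for Sets 2 and 3 are left
empty rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """One reusable bundle of indexing rules and parameter values."""
    name: str
    stopwords: frozenset = frozenset()
    stop_weight: float = 0.5  # relative weight given to tool words


RULE_SETS = {
    "set1": RuleSet("Set 1", frozenset({"a", "an", "the", "of", "to"})),
    "set2": RuleSet("Set 2"),  # stop list still to be written
    "set3": RuleSet("Set 3"),  # stop list still to be written
}

# Several languages may share one rule set.
LANGUAGE_TO_SET = {
    "en": "set1",
    "fr": "set2", "es": "set2", "it": "set2", "pt": "set2",
    "de": "set3",
}

# Fallback for languages nobody has looked at yet: no tool words at all.
DEFAULT_RULES = RuleSet("default")


def rules_for(lang):
    """Return the rule set for a language code, or the default one."""
    return RULE_SETS.get(LANGUAGE_TO_SET.get(lang), DEFAULT_RULES)


print(rules_for("fr").name, rules_for("es").name, rules_for("fi").name)
```

Adding a language then means either one new line in the mapping (reusing an
existing set) or one new RuleSet entry.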

A last thing for now, about natural languages:

- We will include Esperanto. Did Tim tell you he's very fluent in Esperanto
and localizes free software into this language?

- We will include Klingon. This is for marketing purposes - yes, I'm
serious. Can you imagine?
"The first Klingon-compatible CAT software in the galaxy"


Cheers,

Henri Chorand




