
[Freecats-Dev] Python; matching/indexing strategies


From: Tim Morley
Subject: [Freecats-Dev] Python; matching/indexing strategies
Date: Tue, 8 Jul 2003 02:26:00 +0200

Hi there.

Just in case we were still in any doubt about Python:
http://peek-a-booty.org/Docs/WhichLanguageDoYouRecommend.htm

:o)


In between getting myself up to speed in Python, Zope and (for good measure)
trying to compare Berkeley DB with MySQL, I've been doing a bit of research
into indexing and matching algorithms. First off was an over-dinner chat
with a competent and experienced developer friend, just about all of whose
ideas were then confirmed and expanded upon in the relevant chapter of
"Translation Engines: Techniques for Machine Translation" by Arturo Trujillo
(Springer, 2000).

To summarise:
 -- it seems silly to try to develop our own file format, indexing/search
routines, etc. without making use of some sort of existing database
software; we'll have to end up managing conflicts and simultaneous access
and so on anyway, so why re-invent the wheel?
 -- fuzzy-matching sentences is a less-than-exact science (but we knew that
already). Trujillo details several coefficients for measuring sentence
similarity which apparently give reasonable results. There are several pages
of closely typed print which I hesitate to copy-type here, but I could maybe
snail-mail photocopies of the relevant pages to interested parties. (Or you
could go and buy your own copy of my former university lecturer's book).
:o)
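Rather than copy-type those pages, here's a rough Python sketch of one common word-overlap measure (the Dice coefficient). I'm not claiming this is exactly one of Trujillo's formulations -- just the general flavour of this family of coefficients:

```python
def dice(s1, s2):
    """Dice coefficient over word sets: 2*|A & B| / (|A| + |B|)."""
    a = set(s1.lower().split())
    b = set(s2.lower().split())
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# 1.0 means identical word sets, 0.0 means nothing in common.
print(dice("Delete all the files in the folder",
           "Delete folder files"))
```

Anything scoring above some threshold could then be offered to the translator as a fuzzy match.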
 -- Trujillo also presents a convincing case for "Stoplists", i.e. a short
list of words to be ignored for each language treated. (He proposes "a, and,
an, by, from, of, or, the, that, to, with" for English.) This does
contradict a General Guideline(TM) that was mentioned at the meeting a
couple of weeks ago, namely that whatever we develop should be as
language-independent as possible. However, consider the example given:
  Input: "Delete all the files in the folder."
  Match1: "Put all the cartridges in the safe."
  Match2: "Delete folder files."
Match2 is clearly better to the human eye, but it has only three words in
common with the Input, whereas Match1 has four. Match1 is also closer in
overall length to the input. Without eliminating "always irrelevant" words,
I'm not sure how we could get round this.
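To make the point concrete, here's a quick Python sketch (my own toy scoring -- just counting shared words with multiplicity, not anything lifted from the book) showing how the stoplist flips the ranking on exactly this example:

```python
from collections import Counter

STOPLIST = {"a", "and", "an", "by", "from", "of", "or",
            "the", "that", "to", "with"}

def words(sentence, stoplist=frozenset()):
    toks = [w.strip(".").lower() for w in sentence.split()]
    return [w for w in toks if w not in stoplist]

def shared(s1, s2, stoplist=frozenset()):
    """Words in common between two sentences, counted with multiplicity."""
    c1 = Counter(words(s1, stoplist))
    c2 = Counter(words(s2, stoplist))
    return sum((c1 & c2).values())

inp = "Delete all the files in the folder."
m1 = "Put all the cartridges in the safe."
m2 = "Delete folder files."

# Without a stoplist, Match1 beats Match2 (4 shared words vs 3)...
print(shared(inp, m1), shared(inp, m2))
# ...but with the stoplist, Match2 wins (2 vs 3), as a human would expect.
print(shared(inp, m1, STOPLIST), shared(inp, m2, STOPLIST))
```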
 -- he also proposes several (clearly language-specific) treatments for
generating canonical forms of words (i.e. recognizing that
"read/reads/reading" are all basically the same thing, as are
"be/being/is/am/are/was/were", etc.). The second of course can only be treated
by built-in tables of irregular groups of words on a per-language basis, but
Trujillo proposes "successor variety" as a simple routine capable of
producing reasonable results without knowing what language it is working on
for the case of regular inflected forms; it does however require some sort
of corpus before it becomes effective. The technique, briefly, takes each
initial group of letters (prefix) of a word and counts how many distinct
letters can follow it across the corpus. That count generally falls as the
prefix gets longer, but jumps sharply at a morpheme boundary. One can then
choose the longest group within a word and call it the root. Not perfect,
but pretty usable, it would seem.
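Here's my understanding of the idea in Python -- a toy corpus and my own naming, so take it as a sketch of the principle rather than Trujillo's actual algorithm:

```python
def successor_variety(word, corpus):
    """For each prefix of `word`, count the distinct letters that can
    follow that prefix anywhere in the corpus."""
    result = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        successors = {w[i] for w in corpus
                      if w.startswith(prefix) and len(w) > i}
        result.append((prefix, len(successors)))
    return result

corpus = ["read", "reads", "reading", "reader", "ready",
          "real", "reason", "red", "rest"]

# The count peaks at the prefix "read" (4 possible successors: s, i,
# e, y), suggesting a morpheme boundary there -- so "read" would be
# taken as the root of "reading".
for prefix, n in successor_variety("reading", corpus):
    print(prefix, n)
```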
 -- he then talks about "Inverted files", a good method for indexing a TM.
If we assume our matching routine is going to be based on the
number-of-words-in-common idea (i.e. we're not delving into the murky world
of semantic analysis, in-built thesauri, etc.), this would seem to be the
way to go. It's pretty much what we described at the meeting anyway -- we
keep a list of sentences, and a list of words, and a record for each word of
which sentences it appears in. For the next input sentence, then, we can
easily find a list of possible matches which contain at least one word in
common, thereby avoiding laboriously comparing the sentence with the whole
database. He also mentions a few other (more elaborate) proposals -- binary
search on a sorted file, B-trees and hashing, this latter apparently treated
at length in another book (see below). If we can lay our hands on a copy of
this book, it probably wouldn't do any harm.
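An inverted file is simple enough to sketch in a few lines of Python (again, my own toy version, not code from the book):

```python
from collections import defaultdict

class InvertedIndex:
    """Keeps a list of sentences plus, for each word, the set of
    sentence ids it appears in (the "postings")."""
    def __init__(self):
        self.sentences = []
        self.postings = defaultdict(set)

    def add(self, sentence):
        sid = len(self.sentences)
        self.sentences.append(sentence)
        for w in sentence.lower().split():
            self.postings[w].add(sid)
        return sid

    def candidates(self, sentence):
        """Ids of stored sentences sharing at least one word with the
        input -- the only ones worth running the fuzzy match against."""
        ids = set()
        for w in sentence.lower().split():
            ids |= self.postings.get(w, set())
        return ids

idx = InvertedIndex()
idx.add("delete folder files")
idx.add("put the cartridges in the safe")
idx.add("open the file")
print(sorted(idx.candidates("delete all the files")))  # [0, 1, 2]
```

Only the candidate set then needs scoring with whatever similarity coefficient we settle on, instead of the whole database.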

Anyway, that's where I'm up to. I'm sure Michael will get this anyway, but
I'm going to get in touch with him and see what we can do for each other.
:o)

Cheers all. A bientôt.


Tim


*For details on hashing (and possibly B-trees, whatever they are):
Frakes W. B. and R. Baeza-Yates (Eds.) (1992) Information Retrieval: Data
Structures & Algorithms. New Jersey: Prentice Hall PTR. See
http://www.amazon.com/exec/obidos/tg/detail/-/0134638379/qid=1057620725/sr=1-1/ref=sr_1_1/102-4877380-4379349?v=glance&s=books




