
RE: [Freecats-Dev] Translation Units indexing - a first draft


From: Thierry Sourbier
Subject: RE: [Freecats-Dev] Translation Units indexing - a first draft
Date: Thu, 23 Jan 2003 02:02:14 +0100

A few quick comments:

1) Unicode character properties should be used to determine whether a character
is a letter, digit, or punctuation mark.

2) Word breaking is a challenging problem (think Japanese, Thai...).

3) You'll need to apply some kind of normalization, such as case folding;
otherwise "text" and "TEXT" will never fuzzy-match. What about accents?
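For points 1) and 3), here is a minimal sketch in Python (Python is used here only for illustration, it is not part of the project; whether to fold accents is a per-language policy decision, not something the original settles):

```python
import unicodedata

def char_class(ch):
    """Classify a character via its Unicode general category:
    'L*' = letter, 'N*' = number, everything else counts as separator here."""
    cat = unicodedata.category(ch)
    if cat.startswith(("L", "N")):
        return "word"
    return "separator"

def normalize(text):
    """Case-fold, then strip accents by NFD-decomposing and dropping
    combining marks, so that "TEXT"/"text" (and "é"/"e") compare equal."""
    folded = text.casefold()
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```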

4) If you want the TM to scale, only store the smallest "sub-words" in the
index (N-grams in the literature).

Otherwise, for a word of 8 characters with a minimum sub-word length of 4,
that's 15 words and sub-words that you'll need to store:
1 - 8-character "word"
2 - 7-character "sub-words"
...
5 - 4-character "sub-words"

Even if memory is cheap we shouldn't waste it :).
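The counting above can be reproduced with a short sketch (illustrative Python; `min_len` = 4 matches the arithmetic of point 4):

```python
def sub_words(word, min_len=4):
    """Enumerate all contiguous sub-strings of `word` whose length is
    at least `min_len`, including the word itself."""
    n = len(word)
    return [word[i:i + length]
            for length in range(n, min_len - 1, -1)   # longest first
            for i in range(n - length + 1)]
```

For an 8-character word this yields 1 + 2 + 3 + 4 + 5 = 15 strings; lowering the minimum to 3 raises it to 21.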

------
Given 2) and 4), maybe you don't care about words anymore and just split
the TU quickly into N-grams. It worked well for us :).
------
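Splitting a whole TU directly into character N-grams, ignoring word boundaries, could look like this (illustrative Python; n=3 is an arbitrary choice, not from the original):

```python
def char_ngrams(text, n=3):
    """Split a segment into overlapping character N-grams,
    with no word-breaking step at all."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```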

Optimization: not all N-grams are born equal, and some are more frequent than
others in each language. Based on this, optimization methods can be used
to filter out the less relevant ones, e.g. "ing" in English.
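One way to sketch such filtering is by document frequency across TUs (illustrative Python; the 0.5 threshold is an illustrative guess, not a value from the original):

```python
from collections import Counter

def filter_frequent_ngrams(indexed_tus, max_doc_ratio=0.5):
    """Drop N-grams that occur in more than `max_doc_ratio` of the TUs;
    near-ubiquitous grams (like "ing" in English text) carry little
    discriminating power and bloat the index."""
    doc_freq = Counter()
    for grams in indexed_tus:
        doc_freq.update(set(grams))
    limit = max_doc_ratio * len(indexed_tus)
    return [{g for g in grams if doc_freq[g] <= limit}
            for grams in indexed_tus]
```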

That's my 2 cents,
Cheers,

Thierry.







-----Original Message-----
From: address@hidden
[mailto:address@hidden] on behalf of Henri Chorand
Sent: Thursday, 23 January 2003 00:56
To: Free CATS Dev list
Cc: Charles Stewart
Subject: [Freecats-Dev] Translation Units indexing - a first draft


Hi all,

I posted my previous message in answer to some very nice feedback from Charles
Stewart, who replied to David's message on Advogato.

Charles is a very good example of the kind of senior developer we are
presently trying to get interested in our project, in order to see it
really get off the ground. In fact, as a scientist, he is involved
in various computing-related research work, as you can see at:
http://www.linearity.org/cas/

So, I hope he (and several others) can join us and provide most valuable
contributions. In any case, it's another encouraging event, and it shows that
the Free CATS project's aim is relevant, to begin with.

Today, I had a couple of phone calls with Julien Poireau, and (among
other things) we tried to work out how we could index source segments in
translation units. This helped me produce a first draft (see below).
While it is still a long way from a detailed algorithm, I hope my
suggestion can at least help stimulate our brains.

David Welton recently sent us a link to Zebra, a text database server
released under the GPL which, judging by its documentation pages, seems to
incorporate many of the features we're looking for.

As with any big piece of existing software that we examine in order to
determine whether we could adapt and use it, it takes a lot of time simply
to read its documentation, understand how it works and see how we could
adapt it - and this does not prevent us from thinking on our own about
how to build a database server from scratch.

If we are to adapt such software, we still need to be able to map our
concepts onto the existing product's, and to express very carefully
and in some detail what still needs to be done to it in order to
obtain what we want - and this whether or not we get help from that
project's team.


----------------------------------------------------------------
  Source segments indexing method by a Translation Memory server
----------------------------------------------------------------

(Sorry if you come across improper English terms, this is a DRAFT)

1) Parsing
We need to parse the source segment and split it into a sequence (an
ordered list) of items (words, separators and tags).
Definitions:
Word            sequence of contiguous alphabetic and/or numeric characters
Separator       sequence of contiguous non-alphabetic, non-numeric characters:
                        space
                        tab
                        punctuation marks . , ; : ! ? ¿ ¡
                        ' " < > + - * / = _ ( ) [ ] { } hyphens & similar
                        various symbols (etc.)
                        non-breakable space
Tag             tag belonging to our list of internal tags (let's assume our
file is already converted)

For each item:
        If it is a word, we extract a list of all sub-words (sub-strings)
whose length is >= a language-specific minimum
        (for example: 3 for French or English)
        (we consider each word and each of its sub-words to be alphabetic
strings)
        If it is a tag, we convert it into one of the following items:
                standalone tag
                beginning-type tag
                end-type tag
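The parsing step above could be sketched as follows (illustrative Python; the `{...}` internal-tag syntax is a hypothetical placeholder, since the actual internal-tag format isn't specified here):

```python
import re

# Hypothetical internal-tag syntax: anything between { and } is a tag.
# Order matters: try a tag first, then a word, then a separator run.
TOKEN_RE = re.compile(r"\{[^}]*\}|\w+|[^\w{]+|\{")

def parse_segment(segment):
    """Split a source segment into an ordered list of (kind, text) items,
    with kind in {"tag", "word", "separator"}."""
    items = []
    for m in TOKEN_RE.finditer(segment):
        tok = m.group()
        if len(tok) >= 2 and tok.startswith("{") and tok.endswith("}"):
            items.append(("tag", tok))
        elif tok[0].isalnum() or tok[0] == "_":
            items.append(("word", tok))
        else:
            items.append(("separator", tok))
    return items
```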


2) Indexing

For each word and sub-word, we create an index entry pointing towards
the TU's ID (basic step in order to be able to retrieve the TU during
queries).
For each TU, we also create the following index entry: the comprehensive list
of all values indexed for this TU (this will make queries faster).

When creating a TU, the server automatically assigns it an ID. So, we do
NOT care about the sequence of items in the source segment.
This may seem weird, but in fact, the more words two source segments
share, the more likely they are to have the same, or similar, contents.
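A minimal sketch of this indexing step (illustrative Python; the minimum sub-word length of 3 follows the French/English example above, and case folding follows Thierry's normalization remark):

```python
from collections import defaultdict

def sub_words(word, min_len=3):
    """All contiguous sub-strings of `word` of length >= min_len,
    including the word itself, as a set."""
    n = len(word)
    return {word[i:i + length]
            for length in range(min_len, n + 1)
            for i in range(n - length + 1)}

def build_index(tus):
    """tus: mapping TU id -> list of source-segment words.
    Returns (inverted, per_tu): `inverted` maps each word/sub-word to the
    set of TU ids containing it; `per_tu` keeps the comprehensive list of
    values indexed for each TU, as the draft suggests."""
    inverted = defaultdict(set)
    per_tu = {}
    for tu_id, words in tus.items():
        values = set()
        for w in words:
            values |= sub_words(w.casefold())
        for v in values:
            inverted[v].add(tu_id)
        per_tu[tu_id] = values
    return inverted, per_tu
```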


3) Looking for matches (fuzzies)

For ANY query (whether looking for a TU or performing a context search):
Let's call the segment for which we are looking for matches the starting
segment.

We build a comprehensive list of all words and sub-words in the starting
segment.

We look for all the TUs that have
        the highest number of matches at the string level
        (i.e. for which an index entry exists for the starting string considered)
AND that have
        the smallest number of non-matches
        (index entries of other TUs not found in the starting segment's list of
index entries)

We then apply penalties for:
        any variation in the respective SEQUENCES of words and sub-words
within TUs
        any variation in separators and tags
        and of course, the score of a sub-word is lower than that of a full
word (by half, for instance)

We can then sort the full contents of the TM in decreasing order of relevance
against our starting segment.
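The match/non-match ranking above, leaving aside the sequence and separator/tag penalties, could be sketched as (illustrative Python; subtracting non-matches from matches is one arbitrary way to combine the two criteria, not the draft's definitive formula):

```python
def score_candidates(query_values, per_tu):
    """Rank TU ids: favour a high count of index values shared with the
    starting segment and a low count of values present in the TU but
    absent from the query. Returns TU ids, best first."""
    scores = {}
    for tu_id, values in per_tu.items():
        matches = len(query_values & values)
        non_matches = len(values - query_values)
        scores[tu_id] = matches - non_matches  # one simple combination
    return sorted(scores, key=scores.get, reverse=True)
```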

Apart from that, we need to build indexes in a way that allows large
increases in a TM's number of TUs. Think about it: Trados often forces
translators to reorganize their TM indexes...


I hope I'm clear enough.


Regards,

Henri Chorand



_______________________________________________
Freecats-dev mailing list
address@hidden
http://mail.nongnu.org/mailman/listinfo/freecats-dev




