
[Freecats-Dev] Translation Units indexing - a first draft


From: Henri Chorand
Subject: [Freecats-Dev] Translation Units indexing - a first draft
Date: Thu, 23 Jan 2003 00:55:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi all,

I posted my previous message in answer to some very nice feedback from Charles Stewart, who had replied to David's message on Advogato.

Charles is a good example of the kind of senior developer we are currently trying to interest in our project so that it can really get off the ground. As a scientist, he is involved in various computing-related research, as you can see at:
http://www.linearity.org/cas/

So I hope he (and several others) can join us and make most valuable contributions. In any case, it is another encouraging event, and it shows that the Free CATS project's aim is relevant, to begin with.

Today, I had a couple of phone conversations with Julien Poireau, and (among other things) we tried to work out how we could index the source segments of translation units. This helped me produce a first draft (see below). While it is still a long way from a detailed algorithm, I hope my suggestion can at least stimulate our thinking.

David Welton recently sent us a link to Zebra, a text database server released under the GPL which seems to incorporate many of the features we are looking for (judging by its documentation pages).

As with any large piece of existing software that we examine to see whether we could adapt and use it, simply reading the documentation to understand how it works and how we might adapt it takes a lot of time - and that should not stop us from thinking on our own about how to build a database server from scratch.

If we are to adapt such software, we still need to be able to map our concepts onto the existing product's, and to spell out carefully and in some detail what would still need to be done to it in order to obtain what we want - and this whether or not we get help from that project's team.


----------------------------------------------------------------
 Source segments indexing method by a Translation Memory server
----------------------------------------------------------------

(Sorry if you come across improper English terms - this is a DRAFT)

1) Parsing

We need to parse the source segment and split it into a sequence (an ordered list) of items (words, separators and tags).

Definitions:
Word            a sequence of contiguous alphabetic and/or numeric characters
Separator       a sequence of contiguous non-alphabetic, non-numeric characters:
                        space
                        tab
                        punctuation marks . , ; : ! ? and inverted ¿ ¡
                        ' " < > + - * / = _ ( ) [ ] { } hyphens and similar symbols (etc.)
                        non-breaking space
Tag             a tag belonging to our list of internal tags (we assume the file has already been converted)
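The parsing rules above can be sketched roughly as follows. This is only a minimal illustration, not the project's code: the item kinds ("word", "separator", "tag") are assumed names, and the {1} / {/1} tag syntax is a hypothetical placeholder, since our internal tag list is not yet fixed.

```python
import re

TAG_RE = re.compile(r"\{/?\d+/?\}")    # placeholder for "our list of internal tags"
WORD_RE = re.compile(r"[A-Za-z0-9]+")  # contiguous alphanumeric characters

def parse_segment(segment):
    """Split a source segment into an ordered list of (kind, text) items."""
    items = []
    pos = 0
    while pos < len(segment):
        tag = TAG_RE.match(segment, pos)
        if tag:
            items.append(("tag", tag.group()))
            pos = tag.end()
            continue
        word = WORD_RE.match(segment, pos)
        if word:
            items.append(("word", word.group()))
            pos = word.end()
            continue
        # Anything else belongs to a separator run; extend it until the
        # next word or tag starts.
        end = pos
        while end < len(segment) and not (TAG_RE.match(segment, end)
                                          or WORD_RE.match(segment, end)):
            end += 1
        items.append(("separator", segment[pos:end]))
        pos = end
    return items
```

For example, parse_segment("Click {1}here{/1}!") yields the word "Click", a space separator, the tags and the word "here", and a final "!" separator, in order.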

For each item:
        If it is a word, we extract the list of all its sub-words (sub-strings) whose length is >= a language-specific minimum
        (for example: 3 for French and English)
        (we consider each word and each of its sub-words to be alphabetic strings)
        If it is a tag, we classify it as one of the following items:
                standalone tag
                beginning-type tag
                end-type tag
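As a small sketch of these two per-item steps - again with the hypothetical {1} / {/1} / {2/} tag syntax standing in for our not-yet-fixed internal tags, and the function names being my own:

```python
def subwords(word, min_len=3):
    """All sub-strings of `word` with length >= min_len (the language-specific
    minimum, e.g. 3 for French and English). The word itself is included."""
    w = word.lower()
    return {w[i:j] for i in range(len(w)) for j in range(i + min_len, len(w) + 1)}

def classify_tag(tag):
    """Classify an internal tag as standalone, beginning-type or end-type,
    assuming the hypothetical {1} / {/1} / {2/} syntax."""
    if tag.startswith("{/"):
        return "end"
    if tag.endswith("/}"):
        return "standalone"
    return "begin"
```

For instance, subwords("cats") gives the three alphabetic strings "cat", "ats" and "cats".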


2) Indexing

For each word and sub-word, we create an index entry pointing to the TU's ID (the basic step needed to retrieve the TU during queries).
For each TU, we also create the following index entry:
        the comprehensive list of all values indexed for that TU (this will make queries faster).

When creating a TU, the server automatically assigns it an ID. Note that we do NOT care about the sequence of items in the source segment. This may seem weird, but in fact, the more words two source segments share, the more likely they are to have the same or similar contents.
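This indexing scheme can be sketched as an inverted index plus a per-TU term list. The class and method names (TMIndex, add_tu) are assumptions of mine, since the actual server design is still open:

```python
from collections import defaultdict
from itertools import count

class TMIndex:
    """Sketch of the word/sub-word index described above."""

    def __init__(self):
        self._next_id = count(1)          # the server assigns TU IDs itself
        self.postings = defaultdict(set)  # term -> set of TU IDs
        self.tu_terms = {}                # TU ID -> all terms indexed for it

    def add_tu(self, terms):
        """Store a TU's words and sub-words; returns the assigned TU ID."""
        tu_id = next(self._next_id)
        term_set = set(terms)
        for term in term_set:             # one index entry per word/sub-word
            self.postings[term].add(tu_id)
        self.tu_terms[tu_id] = term_set   # the per-TU list that speeds up queries
        return tu_id
```

Note that only the set of terms is stored here, consistent with ignoring item order at indexing time; order only comes back into play through penalties at match time.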


3) Looking for matches (fuzzies)

For ANY query (whether looking for a TU or in a context search):
Let's call the segment for which we are looking for matches the starting segment.

We build a comprehensive list of all words and sub-words in the starting segment.

We look for all the TUs that have
        the highest number of matches at the string level
        (i.e. those for which an index entry exists for each starting string considered)
AND that have
        the smallest number of non-matches
        (index entries of the TU that are not found among the starting segment's index entries)

We then apply penalties for:
        any variation in the respective SEQUENCES of words and sub-words within the TUs
        any variation in separators and tags
Of course, the score of a sub-word is lower than that of a full word (half of it, for instance).

We can then sort the full contents of the TM in order of decreasing relevance against our starting segment.
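The match-then-penalize idea above could be scored roughly as follows. All the numeric weights here (0.5 for sub-words, 0.25 per non-match) are illustrative assumptions, and the sequence/separator/tag penalties are omitted from this sketch:

```python
def score_tu(query_terms, tu_terms, full_words):
    """Relevance of one TU against the starting segment.

    query_terms -- set of words and sub-words of the starting segment
    tu_terms    -- set of terms indexed for the TU
    full_words  -- the subset of query_terms that are full words

    Full-word matches count 1.0 and sub-word matches 0.5 (an assumed
    ratio); each non-match (a TU term absent from the query) costs an
    assumed penalty of 0.25.
    """
    matches = query_terms & tu_terms
    score = sum(1.0 if t in full_words else 0.5 for t in matches)
    score -= 0.25 * len(tu_terms - query_terms)
    return score

def rank(query_terms, full_words, tm):
    """Sort TU IDs by decreasing relevance (tm maps TU ID -> term set)."""
    return sorted(tm, reverse=True,
                  key=lambda tu: score_tu(query_terms, tm[tu], full_words))
```

For example, against the query terms {"cat", "cats"} (with "cats" the only full word), a TU indexed as {"cat", "cats", "mat"} scores 1.0 + 0.5 - 0.25 = 1.25 and outranks a TU indexed as {"dog"}.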

Apart from that, we need to build the indexes in a way that allows a TM's number of TUs to grow substantially. Think about it: Trados often forces translators to reorganize its TM indexes...


I hope I'm clear enough.


Regards,

Henri Chorand




