
[Freecats-Dev] OmegaT indexing features (from Keith's post)


From: Henri Chorand
Subject: [Freecats-Dev] OmegaT indexing features (from Keith's post)
Date: Wed, 26 Feb 2003 23:35:29 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi Keith,

I'm very happy to see the close similarities between what we began to plan and what you actually built.

To put it in a nutshell: Free CATS dreamed about it, and Keith did it.

And if you read further below, I think you'll agree with it  ;-)

>> (...) By hierarchical structure I mean the parse trees that
>> linguists represent using
>> x-bar grammars, eg.:
> (...)

> I agree with Henri here - a long while ago I tried to get into
> parsing sentences to achieve good 'fuzzy' matching, but I ended
> up spending a lot of time going nowhere.  Being a mono-linguist
> doesn't help, obviously, but even trying the technique when
> 'translating' technical babble into something lay-person
> understandable, using grammatical analysis didn't significantly
> help fuzzy matching despite the rather excessive up-front
> preparation.

I may add (to Charles) that several other brilliant minds have thought about this "ideal" approach. Along this path, some of them assumed the "real" world could be modelled, and even that language was a good representation of reality. The last "philosophy" bits I read on this subject (incidentally, I started from the Python language's Web site and found a nice study of Monty Python and the philosophical principles underlying their crazy stuff) ended up convincing me that this task is possibly not even achievable.


> With OmegaT I opted for an idiot solution - one a computer is good
> at as it requires no understanding, only indexing and plenty of
> boring computation.  The first step of this algorithm involves
> building an index of all words within a project and where those
> words occur (for example, the word "butter" might occur in segments
> 3, 88 and 91, so the index has "butter: 3, 88, 91").
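Just to make sure we are talking about the same thing, here is a minimal sketch of such an index in Python (the function and variable names are mine, for illustration only, not OmegaT's actual code):

    from collections import defaultdict

    def build_word_index(segments):
        # Map each word to the set of segment numbers where it occurs,
        # e.g. index["butter"] == {3, 88, 91}.
        index = defaultdict(set)
        for seg_no, text in enumerate(segments, start=1):
            for word in text.lower().split():
                index[word].add(seg_no)
        return index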

We thought about that too (with possible implementation variations, whatever) and like it that way. Thierry Sourbier (XML expert, webmaster of http://www.i18ngurus.com/) then suggested an idea which we quite like and that could also fit nicely into the picture. You take full words here. If you want to match a word's derived forms (singular instead of plural, a verb's conjugated forms, etc.) and you do NOT want to add a semantic layer that would cleverly recognize each of these, why not take sub-strings (N-Grams, in Thierry's terms) and let statistics do a great job nearly all the time?

The thing to be refined here is which substrings you take into account. I suggested a basic system with the following N-Gram extraction parameters; the idea is to play with them and possibly adapt them to any given natural language (a rough sketch in code follows the examples below). Here are my parameters:
param1 = minimum N-Gram length (I suggested 3)
param2 = whether to take N-Grams starting at the full word's beginning
param3 = whether to take N-Grams ending at the full word's end
param4 = whether to take N-Grams from the full word's middle

For instance, for French, you might have (4, yes, yes, no). For German, you might have (4, yes, yes, yes), because German has an awful lot of compound words and, doing things that way, you might increase the fuzzy search's capabilities.

To use your example below, let's take the word "approach"; you might extract (beginning of word):
approac approa appro appr

and if you really want it (language-dependent), you might also extract (end of word):
pproach proach roach oach

and even (middle of word):
pproac proac roac pproa ppro proa
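Here is a rough Python sketch of this extraction, with my four parameters; it reproduces the "approach" lists above (again my own illustration, not actual Free CATS or OmegaT code):

    def extract_ngrams(word, min_len=3, prefixes=True, suffixes=True,
                       middles=False):
        # Collect sub-strings of at least min_len characters, shorter
        # than the full word, anchored per the three boolean parameters.
        n = len(word)
        grams = set()
        if prefixes:    # param2: anchored at the word's beginning
            grams.update(word[:k] for k in range(min_len, n))
        if suffixes:    # param3: anchored at the word's end
            grams.update(word[-k:] for k in range(min_len, n))
        if middles:     # param4: strictly inside the word
            for start in range(1, n - 1):
                for end in range(start + min_len, n):
                    grams.add(word[start:end])
        return grams

    # German-style settings (4, yes, yes, yes):
    # extract_ngrams("approach", min_len=4, middles=True)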

You'll find something along these lines in Free CATS' DB Indexing document. So one has to build indexes for each sub-N-Gram (statistics show that the total number of N-Grams within a TM is not as much bigger than the total number of words within it as one might think).

The only other overhead is that you need to weight these in order to have the same total weight for all N-Grams obtained from any given full word; otherwise the weight of "short" words is going to look miserable compared to that of "long" words.
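A minimal way to do this weighting, assuming each full word should contribute a total weight of 1.0 (a sketch building on the extraction function above):

    def ngram_weights(word, **params):
        # Split a total weight of 1.0 evenly across the word's N-Grams,
        # so short and long words contribute equally to a match score.
        grams = extract_ngrams(word, **params)
        return {g: 1.0 / len(grams) for g in grams} if grams else {}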

Please tell us what you think about it, and whether you ever came up with a similar solution. Marc Prior will certainly have something to say here, as I understand he works with the German language, full of declensions ("déclinaison" - sorry for the French word, but my English linguistics vocabulary is VERY poor and the "Grand Dictionnaire" contains next to nothing in this domain).

> For fast access, each word is stored in a hash table as opposed to
>  a tree or list - this eats a lot of memory but provides excellent
> performance (which is important to scalability)

Yes, definitely.

> When making a 'fuzzy' match, I take a candidate string and perform
> a search on each word in that string within the entire project
> space and quickly end up with a list of segments that include one
> or more words from the candidate string.

Yes, I thought of doing the same with both full words and substrings.

> I arbitrarily drop any segment with less than half of the words
> the same and then continue processing this string subset.

Fine - this threshold value looks like a good rule of thumb.
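In code, the lookup plus that cut-off could look something like this (a sketch reusing the build_word_index() index from above; only the half-the-words rule comes from Keith's description):

    def candidate_segments(candidate, index):
        # Count how many of the candidate's words each segment shares,
        # then arbitrarily drop segments sharing fewer than half of them.
        words = set(candidate.lower().split())
        hits = {}
        for word in words:
            for seg_no in index.get(word, ()):
                hits[seg_no] = hits.get(seg_no, 0) + 1
        return {seg: n for seg, n in hits.items() if n >= len(words) / 2}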

> Two strings having many of the same words doesn't necessarily make
> them similar - what I've found tends to make them similar is the
> number of identical word pairs shared between two strings (at
> least this seems to hold in English).

This looks great.
Another "dumb" remark here. Especially for long sentences and technical documentation, the probability for having two sentences sharing several words (whatever their respective sequences) MAY be interesting for the translator in that it might simply mean less retyping, OR because that way, he is thus automatically reminded of some project terminology entries. So, word sequence discrepancies should probably be used more to lower a fuzzy segment's match score than to elimitate this segment.

> I then analyze the word pairs in the candidate string and the
> remaining segments.  From this analysis, OmegaT then displays
> the "fuzzy" match strings for the translator to see, highliting
> where words are added and removed (the source and target strings
> show what is unique to each) and also what words are similar but
> that don't have the same neighboring words.

Fine. We Trados users already find a simple highlighting system, where similar parts are highlighted, very useful, and I must add that Trados behaves very stupidly here - when a word occurs several times, it does not try to select the occurrence that follows the sequence, it only highlights the first one found.


> For example, these two similar strings have the following
> differences and would receive respectable matching scores (~70%)
> even though they are too different to practically recycle any
> translations between them.

> (snipping all the funny bits here)
> (there is red and underline formatting in the above sentences -
> if it's stripped out, the comparisons might not make much sense)
> Two sentences sharing identical words but in different orders
> would likely mean very different things in English (and most
> other languages as well, except Latin) and would have very low
> matching scores.

See my "dumb" remark above.

> Not science, magic or even high-tech.  It does seem to work
> reasonably well, however, and because all processing is done
> when the project opens, the information is immediately available
> to the translator when a segment comes up for translation - the
> delay should be 'constant' (and less than half a second) no
> matter how large the TM may be.

Do you mean that you pre-index the source segments in your bilingual document as soon as you build it?

Of course, this provides better response times, but it may not be ideal for retrieving all that there is to find in the TM, especially with a large project and/or several translators.

> This "fuzzy matching" solution doesn't not lend itself very
> well to environments where the user is able to move the segment
> markers - it can be accomodated (I have the technique) for but
> it will give the developer a serious migrain while trying to
> make it possible.

We thought about that for a while. We ended up with the following:
- moving the delimiters anywhere (not just after a segment delimiter) is not really what we need - better to allow the translator to merge two adjacent TUs (within a single paragraph) and to undo such a merge.

> BTW - I'm not subscribed to the list so if someone wishes to
> contact me, please do so directly.

Well, subscribing might be the simplest solution - up to you!


As a whole, I find Keith's work on OmegaT very much in line with our present design, and some of our ideas might also fit into the picture without requiring him to start all over again.

I'll wait for other project team members to express their thoughts.

Henri




