
[Freecats-Dev] OmegaT indexing features (from Keith's post)


From: Henri Chorand
Subject: [Freecats-Dev] OmegaT indexing features (from Keith's post)
Date: Wed, 26 Feb 2003 23:35:29 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi Keith,

I'm very happy to see the close similarities between what we began to plan and what you actually built.

To put it in a nutshell: Free CATS dreamed about it, and Keith did it.

And if you read further below, I think you'll agree with it  ;-)

>> (...) By hierarchical structure I mean the parse trees that
>> linguists represent using
>> x-bar grammars, eg.:
> (...)

> I agree with Henri here - a long while ago I tried to get into
> parsing sentences to achieve good 'fuzzy' matching, but I ended
> up spending a lot of time going nowhere.  Being a mono-linguist
> doesn't help, obviously, but even trying the technique when
> 'translating' technical babble into something lay-person
> understandable, using grammatical analysis didn't significantly
> help fuzzy matching despite the rather excessive up-front
> preparation.

I may add (to Charles) that several other brilliant minds have thought about this "ideal" approach. Along this path, some of them assumed the "real" world could be modelled, and even that language was a good representation of reality. The last "philosophy" bits I read on this subject (incidentally, I started from the Python language's Web site and found a nice study of Monty Python and the philosophical principles underlying their crazy stuff) ended up convincing me that this task is possibly not even achievable.


> With OmegaT I opted for an idiot solution - one a computer is good
> at as it requires no understanding, only indexing and plenty of
> boring computation.  The first step of this algorithm involves
> building an index of all words within a project and where those
> words occur (for example, the word "butter" might occur in segments
> 3, 88 and 91, so the index has "butter: 3, 88, 91").
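Just to make sure we are talking about the same thing, here is a minimal sketch of such an index in Python (the function and variable names are mine, for illustration only, not OmegaT's actual code):

    from collections import defaultdict

    def build_word_index(segments):
        # Map each word to the set of segment numbers where it occurs,
        # e.g. index["butter"] == {3, 88, 91}.
        index = defaultdict(set)
        for seg_no, text in enumerate(segments, start=1):
            for word in text.lower().split():
                index[word].add(seg_no)
        return index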

We thought about that too (with possible implementation variations, whatever) and like it that way. Thierry Sourbier (XML expert, webmaster of http://www.i18ngurus.com/) then suggested an idea which we quite like and that could also fit nicely into the picture. You take full words here. If you want to match a word's derived forms (singular instead of plural, a verb's conjugated forms, etc.) and you do NOT want to add a semantic layer that would cleverly recognize each of these, why not take sub-strings (N-Grams, in Thierry's terms) and let statistics do a great job nearly all the time?

The thing to be refined here is which substrings you take into account. I suggested a basic system with the following N-Gram extraction parameters; the idea is to play with them and possibly adapt them to any given natural language (a rough sketch in code follows the examples below). Here are my parameters:
param1 = minimum N-Gram length (I suggested 3)
param2 = whether to take N-Grams starting at the full word's beginning
param3 = whether to take N-Grams ending at the full word's end
param4 = whether to take N-Grams from the full word's middle

For instance, for French, you might have (4, yes, yes, no). For German, you might have (4, yes, yes, yes), because German has an awful lot of compound words and, doing things that way, you might increase the fuzzy search's capabilities.

To use your example below, let's take the word "approach"; you might extract (beginning of word):
approac approa appro appr

and if you really want it (language-dependent), you might also extract (end of word):
pproach proach roach oach

and even (middle of word):
pproac proac roac pproa ppro proa
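Here is a rough Python sketch of this extraction, with my four parameters; it reproduces the "approach" lists above (again my own illustration, not actual Free CATS or OmegaT code):

    def extract_ngrams(word, min_len=3, prefixes=True, suffixes=True,
                       middles=False):
        # Collect sub-strings of at least min_len characters, shorter
        # than the full word, anchored per the three boolean parameters.
        n = len(word)
        grams = set()
        if prefixes:    # param2: anchored at the word's beginning
            grams.update(word[:k] for k in range(min_len, n))
        if suffixes:    # param3: anchored at the word's end
            grams.update(word[-k:] for k in range(min_len, n))
        if middles:     # param4: strictly inside the word
            for start in range(1, n - 1):
                for end in range(start + min_len, n):
                    grams.add(word[start:end])
        return grams

    # German-style settings (4, yes, yes, yes):
    # extract_ngrams("approach", min_len=4, middles=True)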

You'll find something along these lines in Free CATS' DB Indexing document. So one has to build indexes for each sub-N-Gram (statistics show that the total number of N-Grams within a TM is not as much bigger than the total number of words within it as one might think).

The only other overhead is that you need to weight these in order to have the same total weight for all N-Grams obtained from any given full word; otherwise the weight of "short" words is going to look miserable compared to that of "long" words.
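A minimal way to do this weighting, assuming each full word should contribute a total weight of 1.0 (a sketch building on the extraction function above):

    def ngram_weights(word, **params):
        # Split a total weight of 1.0 evenly across the word's N-Grams,
        # so short and long words contribute equally to a match score.
        grams = extract_ngrams(word, **params)
        return {g: 1.0 / len(grams) for g in grams} if grams else {}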

Please tell us what you think about it, and whether you ever came up with a similar solution. Marc Prior will certainly have something to say here, as I understand he works with the German language, full of declensions ("déclinaison" - sorry for the French word, but my English linguistics vocabulary is VERY poor and the "Grand Dictionnaire" contains next to nothing in this domain).

> For fast access, each word is stored in a hash table as opposed to
>  a tree or list - this eats a lot of memory but provides excellent
> performance (which is important to scalability)

Yes, definitely.

> When making a 'fuzzy' match, I take a candidate string and perform
> a search on each word in that string within the entire project
> space and quickly end up with a list of segments that include one
> or more words from the candidate string.

Yes, I thought of doing the same with both full words and substrings.

> I arbitrarily drop any segment with less than half of the words
> the same and then continue processing this string subset.

Fine - this threshold value looks like a good rule of thumb.
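In code, the lookup plus that cut-off could look something like this (a sketch reusing the build_word_index() index from above; only the half-the-words rule comes from Keith's description):

    def candidate_segments(candidate, index):
        # Count how many of the candidate's words each segment shares,
        # then arbitrarily drop segments sharing fewer than half of them.
        words = set(candidate.lower().split())
        hits = {}
        for word in words:
            for seg_no in index.get(word, ()):
                hits[seg_no] = hits.get(seg_no, 0) + 1
        return {seg: n for seg, n in hits.items() if n >= len(words) / 2}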

> Two strings having many of the same words doesn't necessarily make
> them similar - what I've found tends to make them similar is the
> number of identical word pairs shared between two strings (at
> least this seems to hold in English).

This looks great.
Another "dumb" remark here. Especially for long sentences and technical documentation, the probability for having two sentences sharing several words (whatever their respective sequences) MAY be interesting for the translator in that it might simply mean less retyping, OR because that way, he is thus automatically reminded of some project terminology entries. So, word sequence discrepancies should probably be used more to lower a fuzzy segment's match score than to elimitate this segment.

> I then analyze the word pairs in the candidate string and the
> remaining segments.  From this analysis, OmegaT then displays
> the "fuzzy" match strings for the translator to see, highliting
> where words are added and removed (the source and target strings
> show what is unique to each) and also what words are similar but
> that don't have the same neighboring words.

Fine. We Trados users already find a simple highlighting system, where similar parts are highlighted, very useful, and I must add that Trados behaves very stupidly here - when a word occurs several times, it does not try to select the occurrence that follows the sequence, it only highlights the first one found.


> For example, these two similar strings have the following
> differences and would receive respectable matching scores (~70%)
> even though they are too different to practically recycle any
> translations between them.

> (snipping all the funny bits here)
> (there is red and underline formatting in the above sentences -
> if it's stripped out, the comparisons might not make much sense)
> Two sentences sharing identical words but in different orders
> would likely mean very different things in English (and most
> other languages as well, except Latin) and would have very low
> matching scores.

See my "dumb" remark above.

> Not science, magic or even high-tech.  It does seem to work
> reasonably well, however, and because all processing is done
> when the project opens, the information is immediately available
> to the translator when a segment comes up for translation - the
> delay should be 'constant' (and less than half a second) no
> matter how large the TM may be.

Do you mean that you pre-index the source segments in your bilingual document as soon as you build it?

Of course, this provides better response times, but it may not be ideal for retrieving all that there is to find in the TM, especially with a large project and/or several translators.

> This "fuzzy matching" solution doesn't not lend itself very
> well to environments where the user is able to move the segment
> markers - it can be accomodated (I have the technique) for but
> it will give the developer a serious migrain while trying to
> make it possible.

We thought about that for a while. We ended up with the following:
- moving the delimiters anywhere (not just after a segment delimiter) is not really what we need - better to allow the translator to merge two adjacent TUs (within a single paragraph) and to undo such a merge.

> BTW - I'm not subscribed to the list so if someone wishes to
> contact me, please do so directly.

Well, subscribing might be the simplest solution - up to you!


As a whole, I find Keith's work on OmegaT very much in line with our present design, and some of our ideas might also fit into the picture without requiring him to start all over again.

I'll wait for other project team members to express their thoughts.

Henri




