freecats-dev

[Freecats-Dev] Re: Trados/other CAT, Python/Java, German/English


From: Keith Godfrey
Subject: [Freecats-Dev] Re: Trados/other CAT, Python/Java, German/English
Date: Tue, 25 Feb 2003 15:11:44 -0600
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01

> Hope to send a message on this later this week.  By hierarchical
> structure I mean the parse trees that linguists represent using
> x-bar grammars, eg.:
>
>                   Sentence
>                  /        \
>        Noun phrase        Verb phrase
>         /      \           /       \
>  Determiner    Noun     Verb     Noun phrase
>      |          |         |        /      \
>     The        cat       sat    Prep    Noun phrase
>                                   |       /     \
>                                  on     Det     Noun
>                                          |       |
>                                         the     mat
> <snip>
> I hope Keith will soon detail us how he did things with OmegaT.

I agree with Henri here - a long while ago I tried to get into parsing sentences to achieve good 'fuzzy' matching, but I ended up spending a lot of time going nowhere.  Being a mono-linguist doesn't help, obviously, but even when I tried the technique while 'translating' technical babble into something a layperson could understand, grammatical analysis didn't significantly help fuzzy matching despite the rather excessive up-front preparation.

With OmegaT I opted for an idiot solution - one a computer is good at, as it requires no understanding, only indexing and plenty of boring computation.  The first step of this algorithm involves building an index of all words within a project and where those words occur (for example, the word "butter" might occur in segments 3, 88 and 91, so the index has "butter: 3, 88, 91").  For fast access, each word is stored in a hash table as opposed to a tree or list - this eats a lot of memory but provides excellent performance (which is important to scalability).

When making a 'fuzzy' match, I take a candidate string, look up each of its words across the entire project space, and quickly end up with a list of segments that include one or more words from the candidate string.  I arbitrarily drop any segment that shares fewer than half of the words and then continue processing the remaining subset.
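
To make the indexing and filtering steps concrete, here is a minimal Python sketch.  It is not OmegaT's code (OmegaT is written in Java): the names tokenize, build_index and candidates, the punctuation stripping and the sample segments are assumptions made for illustration, and "half of the words" is read here as half of the candidate string's distinct words.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (an assumption)."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]

def build_index(segments):
    """Map each word to the set of segment numbers in which it occurs."""
    index = defaultdict(set)  # dict and set are both hash tables
    for seg_id, text in segments.items():
        for word in tokenize(text):
            index[word].add(seg_id)
    return index

def candidates(query, index):
    """Ids of segments sharing at least half of the query's distinct words."""
    query_words = set(tokenize(query))
    hits = defaultdict(int)
    for word in query_words:
        for seg_id in index.get(word, ()):
            hits[seg_id] += 1
    # Assumed cut-off: keep a segment only if it shares >= half the query words.
    return sorted(s for s, n in hits.items() if 2 * n >= len(query_words))

# Hypothetical project segments, numbered as in the "butter" example.
segments = {
    3: "Spread the butter on the bread.",
    88: "Melt butter in a pan.",
    91: "Add salt and butter.",
    100: "Open the window.",
}
index = build_index(segments)
print(sorted(index["butter"]))
print(candidates("Melt the butter in a pan", index))
```

Only the segments returned by candidates would go on to the more expensive comparison; everything else is discarded using nothing but hash lookups.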

Two strings having many of the same words doesn't necessarily make them similar - what I've found tends to make them similar is the number of identical word pairs shared between two strings (at least this seems to hold in English).  I then analyze the word pairs in the candidate string and the remaining segments.  From this analysis, OmegaT then displays the "fuzzy" match strings for the translator to see, highlighting where words are added and removed (the source and target strings show what is unique to each) and also which words are shared but don't have the same neighboring words.
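
As a rough illustration of the word-pair idea, the Python sketch below counts adjacent word pairs shared by two strings.  The scoring formula (a Dice coefficient over the two sets of pairs) is my assumption, since the message does not give OmegaT's actual formula, and the names tokenize, word_pairs and pair_score are hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (an assumption)."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]

def word_pairs(text):
    """Set of adjacent (word, next_word) pairs in the text."""
    words = tokenize(text)
    return set(zip(words, words[1:]))

def pair_score(candidate, segment):
    """Dice coefficient over shared word pairs: 0.0 (none) to 1.0 (identical sets)."""
    a, b = word_pairs(candidate), word_pairs(segment)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Same three words, different order: no adjacent pair survives.
print(pair_score("dog bites man", "man bites dog"))
# One word changed: half of the pairs survive.
print(pair_score("the dog bites the man", "the dog bites a man"))
```

A plain shared-word count would rate the first pair of strings as a perfect match, which is the weakness the word-pair comparison is meant to address.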

For example, these two similar strings have the following differences and would receive respectable matching scores (~70%), even though they are too different for any translation to be practically recycled between them:

If we are careful enough about what our prerequisites really are and what may be improved later, and if we keep a modular approach, I believe we can't make blatant mistakes.

If we really are careful enough about what our prerequisites are, and if we keep a modular approach, I believe we can't make mistakes.

(there is red and underline formatting in the above sentences - if it's stripped out, the comparisons might not make much sense)
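
Since the formatting may not survive, here is a small Python sketch that recomputes a word-level difference between the two sentences above.  It uses the standard library's difflib as a stand-in and is not how OmegaT computes its display; the function name word_diff and the [-removed-] / {+added+} markers are assumptions, and the "same word, different neighbors" marking is not reproduced.

```python
import difflib

def word_diff(old, new):
    """Mark words only in `old` as [-word-] and words only in `new` as {+word+}."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        else:
            # 'replace', 'delete' and 'insert' all reduce to removed + added words.
            out.extend(f"[-{w}-]" for w in a[i1:i2])
            out.extend(f"{{+{w}+}}" for w in b[j1:j2])
    return " ".join(out)

first = ("If we are careful enough about what our prerequisites really are "
         "and what may be improved later, and if we keep a modular approach, "
         "I believe we can't make blatant mistakes.")
second = ("If we really are careful enough about what our prerequisites are, "
          "and if we keep a modular approach, I believe we can't make mistakes.")
print(word_diff(first, second))
```

Because the split is on whitespace only, punctuation stays attached to words, so "are" and "are," count as different words in this sketch.
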

Two sentences sharing identical words but in different orders would likely mean very different things in English (and most other languages as well, except Latin) and would have very low matching scores.

If our blatant approach and prerequisites are improved enough, and we can't believe what careful modular mistakes we keep, we really may ...
(I've run out of gas on this ridiculous sentence built out of the same words - I hope you get the point though)

Not science, magic or even high-tech.  It does seem to work reasonably well, however, and because all processing is done when the project opens, the information is immediately available to the translator when a segment comes up for translation - the delay should be 'constant' (and less than half a second) no matter how large the TM may be.

This "fuzzy matching" solution doesn't lend itself very well to environments where the user is able to move the segment markers - it can be accommodated (I have a technique for it), but it will give the developer a serious migraine while trying to make it possible.

BTW - I'm not subscribed to the list so if someone wishes to contact me, please do so directly.

Keith
