freecats-dev

Re: [Freecats-Dev] Source segments indexing method


From: Henri Chorand
Subject: Re: [Freecats-Dev] Source segments indexing method
Date: Sun, 26 Jan 2003 17:08:51 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Charles Stewart wrote:
> A general question: how ambitious is the CATS going to be?

As we don't have many developers yet, I believe we can agree on first building a working prototype with a bare-bones server and translation client, just to demonstrate we can make it and to attract more people.

Even if, tomorrow, IBM or somebody else knocks at the door to offer "unlimited" resources, I still think we must keep one of our project's distinguishing features (starting from user requirements) as an asset, and carefully provide specifications centered on CAT so as to address translators' needs.

I sometimes tend to consider natural language processing a potential nest of vaporware. I mean, I started Free CATS because I felt it was urgent to stop Mikro$oft-owned Trados from ruling the CAT world; otherwise, I would never have dared to start this project and might be doing less ambitious things to help free software - like translating interesting software to help it spread more. If only more end-user tools were already available, we would not be trying to make one ;-)

> Are we going to model the hierarchical structure of language,
> or do we think this task is too hard?
> If we don't, how are we going to spot common idioms such
> as "neither... nor..."?

This is what I called "semantic level processing" in my previous post. I really have nothing against it, but from my limited working experience in this field, I know it's even more ambitious than CAT.

When I worked as documentation manager for the publisher of the SPIRIT software, we used to call such words (articles, adverbs, auxiliary verbs, etc.) "tool words" ("mots outils"), and they were not indexed.

This distinction is much less relevant for CAT, as the translator uses CAT to translate similar sentences consistently. Imagine several similar short sentences where only a key term differs (for instance "user code", "client code", "supplier code", etc.):
The user code field is now highlighted.
The client code field is now highlighted.
(...)
If we compare these two sentences, 6 out of 7 words are identical and the sequence of identical words is the same between them, so the fuzzy rate is going to be quite high (6/7*100, around 86% to simplify).

If we remove the N-gram index entries which correspond to these tool words, the translator can't expect to retrieve fuzzy matches as well:

The user code field is now highlighted.
The client code field is now highlighted.

If we erase the tool words, the two sentences become:

user code field highlighted.
client code field highlighted.
The fuzzy rate is now lower (3/4*100, around 75% to simplify).

Sorry for being long; I wanted to be clear for non-coders.
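For the coders, here is a minimal Python sketch of the naive word-overlap rate used in the figures above. The tool-word list and the scoring formula are deliberate simplifications for illustration, not the actual Free CATS matching algorithm (which remains to be specified):

    # Illustrative only: a naive word-overlap "fuzzy rate".
    TOOL_WORDS = {"the", "is", "now"}   # hypothetical "mots outils" list

    def fuzzy_rate(source, candidate, drop_tool_words=False):
        """Percentage of words of `source` that also occur in `candidate`."""
        src = source.lower().rstrip(".").split()
        cand = candidate.lower().rstrip(".").split()
        if drop_tool_words:
            src = [w for w in src if w not in TOOL_WORDS]
            cand = [w for w in cand if w not in TOOL_WORDS]
        common = sum(1 for w in src if w in cand)
        return 100.0 * common / len(src)

    a = "The user code field is now highlighted."
    b = "The client code field is now highlighted."
    print(fuzzy_rate(a, b))                        # ~85.7 (6 words out of 7)
    print(fuzzy_rate(a, b, drop_tool_words=True))  # 75.0  (3 words out of 4)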

I'll just restrict my comment to the Source segments indexing (SSI)
method at the moment.  I've attached some excerpts from earlier emails
to the end of this message:

        #1. What components of FreeCATS make use of the SSI method?

In my draft indexing specs document, only the server.

        Is it only to be used in building the corpus for the
        Translation Memory server, or do we use it as a preprocessor
        in translating text?

In a classic CAT tool, we don't have such preprocessors.

        #2. I agree with David Welton about starting with European
        languages for now, but I think we should make an effort to
        attract someone who knows Asian character sets.  I don't think
        we should figure this stuff out for ourselves, if none of us
        speaks an Asian language.  We shouldn't wait too long: if we
        work only with Indo-European languages, we might have some
        nasty surprises when we find that Korean, say, violates some
        assumptions we thought applied to all language texts;

True. See the draft document (when available); I think it's OK at least for Chinese, but I'm dubious about Thai and all languages which don't clearly separate words within a sentence. Still, Thai computer users (translators included) may have found another way of dealing with this problem (I mean, how could one even implement a spell checker in Thai?).

I can try to contact two French localization companies specializing in Eastern languages and ask for help, but ideally, we would get more assistance from language scientists. There is a newsgroup in French, fr.sci.linguistique, on which we could post for help, and we might also ask where to post in other languages (I can't find any similar one in English on Usenet, but maybe I don't know where to search).

        #3. Unicode character properties: clearly it is the right thing
        to use these;

Fine, this is one of our non-ambiguous areas.
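Just to illustrate what "using the Unicode character properties" can mean in practice, here is a small Python sketch (not anything from the draft spec): the general category of each character tells us whether it belongs to a word, whatever the script.

    import unicodedata

    def is_word_char(ch):
        """Letters (any script) and combining marks count as word characters."""
        return unicodedata.category(ch)[0] in ("L", "M")

    def split_words(text):
        """Very crude word splitter driven only by Unicode character properties."""
        words, current = [], []
        for ch in text:
            if is_word_char(ch):
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))
        return words

    print(split_words("Le « code utilisateur » est surligné."))
    # ['Le', 'code', 'utilisateur', 'est', 'surligné']

Of course, a splitter like this only helps for languages that do separate their words, which brings us back to the Thai question above.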

        #4. I think it is better to work directly from the source text:

We thought about it at our last Breton meeting in Quimper, and I can't pretend our proposal is a final one.

We rejected it for a variety of reasons:
- As we want the server to only manage a TM (for performance), we prefer converting a source file into our bilingual working format (still to be defined in detail; it will be Unicode-based and will probably look a lot like HTML strings embedded in custom tags, with a little extra info - a purely hypothetical example is sketched after this list).
- We want the translator to retain control of the project files locally and to avoid any Taylorist web-service approach where the translator cannot, for instance, manually process the files one way or another.
- As we want to work with a variety of formats, the only reasonable thing to do is to convert any of these (back and forth) into a (bilingual) working format; otherwise even the server has to learn each of these formats, and we need/want a lot of them: text-only, flat resource files (Windows .RC / .H), WinHelp & HTML Help source files, HTML (even if badly formatted), XML (idem), RTF, OpenOffice, LaTeX... not to mention MS Office or proprietary DTP files.
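To make that idea a little more concrete, here is the purely hypothetical sketch announced above of what one segment pair in such a bilingual working format might look like. The tag and attribute names are invented for illustration only; nothing here is part of an agreed specification:

    <seg id="42" status="translated">
      <source lang="en">The <b>user code</b> field is now highlighted.</source>
      <target lang="fr">Le champ <b>code utilisateur</b> est maintenant surligné.</target>
    </seg>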

(...)
        with case-folded texts, we work, e.g., with case-insensitive matches.

Yes, we'll probably fold letters to lower case, but we have to keep the accents.
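A two-line Python illustration (it presumes nothing about the final implementation, it only shows the behaviour we want): lower-casing for matching purposes leaves the accents untouched.

    # Case folding drops only the case information; accented letters survive.
    print("Surligné".lower())    # -> 'surligné'
    print("Straße".casefold())   # -> 'strasse' (casefold also handles ß)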

     - We will lose potentially valuable information if we
     do things like throw away email headers;

Any info embedded within a tag will be kept in the bilingual working format (source & target segments) but could be dropped in the TM if we choose to use simplified internal tags.

(...)
        #5. N-grams: easy to do this if we represent the lexicon using
        a state-transition diagram or even a recursive descent parser
        (the best are almost as fast as lexing regexps).

I need to become better acquainted with these terms.
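For readers in the same boat: an N-gram is simply a sliding window of N consecutive characters (or words) taken from the text. A minimal Python sketch of extracting character trigrams from a segment, the kind of entries an N-gram index could store (the window size, and whether to index characters or words, are open choices, not part of any agreed spec):

    def char_ngrams(text, n=3):
        """All overlapping substrings of length n ('trigrams' when n is 3)."""
        text = text.lower()
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("user code"))
    # ['use', 'ser', 'er ', 'r c', ' co', 'cod', 'ode']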

        #6. I'm against using fuzzy matching: if we build up a big
        corpus in a language, then we will have almost all actually
        occurring misspellings in that language.  Exact matching is
        much faster than fuzzy matching, and easier to design around.

Well, CAT is also about fuzzy matching :-)

More to the point, building up a large TM from legacy materials always brings up questions about their quality. If, tomorrow, I'm contacted by the project team of a major free software project that is interested in Free CATS (and I hope it will happen), I'll advise them to fully review their legacy translations, especially now that they have a better tool at hand.

In my work at Kemper DOC, major localization agencies often ask us to work on projects in which the legacy TM is half-filled with rubbish, or simply inconsistent with the project terminology, so we try to do a better job - more handicraft than fully automated tool, but in the end the work gets done, and we could not do it as easily without CAT and fuzzy matching. It's a genuine Real Life Situation :-(

In a nutshell, to parody Murphy's Laws: even without garbage in (read: legacy), if the user is not a qualified translator and a donkey takes hold of the handles, garbage out is not far away.

But of course, fuzzy matching ONLY comes into play if we have no exact match to retrieve in the TM - or when we do a context search.

This is also an instance where we can first provide a crude function, to be refined later on.

For instance:
The user code field is now highlighted.

Suppose I want to check how "user code" is translated in the TM (it's not included in the terminology database used, or the project does not use one): I'll highlight this string from within the translation client and press a function key. The translation client will then display, in a separate window, all sentences which contain this string, so that I can provide a consistent translation - even if there is no perfect match, if there is no fuzzy match (or none that reaches my current threshold anyway), or if the only fuzzy matches match some other part of my current segment than the "user code" bit in it.
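A crude first version of that context search is little more than a substring scan over the TM. A minimal Python sketch (the data layout and names are invented for illustration; the real TM will of course live on the server, not in a list):

    def concordance(tm, query):
        """Return every TM entry whose source segment contains `query`."""
        query = query.lower()
        return [(src, tgt) for src, tgt in tm if query in src.lower()]

    tm = [
        ("The user code field is now highlighted.",
         "Le champ code utilisateur est maintenant surligné."),
        ("Enter the user code.", "Saisissez le code utilisateur."),
        ("The client code field is now highlighted.",
         "Le champ code client est maintenant surligné."),
    ]

    for src, tgt in concordance(tm, "user code"):
        print(src, "->", tgt)
    # Prints the two "user code" entries and skips the "client code" one.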


Pardon me for being long and forgive my rants!

Henri




