
Re: [Freecats-Dev] Radio silence


From: Yves Champollion
Subject: Re: [Freecats-Dev] Radio silence
Date: Thu, 8 May 2003 21:54:51 +0200

I reprint a address@hidden post that, I believe, has some
relevance here:

> From a certain "Samuel":
> I've asked a similar question before.  OpenOffice boasts a powerful macro
> language, they say, and I've managed to import one or two MS Word macros
> into it.  But Wordfast is not one of them.  How about it, Yves?  Even if you
> could give us an early, outdated version of Wordfast (presumably written in
> WordBasic), exported to OpenOffice's macro language, it would be a great
> boon.  At the moment only OmegaT works with OpenOffice, and OmegaT is a
> loooooooong way from completion.

Yves Champollion:
There are two essential parts in any word-processor-based translation tool:
1. the segmenter 2. the translation memory engine proper.

(tools that are not word-processor-based like DéjàVu or Transit have two
major parts: 1. filters that import/export data without hurting format - a
formidable challenge. 2. the TM engine proper).

Part 2 is not difficult. It's charted ground. Algorithms already exist,
they're taught in computing courses, used and re-used in various search
engines and database systems. They need minor tweaking to be put to use for TM.
A computing graduate could write & debug a decent TM engine in a couple of
months.
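
To illustrate the kind of well-known algorithm meant here, below is a minimal sketch of a fuzzy TM lookup. It is not Wordfast's engine: it borrows the generic similarity ratio from Python's standard difflib module, and the names best_match, memory and the 0.75 threshold are assumptions made up for the example.

```python
from difflib import SequenceMatcher

def best_match(segment, memory, threshold=0.75):
    """Return (score, source, target) for the closest stored source, or None.

    `memory` maps source segments to their stored translations.
    The threshold is an arbitrary illustrative value, not a standard.
    """
    best = None
    for source, target in memory.items():
        # ratio() is a generic similarity score in [0, 1], case-folded here
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

# Hypothetical two-entry translation memory
memory = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "Close the file.": "Fermez le fichier.",
}

hit = best_match("Click the Save button again.", memory)
miss = best_match("Completely unrelated text", memory)
```

A real engine would index the memory rather than scan every entry, but the matching idea is the same: score the new segment against stored sources and propose the translation of the closest one.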

Part 1 is a real challenge. Craftsmanship, in a totally unknown form of art.
It takes nearly paranoid obstinacy. A TM tool needs to work across
languages for obvious reasons. Plus, every word processor has its own logic
(worse, every version of the same word processor has its own logic. A
closing quote in Word/Mac could be an accented character on a PC.
Unfortunately, whether you have a closing quote or a regular letter makes a
great difference when doing segmentation or linguistic recognition. A letter
is *within* text, a quote *delimits* text. Then you move into Cyrillic or
Greek or Chinese or Hebrew or Polish and it's a nightmare; add Unicode
and this could drive you crazy. The older Chinese double-byte systems could
send you to the madhouse).
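
The quote-versus-letter problem can be shown with a single byte. The sketch below assumes the Mac side uses the Mac Roman code page and the PC side uses Windows-1252 (the post does not name the encodings); the helper classify is a made-up name for the example.

```python
import unicodedata

def classify(raw, codec):
    """Decode one byte with the given codec and say what role it plays."""
    ch = raw.decode(codec)
    # Unicode general categories starting with "P" are punctuation,
    # those starting with "L" are letters.
    category = unicodedata.category(ch)
    role = "delimiter" if category.startswith("P") else "letter"
    return ch, role

raw = bytes([0xD3])  # the same byte, read on two platforms

mac = classify(raw, "mac_roman")  # right double quotation mark
pc = classify(raw, "cp1252")      # capital O with acute accent

for name, (ch, role) in (("Mac Roman", mac), ("Windows-1252", pc)):
    print(f"{name}: U+{ord(ch):04X} is a {role}")
```

A segmenter that guesses the wrong code page would treat a closing quote as part of a word, or the other way round.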

So to write a segmenter that sits atop a particular word processor, and make
this segmenter reliable (what about tagged documents? tables? fields?
graphics? hidden text? 10-megabyte documents?) is a real challenge. The
only way to do it is endless trial and error with partners in every
language and on every platform, patient enough to provide feedback. It takes
years of daily trial and error to produce a decent segmenter. At times there
were so many patches added to my initial code that I had to lock myself away
for a week and rewrite everything - whence the occasional, temporary lapses in
reliability before moving up to a new level. The recent 60+ builds that led
from version ST to version 4.20 are the latest episode.
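
To give a feel for why patches pile up, here is a deliberately naive sentence segmenter, working on plain text only. It is an illustration, not Wordfast's segmenter: the rule (split after ., ! or ? followed by whitespace) and the ABBREVIATIONS list are assumptions invented for the example, and every item in that list is already a patch of the kind described above.

```python
import re

# Hypothetical exception list; a real one is language-specific and far longer
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def segment(text):
    """Split plain text into segments at sentence-ending punctuation."""
    segments, start = [], 0
    # Candidate break: terminal punctuation followed by whitespace or end of text
    for match in re.finditer(r"[.!?]+(?=\s+|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)  # trailing text without a terminator
    return segments

result = segment("Dr. Smith opened the file. It was empty!")
```

Nothing here handles tags, tables, fields, hidden text or other languages' punctuation, which is where the years of trial and error go.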

Porting WF to OpenOffice essentially means re-writing the segmenter for
another word-processor, understanding how that W-P works, catering to all
possible variations. The translation community that relies on OO is so small
I would never get the feedback I got with Word. Yet I am willing to do it,
mainly because I know all the pitfalls and how to avoid them. Two essential
questions remain:

1. Time
2. Does OO's macro language have what it takes? (You may criticize MS-VBA's
so-so reliability, but I have heard hair-raising reports from people who have
seriously tried to write anything with OO's macro language.) But I'm
willing to try, yes.

For the rest, like porting the TM engine to Java so it works alongside OO's
segmenter, no problem. A classical porting question, a mere bucket of sweat,
no more.





