freecats-dev

[Freecats-Dev] Bilingual Document Working Format


From: Henri Chorand
Subject: [Freecats-Dev] Bilingual Document Working Format
Date: Thu, 06 Feb 2003 01:24:13 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi all,

A major piece of information which we must quickly provide to the Open Office team is our own bilingual document working format (BDWF).

They need it in order to help us design and code the conversion filters which will enable Free CATS translation clients to convert files from any of the source formats we target into BDWF (before translation) and back again (after translation).

I talked about it with Thierry Sourbier, an XML expert who kindly helps us on this project (see his most valuable Web site, at http://www.i18ngurus.com), and he suggested looking at TMX.

TMX was not originally designed to answer this need, but it does indeed look like a nice candidate - it does roughly what we need, that is, sequentially storing nicely encapsulated translation units (TU). See http://www.lisa.org/tmx/tmx.htm.

We have not yet taken the time to examine its specification in detail, but doing so is currently our highest priority. Here is what a TU looks like in a TMX file (a Unicode text file):

<tu creationdate="20021125T130103Z" creationid="ABC"
    changedate="20021203T115845Z" changeid="DEF">
  <tuv lang="EN-US">
    <seg>Click OK when prompted to confirm the deletion.</seg>
  </tuv>
  <tuv lang="FR-FR">
    <seg>Cliquez sur OK pour confirmer la suppression.</seg>
  </tuv>
</tu>

As their names indicate, creationdate and changedate are timestamps that store a TU's creation and last update date and time, respectively. Similarly, creationid and changeid store the ID of the user who created the TU and of the user who last updated it. The two lang attributes store the language and country. For these fields, we have decided to follow ISO 639-2 (3-letter terminology language code) and ISO 3166 (3-letter country code); note that the sample above still uses two-letter codes.
See:
http://www.loc.gov/standards/iso639-2/englangn.html
http://www.davros.org/misc/iso3166.html#existing
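
As a rough illustration of the attributes described above, here is a minimal Python sketch that reads the sample TU with the standard library. The function names (parse_timestamp, read_tu) and the dictionary layout are assumptions made for this example only, not part of TMX or BDWF.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The sample TU from this message, unchanged apart from whitespace.
TU_XML = """<tu creationdate="20021125T130103Z" creationid="ABC"
    changedate="20021203T115845Z" changeid="DEF">
  <tuv lang="EN-US"><seg>Click OK when prompted to confirm the deletion.</seg></tuv>
  <tuv lang="FR-FR"><seg>Cliquez sur OK pour confirmer la suppression.</seg></tuv>
</tu>"""


def parse_timestamp(value):
    # Timestamps in the sample look like 20021125T130103Z (UTC).
    return datetime.strptime(value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def read_tu(xml_text):
    # Hypothetical helper: returns the TU as a plain dictionary.
    tu = ET.fromstring(xml_text)
    segments = {tuv.get("lang"): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    return {
        "created": parse_timestamp(tu.get("creationdate")),
        "created_by": tu.get("creationid"),
        "changed": parse_timestamp(tu.get("changedate")),
        "changed_by": tu.get("changeid"),
        "segments": segments,
    }


tu = read_tu(TU_XML)
print(tu["created_by"], tu["changed"].isoformat(), sorted(tu["segments"]))
```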

Apart from all these, we'll add a few simple fields (their default value being empty), such as:
TU category (string value)
TU sub-project (string value)
TU status (integer value)
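
One possible way to carry these extra fields, sketched in Python, is to store each one in a TMX prop element placed before the tuv elements. This is only an assumption about how BDWF might do it: the property names x-category, x-subproject and x-status and the helper add_extra_fields are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical property names; the message only lists the fields themselves.
EXTRA_FIELDS = ("x-category", "x-subproject", "x-status")


def add_extra_fields(tu, category="", subproject="", status=None):
    # Every field defaults to an empty value, as proposed in the message.
    values = (category, subproject, "" if status is None else str(status))
    for index, (name, value) in enumerate(zip(EXTRA_FIELDS, values)):
        prop = ET.Element("prop", {"type": name})
        prop.text = value
        tu.insert(index, prop)  # keep the props ahead of the tuv elements
    return tu


tu = ET.fromstring('<tu><tuv lang="EN-US"><seg>Click OK.</seg></tuv></tu>')
add_extra_fields(tu, category="UI", status=1)
print(ET.tostring(tu, encoding="unicode"))
```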

Note that the TMX file format also includes a simple header. This is where some work might remain - one of the header's fields, datatype, seems to refer to various TMX flavours.


Now, we can also add the following about the contents of the seg fields.

To keep the file parsable as XML, any "<" or ">" character that is part of the text content must be converted into its XML equivalent ("&lt;" or "&gt;") before being inserted as seg data (source or target). The same goes for "&", which becomes "&amp;". This applies only to text content, not to characters used in a tag.
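
The escaping step can be sketched with Python's standard library as follows. The sample sentence is made up for the illustration; escape and unescape are the existing xml.sax.saxutils functions, which also handle "&".

```python
from xml.sax.saxutils import escape, unescape

# Made-up source text containing characters that are special in XML.
source_text = 'If a < b, type "<Enter>" & continue.'

# escape() converts "&", "<" and ">" so the text can be stored as seg data.
seg_data = escape(source_text)
print(seg_data)

# unescape() reverses the conversion when exporting back to the native format.
restored = unescape(seg_data)
print(restored == source_text)
```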

We plan to include all formatting tags (which we translators call internal tags) within TUs, but structure tags, called external tags, must remain outside TUs. Here are a few questions which we hope the Open Office team can quickly answer.


The idea is, when converting a file from any of the supported "rich text" formats, to map its formatting information (whatever it looks like) onto BDWF's own. We must therefore select a standard which is sophisticated enough to take the most common layout attributes into account. Our present idea is to select HTML 4.01 for this purpose. For instance, when a word in bold characters is found in an RTF file, somewhere in a sentence, we will represent it between <B> and </B> in the seg data of our BDWF TU.
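
To make the mapping idea concrete, here is a small Python sketch. It assumes, purely for illustration, that an import filter has already split a sentence into (text, styles) runs; the STYLE_TO_TAG table and the runs_to_seg function are hypothetical and not part of any existing filter.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from a filter's style flags to HTML 4.01 inline tags.
STYLE_TO_TAG = {"bold": "B", "italic": "I", "underline": "U"}


def runs_to_seg(runs):
    """Turn (text, set_of_styles) runs into seg data with internal tags."""
    parts = []
    for text, styles in runs:
        tags = [STYLE_TO_TAG[style] for style in sorted(styles)]
        opening = "".join(f"<{tag}>" for tag in tags)
        closing = "".join(f"</{tag}>" for tag in reversed(tags))
        # Text content is escaped; the formatting tags themselves are not.
        parts.append(opening + escape(text) + closing)
    return "".join(parts)


runs = [("Click ", set()), ("OK", {"bold"}), (" when x < 3.", set())]
seg = runs_to_seg(runs)
print(seg)
```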

This approach is not perfect. Sooner or later, we may encounter weird style attributes which can't be easily mapped to HTML formatting tags. For these, I suggest we convert them into something which looks more or less like an existing HTML tag, and add another, BDWF-specific tag which contains the true (native) format info, so that we can restore it when converting BDWF back into the native format.
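
A minimal Python sketch of this fallback, under stated assumptions: the tag name BDWF-NATIVE, its code attribute, and both helper functions are invented here, and the RTF small-caps control word is used only as an example of a native style with no direct HTML tag.

```python
import re
from xml.sax.saxutils import escape, unescape


def wrap_unmapped(text, nearest_tag, native_code):
    # <BDWF-NATIVE> is a made-up tag name for this sketch.
    code = escape(native_code, {'"': "&quot;"})
    return (f'<BDWF-NATIVE code="{code}"><{nearest_tag}>'
            f"{escape(text)}</{nearest_tag}></BDWF-NATIVE>")


NATIVE_RE = re.compile(
    r'<BDWF-NATIVE code="([^"]*)"><(\w+)>(.*?)</\2></BDWF-NATIVE>'
)


def restore_native(seg):
    """Return (native_code, text) pairs for an export filter to re-apply."""
    return [
        (unescape(code, {"&quot;": '"'}), unescape(text))
        for code, _tag, text in NATIVE_RE.findall(seg)
    ]


# Small caps has no HTML 4.01 tag, so SMALL stands in as the nearest one.
seg = wrap_unmapped("Note", "SMALL", r"\scaps")
print(seg)
print(restore_native(seg))
```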

I expect that this issue will arise with proprietary desktop publishing (DTP) file formats (like XPress, Framemaker or Pagemaker). In today's professional translation world, translators almost never work directly with these file formats (and as professional translators, we do NOT recommend it). Their content is exported as RTF by custom filters that add specific formatting tags where appropriate. So, I suggest we leave proprietary DTP support for later on; we only need to ensure we'll be able to handle these formats, one way or another.

Note that when the same style (as in a CSS) is applied to a whole sentence or paragraph, we should be able to keep this formatting info outside of the TU - this is what Trados does in the TMX files it generates and it seems to work quite well.

Knowing how the Open Office team decided to take these issues into account will certainly help us in refining this draft of BDWF. In a way, they had to solve this issue long before us.

So, please, tell us if you think the above is flawed, and tell us in more detail which "family" of formatting tags you use within Open Office's sxw file format, and into which you know how to convert the large set of OO-supported document file formats. We thought about HTML because it's a well-stabilized, open and very common file format. You may also suggest alternative solutions, maybe based on OO file formats, LaTeX or whatever.

And, once the remaining issues detailed above are solved, I hope you will agree that Free CATS represents another, probably unplanned, way to reuse the very powerful input & output conversion filters which enable OO to read so many file formats (and which help make it such an attractive alternative to the proprietary file formats into which some people would like to lock all of us).

This is a very important issue for translators. In order to reach its full potential, Free CATS needs to be able to deal with all these file formats. Only then will translators be able to switch "en masse" from proprietary CAT software. Did we tell you that Mikro$oft partially owns Trados, the "historical" leader of proprietary computer-aided translation software?


Regards,

Henri




