[Freecats-Dev] Bilingual Document Working Format
From: Henri Chorand
Subject: [Freecats-Dev] Bilingual Document Working Format
Date: Thu, 06 Feb 2003 01:24:13 +0100
Hi all,
A major piece of information which we must quickly provide to the Open
Office team is our own bilingual document working format (BDWF).
They need it in order to help us design and code the conversion
filters which will enable Free CATS translation clients to convert files
from any of the source formats we target into BDWF (before
translation) and back again (after translation).
I talked about it with Thierry Sourbier, an XML expert who kindly helps
us on this project (see his most valuable Web site, at
http://www.i18ngurus.com), and he suggested looking at TMX.
TMX was not originally designed to answer this need, but it does indeed
look like a nice candidate - it does roughly what we need, that is,
sequentially storing nicely encapsulated translation units (TU).
See http://www.lisa.org/tmx/tmx.htm.
We have not yet taken the time to examine its specification in detail,
but doing so is currently our highest priority. Here is what a TU looks
like in a TMX file (a Unicode text file):
<tu creationdate="20021125T130103Z" creationid="ABC"
    changedate="20021203T115845Z" changeid="DEF">
  <tuv lang="EN-US">
    <seg>Click OK when prompted to confirm the deletion.</seg>
  </tuv>
  <tuv lang="FR-FR">
    <seg>Cliquez sur OK pour confirmer la suppression.</seg>
  </tuv>
</tu>
As indicated by their names, creationdate and changedate are timestamps
that respectively store a TU's creation and last update date & time.
Similarly, creationid and changeid respectively store the user ID of
whoever created and last updated the TU.
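
To make the above concrete, here is a minimal Python sketch that reads
the sample TU and decodes its four attributes. The names read_tu and
parse_timestamp are mine, not part of TMX or of any Free CATS code, and
I assume the timestamps are always UTC in the compact form shown above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

TU_XML = """<tu creationdate="20021125T130103Z" creationid="ABC"
    changedate="20021203T115845Z" changeid="DEF">
  <tuv lang="EN-US"><seg>Click OK when prompted to confirm the deletion.</seg></tuv>
  <tuv lang="FR-FR"><seg>Cliquez sur OK pour confirmer la suppression.</seg></tuv>
</tu>"""


def parse_timestamp(value):
    # Assumption: timestamps use the compact form YYYYMMDDThhmmssZ in UTC.
    parsed = datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def read_tu(xml_text):
    """Return the TU's metadata and its segments keyed by language."""
    tu = ET.fromstring(xml_text)
    return {
        "created": parse_timestamp(tu.get("creationdate")),
        "created_by": tu.get("creationid"),
        "changed": parse_timestamp(tu.get("changedate")),
        "changed_by": tu.get("changeid"),
        # findtext() only returns plain text; a seg holding inline
        # formatting tags would need a fuller traversal.
        "segments": {tuv.get("lang"): tuv.findtext("seg")
                     for tuv in tu.findall("tuv")},
    }


unit = read_tu(TU_XML)
print(unit["created_by"], unit["changed"].isoformat())
```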
The two lang attributes store the language and country. For these
fields, we have decided to follow ISO 639-2 (3-letter terminology
language code) and ISO 3166 (3-letter country code). Note that the
sample TU above still shows two-letter codes.
See:
http://www.loc.gov/standards/iso639-2/englangn.html
http://www.davros.org/misc/iso3166.html#existing
Apart from all these, we'll add a few simple fields (their default value
being empty), such as:
TU category (string value)
TU sub-project (string value)
TU status (integer value)
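
How these extra fields will be stored is not decided yet. One possible
carrier is TMX's generic <prop> element, whose user-defined types are
conventionally prefixed with "x-". The sketch below assumes that choice;
the property names x-category, x-sub-project and x-status, and the
function make_tu, are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_tu(source, target, category="", sub_project="", status=None):
    """Build a TU whose extra fields default to empty values."""
    tu = ET.Element("tu")
    # Hypothetical property names; nothing here is fixed by BDWF yet.
    fields = (
        ("x-category", category),
        ("x-sub-project", sub_project),
        ("x-status", "" if status is None else str(status)),
    )
    for name, value in fields:
        prop = ET.SubElement(tu, "prop", type=name)
        prop.text = value
    # source and target are (language code, text) pairs.
    for lang, text in (source, target):
        tuv = ET.SubElement(tu, "tuv", lang=lang)
        ET.SubElement(tuv, "seg").text = text
    return tu


tu = make_tu(("EN-US", "Click OK."), ("FR-FR", "Cliquez sur OK."),
             category="UI", status=1)
xml_text = ET.tostring(tu, encoding="unicode")
print(xml_text)
```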
Note that the TMX file format also includes a simple header. This is
where some work might remain - one of the header's fields, datatype,
seems to refer to various TMX flavours.
Now, we can also add the following about the contents of the seg fields.
In order to enable parsing, as in XML, before being inserted as seg data
(source or target), any "<", ">" or "&" character is to be converted
into its XML equivalent (&lt;, &gt; or &amp;). This applies only when
the character is part of the text contents, not when it's used for a tag.
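
As an illustration only, Python's standard library already performs this
conversion: xml.sax.saxutils.escape replaces "<" and ">" and also "&",
which XML likewise requires to be escaped in text. The sample sentence
is invented.

```python
from xml.sax.saxutils import escape, unescape

# Invented sample text containing characters that would confuse a parser.
raw = "If a < b && b > c, click <OK>."

# Before translation: make the text safe to store as seg data.
escaped = escape(raw)

# After translation: restore the original characters.
restored = unescape(escaped)

print(escaped)
```

Only text contents go through escape(); the formatting tags themselves
are written as real markup.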
We plan to include all formatting tags (which we translators call
internal tags) within TUs, but structure tags, called external tags,
must remain outside TUs. Here are a few questions which we hope the
Open Office team can quickly answer.
The idea is, when converting a file from any of the supported "rich
text" formats, to map its formatting information (whatever it looks
like) onto BDWF's own. We must therefore select a standard which is
sophisticated enough to take the most common layout attributes into
account. Our present idea is to select HTML 4.01 for this purpose. For
instance, when a word in bold characters is found in an RTF file,
somewhere in a sentence, we will represent it between <B> and </B> in
the seg data of our BDWF TU.
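
A rough sketch of that mapping, assuming an import filter has already
split the sentence into (text, styles) runs. The run structure, the
function runs_to_seg and every table entry other than bold are
assumptions of mine.

```python
from xml.sax.saxutils import escape

# Hypothetical style-to-tag table; only the bold entry comes from the
# example in the text above.
STYLE_TAGS = {"bold": "B", "italic": "I", "underline": "U"}


def runs_to_seg(runs):
    """Render (text, styles) runs as seg data with HTML-style inline tags."""
    parts = []
    for text, styles in runs:
        body = escape(text)  # text contents are escaped, tags are not
        for style in styles:
            tag = STYLE_TAGS[style]
            body = f"<{tag}>{body}</{tag}>"
        parts.append(body)
    return "<seg>" + "".join(parts) + "</seg>"


seg = runs_to_seg([("Click ", []), ("OK", ["bold"]), (" to confirm.", [])])
print(seg)  # <seg>Click <B>OK</B> to confirm.</seg>
```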
This approach is not perfect. Sooner or later, we may encounter weird
style attributes which can't be easily mapped to HTML formatting tags.
For these, I suggest we convert them into something which looks more or
less like an existing HTML tag, and add yet another, BDWF-specific tag
which contains the true (native) format info, so that we are able to
restore it when converting BDWF back into the native format.
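
Here is one way this fallback could look. The wrapper tag name
bdwf-native, its code attribute and the approximation table are all
invented for illustration; the RTF control word \scaps stands in for
any native format code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

KNOWN = {"bold": "B", "italic": "I"}
# Hypothetical "closest HTML tag" choices for styles HTML lacks.
APPROXIMATIONS = {"small-caps": "B"}


def render_run(text, style, native_code=""):
    """Render one styled run, keeping the native code for unknown styles."""
    body = escape(text)
    if style in KNOWN:
        tag = KNOWN[style]
        return f"<{tag}>{body}</{tag}>"
    tag = APPROXIMATIONS.get(style, "SPAN")
    # <bdwf-native> is an invented tag carrying the true format info.
    return (f"<bdwf-native code={quoteattr(native_code)}>"
            f"<{tag}>{body}</{tag}></bdwf-native>")


def native_code_of(fragment):
    """Recover the native format code when converting back from BDWF."""
    return ET.fromstring(fragment).get("code")


fragment = render_run("Nato", "small-caps", r"\scaps")
print(fragment)
print(native_code_of(fragment))
```

The translator sees something that behaves like ordinary bold text,
while the export filter can still read the native code back.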
I expect that this issue will arise with proprietary desktop publishing
(DTP) file formats (like XPress, Framemaker or Pagemaker). In today's
professional translation world, translators hardly ever work directly
with these file formats (and as professional translators, we do NOT
recommend it). Their contents are exported as RTF by custom filters that
add specific formatting tags where appropriate. So, I suggest we leave
proprietary DTP support for later on. For now, we only need to ensure
we'll be able to handle these formats, one way or another.
Note that when the same style (as in a CSS) is applied to a whole
sentence or paragraph, we should be able to keep this formatting info
outside of the TU - this is what Trados does in the TMX files it
generates and it seems to work quite well.
Knowing how the Open Office team decided to take these issues into
account will certainly help us refine this draft of BDWF. In a way, they
had to solve this issue long before us.
So, please, tell us if you think the above is flawed, and tell us in
more detail which "family" of formatting tags you use within Open
Office's sxw file format, into which you know how to convert the large
set of OO-supported document file formats. We thought about HTML because
it's a well-stabilized, open and very common file format. You may also
suggest alternative solutions, maybe based on OO file formats, LaTeX or
whatever.
And, once the remaining issues detailed above are solved, I hope you
will agree that Free CATS represents another, probably unplanned, way to
reuse the very powerful input & output conversion filters which enable
OO to read so many file formats (and which help make it such an
attractive alternative to the proprietary file formats into which some
people would like to lock all of us).
This is a very important issue for translators. In order to reach its
full potential, Free CATS needs to be able to deal with all these file
formats. Only then will translators be able to switch "en masse" from
proprietary CAT software. Did we tell you that Mikro$oft partially owns
Trados, "historical" leader of proprietary computer-aided translation
software?
Regards,
Henri