[Freecats-Dev] RE: First comments from Michael


From:	Kemper DOC (Nerim)
Subject:	[Freecats-Dev] RE: First comments from Michael
Date:	Mon, 7 Jul 2003 15:20:27 +0200
Hi,

To Free CATS Dev List readers: here is my feedback on Michael's first
(quick) comments.

> One quick comment here: if I were doing this myself, I would base
> it on my repository manager without even thinking twice.  You're
> spending a lot of time thinking about Berkeley DB/not DB, when
> you should be thinking *only* on the appropriate conceptual level
> -- i.e. the level of translation.  You don't care where things are
> stored, and if you make a decision like that you'll have no end of
> grief.

Sure. In fact, we do not care at all about the way the Translation Memory
server stores TMs, except that:
- It must be portable (Linux, Mac OS X, Windows, most current Unixes)
- It must allow us to build our custom indexes (used to look for perfect or
fuzzy matches), and these indexes must be fast - well, bloody fast if
possible ;-)

In fact, we believe there will be quite a number of adjustments to the way
these indexes will be generated (based on rules), and we expect that a given
set of rules will suit a number of natural (source) languages but will need
to be adapted for other languages, so modularity is essential here.
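
As a rough illustration of what "modular" could mean here, the sketch below
keeps the indexing rule as a pluggable function, so that one rule can serve
space-separated languages and another can serve languages without word
separators. Everything in it is an assumption made for illustration: the
names (word_rule, trigram_rule, SegmentIndex), the in-memory dictionaries,
the 0.7 threshold, and the use of Python's difflib for scoring. It is not a
description of an existing index design.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Set, Tuple

# A "rule" is any function that turns a source segment into index keys.
Rule = Callable[[str], Set[str]]


def word_rule(segment: str) -> Set[str]:
    """Hypothetical rule for languages that separate words with spaces."""
    return set(re.findall(r"\w+", segment.lower()))


def trigram_rule(segment: str) -> Set[str]:
    """Hypothetical rule for languages without word separators."""
    text = "".join(segment.lower().split())
    if len(text) < 3:
        return {text} if text else set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


@dataclass
class SegmentIndex:
    rule: Rule
    segments: List[str] = field(default_factory=list)
    postings: Dict[str, Set[int]] = field(default_factory=dict)
    exact: Dict[str, int] = field(default_factory=dict)

    def add(self, segment: str) -> None:
        seg_id = len(self.segments)
        self.segments.append(segment)
        self.exact[segment] = seg_id
        for key in self.rule(segment):
            self.postings.setdefault(key, set()).add(seg_id)

    def lookup(self, query: str, threshold: float = 0.7) -> List[Tuple[float, str]]:
        # Perfect match: one dictionary lookup.
        if query in self.exact:
            return [(1.0, query)]
        # Fuzzy match: only score segments sharing at least one index key.
        candidates: Set[int] = set()
        for key in self.rule(query):
            candidates |= self.postings.get(key, set())
        scored = [
            (SequenceMatcher(None, query, self.segments[i]).ratio(), self.segments[i])
            for i in candidates
        ]
        return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

Swapping word_rule for trigram_rule when creating a SegmentIndex changes how
candidates are found without touching the lookup code, which is the kind of
modularity described above.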

The data to be stored (the translation units) in TMs will be made up of
(rather short) variable length strings and some ancillary extra info:
- source segments - Unicode text with pseudo-tags (placeables)
- target segments - same format
- extra info: timestamp & author for creation & last update (plus a couple
of non-urgent items to be detailed later on: status code, project code)
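
To make the list above concrete, here is a minimal sketch of one translation
unit as a Python record. The class and field names are hypothetical, and the
two "non-urgent" items are left as optional placeholders since they have not
been specified yet.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class TranslationUnit:
    source: str                      # Unicode text with pseudo-tags
    target: str                      # same format as the source
    created_by: str
    created_at: datetime = field(default_factory=_now)
    updated_by: Optional[str] = None
    updated_at: Optional[datetime] = None
    status_code: Optional[str] = None    # to be detailed later
    project_code: Optional[str] = None   # to be detailed later

    def update_target(self, new_target: str, author: str) -> None:
        """Record a new target text along with who changed it and when."""
        self.target = new_target
        self.updated_by = author
        self.updated_at = _now()
```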

These pseudo-tags are always formatting (internal) tags. The structure
(external) tags will be stored in the bilingual document files only; they
must not appear within the TUs.

These pseudo-tags are simply custom tags, such as (n being an integer):

<STAND%n>       [any] standalone tag
e.g. an anchor is a standalone tag, but it's not a formatting tag, so it's a
bad example

<BEGIN%n>       [any] beginning tag [that will need to be closed further on]
e.g. bold <B>

</BEGIN%n>      [any] ending tag [that closes a tag opened earlier]
e.g. end of bold </B>

<MIDDLE%n>      [any] "middle" tag [needed somewhere in between a pair of
beginning and ending tags]

Sorry if this seems crude - we only need to mark the location of ANY
(X-HTML-based) tag within the segments.
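
A minimal sketch of how a segment's inline tags might be turned into these
pseudo-tags follows. Several things are assumptions for the sake of the
example: the function name to_pseudo_tags, numbering tags in order of
appearance, giving a closing tag the same number as its opening tag, and the
small VOID_TAGS set of elements treated as standalone. MIDDLE tags are not
handled at all.

```python
import re
from typing import Dict, List, Tuple

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>")
VOID_TAGS = {"br", "img", "hr"}  # assumed list of standalone elements


def to_pseudo_tags(segment: str) -> Tuple[str, Dict[str, str]]:
    """Return the segment with pseudo-tags, plus a pseudo-tag -> original map."""
    originals: Dict[str, str] = {}
    open_stack: List[Tuple[str, int]] = []  # (tag name, number) of open tags
    counter = 0

    def replace(match: "re.Match[str]") -> str:
        nonlocal counter
        closing, name, _attrs, self_closing = match.groups()
        name = name.lower()
        if closing:
            # Reuse the number of the most recent matching opening tag.
            for i in range(len(open_stack) - 1, -1, -1):
                if open_stack[i][0] == name:
                    _, n = open_stack.pop(i)
                    pseudo = f"</BEGIN%{n}>"
                    break
            else:
                # Opening tag lies outside this segment: treat as standalone.
                counter += 1
                pseudo = f"<STAND%{counter}>"
        elif self_closing or name in VOID_TAGS:
            counter += 1
            pseudo = f"<STAND%{counter}>"
        else:
            counter += 1
            open_stack.append((name, counter))
            pseudo = f"<BEGIN%{counter}>"
        originals[pseudo] = match.group(0)
        return pseudo

    return TAG_RE.sub(replace, segment), originals
```

With this sketch, "Click <b>OK</b> now<br/>" becomes
"Click <BEGIN%1>OK</BEGIN%1> now<STAND%2>", and the returned map remembers
which real tag each pseudo-tag stands for.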

The idea behind this is that, at the translation stage, we replace the
placeable tags found in the TM's TU with the ones found in the document
contents.
This helps by:
- simplifying the storage of tags within TMs (instead of storing the X-HTML
tags "as is")
- "matching" similar, but different, formatting (e.g. one segment comes with
a bold highlight in it, if it comes back with a font change highlight, we're
still fine).
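
The substitution step could look roughly like the sketch below. It assumes
the document side already provides a map from each pseudo-tag to the real
tag it replaced in the current segment; the function name restore_tags and
the choice to leave unknown placeables untouched (so the translator can see
them) are assumptions of this example only.

```python
import re
from typing import Dict

PSEUDO_RE = re.compile(r"</?(?:STAND|BEGIN|MIDDLE)%\d+>")


def restore_tags(tm_target: str, document_tags: Dict[str, str]) -> str:
    """Replace each placeable in a TM target with the document's real tag."""
    def replace(match: "re.Match[str]") -> str:
        pseudo = match.group(0)
        # Unknown placeables are kept as they are rather than dropped.
        return document_tags.get(pseudo, pseudo)

    return PSEUDO_RE.sub(replace, tm_target)


# The same TM target serves a bold highlight and a font-change highlight.
tm_target = "Cliquez sur <BEGIN%1>OK</BEGIN%1>."
bold_doc = {"<BEGIN%1>": "<b>", "</BEGIN%1>": "</b>"}
font_doc = {"<BEGIN%1>": '<font color="red">', "</BEGIN%1>": "</font>"}

print(restore_tags(tm_target, bold_doc))
print(restore_tags(tm_target, font_doc))
```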


> If this is based on the repository manager component of my
> workflow toolkit, you get a lot of benefits: you can use the
> same code whether you're talking to a local (offline) translation
> memory or a remote server, you can store your data in
> whatever database is available (via the adaptor concept,
> databases can be added at any time), and so on.
>  Thus when somebody comes to you and says, why can't I
> store my stuff in MySQL, you say: of course you can.  And of
> course you scale well into organizational complexity by having
> workflow available to moderate system actions.

If the above constraints are well taken into account, you've got our
blessing, of course.


At this stage, I'd VERY MUCH like experts like Yves Champollion and Yves
Savourel (on the list) to give us some feedback about indexing. If I
remember correctly, Yves C. briefly stated that the above is more or less
how he stores segments in his Wordfast TMs.


> The repmgr works well with Python, so you still get to leverage
> all the stuff you can do in Python.  I think this is very much the
> way to go, personally.  For a little more info on the repmgr, see
> http://www.vivtek.com/wftk/doc/repmgr/index.html -- gah.

You most probably mean:
http://www.vivtek.com/wftk/doc/index.html

> Now that I look at that page, I realize I really need to update it;
> it's much more focused on the content-management and
> publishing aspects of the module (which were important to me
> in 2001 when I wrote that) and less so on the data-access
> aspects (which have gotten a lot more attention recently).
> Anyway, that's the way I'd go, for sure.

Why not? Anyway, it seems easier and smarter to start from (any) suitable
framework than to reinvent the wheel all over again. It's just that we
non-developers are not really in a position to make the right choice.

> Your XML-native translation memory server would thus be a
> repository system, whether local or remote (and yes, the repmgr
> is XML-based).  I'd recommend a SOAP interface for the clients
> to use, and I see no reason not to provide both wxPython-
> based fat clients and Web-based client logic in Zope (or some
> other Web server).

Here again, whatever already exists and has the requested features (free
software, portability, modularity for custom index design... and speed) is
absolutely fine.

> The analysis and alignment "clients" intrigue me.

The analysis client should be rather straightforward, once:
- we have the TM server up and running
- we have designed our bilingual document format
- we know how to convert format X into this bilingual document format (I
believe everybody will see that this involves a default segmentation)

The alignment client is mainly a nice interface for building up an
import file (TMX format) from a source file and its translation. This
depends on what is performed manually...
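
For reference, here is a minimal sketch of writing already-aligned segment
pairs to a TMX file with Python's standard library. The function name
build_tmx and all header values are placeholders chosen for the example; the
header attributes are, to the best of my knowledge, the ones TMX 1.4
requires. Segments are treated as plain text, so mapping the pseudo-tags to
TMX inline elements is left out.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang


def build_tmx(pairs: Iterable[Tuple[str, str]],
              src_lang: str, tgt_lang: str) -> ET.ElementTree:
    """Build a TMX tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",   # placeholder value
        "creationtoolversion": "0.1",         # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",                # placeholder value
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


tree = build_tmx([("Hello", "Bonjour")], "en", "fr")
print(ET.tostring(tree.getroot(), encoding="unicode"))
```

Calling tree.write("aligned.tmx", encoding="utf-8", xml_declaration=True)
would save the result to a file (the file name is only an example).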

In fact, Word and Excel power users like me already know rather well how to
build one, except possibly for some stupid details like header parameters.
So, this one can wait.


>  I can see that being rather attractive as an aspect of document
> management -- and server-based.  So I upload some documents
> to a server and have them analyzed at leisure; indexing is often
> managed in the same way.

For a number of theological (well, non-technical) reasons, we believe in:
- TMs stored on a TM server
- documents stored on a local/network filesystem.

> Well, anyway, obviously I need to absorb some of the more
> detailed aspects of your specs so far.  You'll hear more from me.

Fine.

> Oh -- another thing -- I'd really like to explore integrating these
> translation clients with OpenOffice, which I've been using lately.
> It drives me crazy not to be able to see formatted text as I'm
> working, something I don't like much about using SDLX.

Sure. Pseudo-WYSIWYG should be more than enough, in fact.

Much of what we do or don't do with OO (i.e. integrating a translation
client within OO) will depend on the degree of interest and help that
we receive from them.

If we are to build our interactive translation editor more or less from
scratch, we should be able to use some free browser software for that, on
top of some DéjàVu / SDLX (probably) / Foreign Desk / Catalyst etc. approach
(starting from Scintilla or similar stuff for an alpha version).

If enough developers come to help, hopefully, we'll have both.


Cheers,

Henri