[Freecats-Dev] Re: Trados/other CAT, Python/Java, German/English


From: Henri Chorand
Subject: [Freecats-Dev] Re: Trados/other CAT, Python/Java, German/English
Date: Tue, 25 Feb 2003 01:05:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Dear Charles,

> Congratulations on the excellent work you are doing as project
> leader. FreeCATS has acquired a great deal of momentum in its
> short period of existence, and I think most of the credit for
> that can be put at your door.

Well, thanks... What I have tried to do is mainly information gathering and communication. I had read about this, but it is striking to see how true it proves: so far, I'm amazed at how much of this project is about:
- looking for existing projects
- assessing them (especially compatibility between them and with
  Free CATS's goals)
- trying to make people with very different backgrounds cooperate.

> (...)
> So import and export filters for TMX are a must.  Are there any
> descriptions of this file format available?

As listed in the links on the last page of the Development Roadmap document:
http://www.lisa.org/tmx/

Judging from Yves Champollion's latest feedback, not all CAT tools may handle its latest flavours well, but to date it remains one of the few already implemented open standards that allow true interoperability between existing tools, for instance between Trados and Wordfast.
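
For a rough idea of what a TMX import/export filter involves, here is a minimal Python sketch (Python only because it comes up further down for quick tests; this is not Free CATS or OmegaT code). It assumes the basic TMX layout, in which each "tu" translation unit holds one "tuv" per language with the text in a "seg" element. The function names, the sample sentences and the header values are made up for illustration, and any inline markup inside "seg" is simply flattened to plain text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample; the header values are placeholders.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The cat sits on the mat.</seg></tuv>
      <tuv xml:lang="de"><seg>Die Katze sitzt auf der Matte.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(text):
    """Import: return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() flattens inline elements into plain text.
                unit[lang] = "".join(seg.itertext())
        units.append(unit)
    return units


def write_tmx(units, srclang="en"):
    """Export: serialise a list of {language: text} dicts as a TMX string."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": srclang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Calling read_tmx(SAMPLE_TMX) gives a list with one dictionary per translation unit, and write_tmx() turns such a list back into TMX text. A real filter would also have to preserve inline tags and cope with the differences between TMX versions mentioned above.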

(...)
>>>  - I am more worried about lock-in, especially wrt. the
>>>  Python programming language.  It is an excellent tool
>>>  for doing quick hacks that need OO, but it behaves almost
>>>  completely unlike any other programming language in its
>>>  semantics.  If the reference implementation is Python,
>>>  we will find it difficult to support programmers who loathe
>>>  Python (and they do exist).

>> Of course - that's the trouble with having to choose a language ;-)

> Certainly, but it doesn't follow that all languages are equal.
> The point I wish to make is that Python is something of an outlier
> in the family of languages, and while it is quite intuitive and
> flexible, not all the claims that the Python language enthusiasts
> make for it should be taken at face value.

Sure. Tcl also claims similar advantages, but I ended up feeling it was too lightweight and a bit weird. Perhaps the same drawbacks apply to Python, to a lesser extent.

One of the (now extinct) languages I programmed in the most was Pick's Basic. I remember how handy it was for manipulating strings, but I have nothing against a strongly typed language like C.

> It has one of the strangest approaches to variable extent that I
> have ever seen, one that is often misdescribed as `lexical scoping'
> but would be better described as `a lexical tower of dynamic scopes';
> the approach is also quite expensive in terms of demanding many
> run-time dictionary lookups, and while good results have been
> achieved with a ruthlessly-optimising Python compiler, I think a
> real performance penalty will be paid if we adopt Python as our
> scripting language as opposed to Tcl or Perl (Tcl has an excellent
> compiler, and AFAIK Perl's compiler gets much better results than
> Python's).
>>>  - I am all for a Java implementation.  Java has excellent
>>>  libraries, and many PLs can target the JVM, including
>>>  python, tcl and Scheme.

>> Anyway, I strongly hope the Free CATS team will start work from an existing
>> project (OmegaT is the current favourite), and if that's the case, then
>> we'll simply continue along the same lines, and the language will
>> already have been chosen.

As OmegaT is written in Java, that would settle it and you would be happy with this option ;-)

> Java isn't my favourite language, but agreeing on the Java runtime
> doesn't preclude coding in another language.  An option is to use
> Jython (the Python-on-JVM implementation) to code quick hacks and
> rewrite in Java.  The language I am most productive in, Scheme, has
> an excellent JVM implementation, namely SISC.  I don't know of a good
> Perl implementation on the JVM, but that may just be my ignorance.

Jython would be handy in that we should be able to test a number of things quickly in Python while keeping the bulk of the code in Java. I won't delve further into the more detailed background information you provide, because I'm not qualified to comment on it.

(...)
>>> I would like to code, but I have a rather full timetable over the next
>>> two months and another free software project commitment which has
>>> priority for me at the moment, so it is difficult for me to promise
>>> anything definite in advance.  Also how much contribution I make will
>>> depend on which language is adopted. Despite my reservations above, I
>>> would be willing to code in Python.

(...)
> ...but if *I* am the one to start coding, I will almost certainly code
> in SISC (ie. the scheme-on-JVM I mentioned before), and I will not be
> starting in the next few weeks.

As I see it, none of us should start coding now. My personal time is also limited: running a small, busy company takes at least 50 hours a week, and I also happen to be the happy father of 3 girls aged 11, 7 and 7 (the latter two being twins, as you will have guessed).

The most important task, it seems to me, is to continue assessing other free software projects in order to determine which one we might start from.

If we are careful enough about what our prerequisites really are and what may be improved later, and if we keep a modular approach, I believe we can't make blatant mistakes.

Apart from the technical features, openness of mind and willingness to cooperate in order to achieve the best possible solution are of prime importance, and I consider the kind, high-quality feedback provided by, for instance, Keith Godfrey and Yves Champollion to be very promising. If we can establish true cooperation, then we can be sure we'll make a great product, and it may require only a quite reasonable amount of NEW coding, at least to start with. You'll remember what I said: we only have to begin with what's available and make it work on a modular basis, after which many other volunteers will join much more easily.

(...)
>>>  - I am assembling an argument that we will need to handle
>>>  hierarchical structure to get results with German->English
>>>  translation.  More to follow, not necessarily all that soon.

>> Great - at least somebody kept working today while all of us talked.

>> I'm not sure what you mean by "hierarchical structure" here, but I
>> suppose I'll just have to wait a little and I'll see it. I hope it
>> fits nicely in the picture as a clever indexing feature on top of raw
>> segment storage in a TM.

> Hope to send a message on this later this week.  By hierarchical
> structure I mean the parse trees that linguists represent using
> x-bar grammars, eg.:

                Sentence
                /       \
        Noun phrase     Verb phrase
        /       \       /       \
    Determiner  Noun  Verb      Noun phrase
        |        |      |        /      \
      The       cat    sat      Prep    Noun phrase
                                 |      /       \
                                on    Det       Noun
                                       |         |
                                      the       mat
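
As a rough illustration of how such a tree could be held in a program, the diagram above can be written as nested tuples of the form (label, child, child, ...), with the words as plain strings at the leaves. This is only a sketch of one possible representation, not something proposed in the thread; the names TREE, leaves and bracketed are invented for the example, and the labels are copied from the diagram as drawn.

```python
# The parse tree from the diagram, as nested (label, *children) tuples.
TREE = (
    "Sentence",
    ("Noun phrase",
     ("Determiner", "The"),
     ("Noun", "cat")),
    ("Verb phrase",
     ("Verb", "sat"),
     ("Noun phrase",
      ("Prep", "on"),
      ("Noun phrase",
       ("Det", "the"),
       ("Noun", "mat")))),
)


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    _label, *children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)          # a leaf: an actual word
        else:
            words.extend(leaves(child))  # a subtree: recurse into it
    return words


def bracketed(node):
    """Render the tree in the one-line bracket notation linguists use."""
    label, *children = node
    parts = [c if isinstance(c, str) else bracketed(c) for c in children]
    return "[" + label + " " + " ".join(parts) + "]"
```

Here leaves(TREE) recovers the flat segment a translation memory would store, while bracketed(TREE) keeps the hierarchy, which is the extra information a flat segment store does not have.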

Thanks for your feedback. Am I right in recognizing what I call a semantic layer? I know this domain a bit, as I worked for around two years altogether on [the documentation of] the SPIRIT software - you might have heard about it, because it was a pioneer in electronic document management systems.

One of the lessons I learnt is that, while very interesting, this kind of approach represents a huge amount of work for every natural language added to such a system.

To the best of my limited knowledge, it also requires building a dictionary for each such language, in order to be able to recognize words and assign each of them a category (noun, verb, etc.). Things also tend to get worse with technical documentation, in that you keep adding new entries to these dictionaries nearly all the time (for each flavour of technical jargon you happen to be translating).

I think this kind of approach can come at a later stage, and I hope that whatever the Free CATS server ends up as will be compatible with such a system, but I believe most team members will agree that it should be done later on. I hope I don't sound pessimistic on this issue. Let me know if I missed your point.

Of course, the question of taking into account the various forms of a given word (conjugation for verbs, singular/plural forms for nouns & adjectives) might lead us to just such a *simplified* semantic layer. This is what we wanted to avoid by using N-grams (substrings of words), relying on the brute force of statistics ("cat" being a substring of "cats") instead of "intelligent" processing.
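
To make the N-gram idea concrete, here is a small Python sketch of how substring statistics can relate "cat" and "cats" without any dictionary or grammar. The choice of 3-character N-grams and of the Dice coefficient as the score are assumptions made for the example (nothing has been decided on either), and the function names are invented.

```python
def char_ngrams(text, n=3):
    """Return the set of all n-character substrings of text (lower-cased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient on character n-grams: 0.0 (nothing shared) to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # one of the strings is shorter than n characters
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(similarity("cat", "cats"))   # share the trigram "cat"
    print(similarity("cat", "dog"))    # share nothing
    print(similarity("The cat sat on the mat", "The cats sat on the mat"))
```

"cat" yields the single trigram "cat", and "cats" yields "cat" and "ats", so the two words share one of their three trigrams in total and score 2/3, whereas "cat" and "dog" score 0. The same function works unchanged on whole segments, which is what makes the approach language-independent.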

I hope Keith will soon explain to us in detail how he did things with OmegaT.


Cheers,

Henri




