freecats-dev

[Freecats-Dev] Re: Advogato, FreeCATS DB


From: Henri Chorand
Subject: [Freecats-Dev] Re: Advogato, FreeCATS DB
Date: Sun, 26 Jan 2003 15:43:09 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Dear Charles,

> I created the Savannah user name "chalst", which you can name a
> developer.

Done - we're six now.

> You might also like to name yourself Lead Developer on
> http://advogato.org/proj/FreeCATS - just visit this page with
> the cookie saying you have logged in and you can name your
> relationship to the project.

Yes, done. I also inserted a short attempt at a CV in my personal page.

> "true maths expert" ... no I'm not that.  I use maths in my work,
> but my skills are not comparable to those of a real mathematician.

Well, you are still the most skilled in maths (among other things) on the project team, so I would be very happy if you could provide feedback on several subjects (as you presently do) concerning the overall architecture, the database, and the segment or indexing algorithms. For many of the questions we translators raise, your experience and skills let you quickly point out non-optimal options and provide advice. I stopped maths at college.

I'm presently working on a draft document about indexing which assumes we may use something "basic and small but robust" LIKE Berkeley DB (or simpler). I'm trying to make it modular (it provides room for language-specific processing where needed). I'll do my best to publish it ASAP (late this evening?).


> Two questions:
>   - Why do you need a DB server?  Why not just use the filesystem?

Well, why not? Anyway, I'm not coding yet, and the pseudocode I have begun writing should fit this filesystem option.

>   If the answer is performance reasons, isn't this premature
>   optimisation?

I don't know. I don't see the need for RDBMS-level features, so I tried to pick a small-footprint solution, intermediate between a raw filesystem and a full-fledged RDBMS, at least to help me develop a few ideas.

I don't mind if we end up with just a bunch of flat files, cleanly organised. In fact, now that I think of it, a filesystem should be even faster, and we could use directories, so this might be what we need.
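To make the "flat files plus directories" idea concrete, here is a minimal sketch: each translation unit (TU) is stored under a path derived from a hash of its source segment, so no DB server is needed and directories stay small. All the names here (tu_path, store_tu, the ".tu" extension) are illustrative assumptions, not actual FreeCATS code.

```python
# Sketch only: flat-file storage of translation units, keyed by a hash
# of the source segment, with a two-level directory fan-out.
import hashlib
from pathlib import Path

def tu_path(root: Path, source_segment: str) -> Path:
    """Map a source segment to a file path via a two-level hash fan-out."""
    digest = hashlib.sha1(source_segment.encode("utf-8")).hexdigest()
    # e.g. root/ab/cd/abcd....tu -- keeps each directory small.
    return root / digest[:2] / digest[2:4] / f"{digest}.tu"

def store_tu(root: Path, source: str, target: str) -> Path:
    """Write a source/target pair as a small flat file; return its path."""
    path = tu_path(root, source)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(source + "\n" + target, encoding="utf-8")
    return path
```

Lookup is then just recomputing the hash and reading the file, which is roughly what an intermediate key-value store like Berkeley DB would do for us anyway.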

>   The subversion folks (also a Savannah project) built a version
>   controlled file system on top of Berkeley DB, btw.  Demanding a DB
>   server might mean we can't make use of a P2P architecture, which
>   might be useful for some applications we might find interesting down
>   the line.

I can't tell; I don't see P2P in this project yet, which is NOT to say it won't prove useful some day.
Of course, let's avoid being too specific if possible.

>   Also Raph Levien (advogato's founder, and a very nice person)
>   has done some thinking about designing data representations
>   for very large, hierarchically structured objects that allow
>   fast seeks.

I had to try several times to access the People section of the Advogato Web site (probably busy), then I had a quick look at his Web site. You are probably referring to:
http://www.levien.com/athshe/
It's a little technical for me at first sight, but I think I can get the broad picture.
Let me know if you have other important URLs at hand.

>   - Purely string based matching can't do hierarchical structure.
>   Won't we need that to do CATS properly?

I don't really see your point here. Once you have read my draft document, you can probably elaborate (show me what I did and didn't do, and where it's fine, broken or useless).

>   I recall a talk by Paul Hudak (Yale University), who did some
>   rough-and-ready hierarchical parsing on English and managed to
>   figure out the roles (noun, verb, determiner, etc.) of words
>   with a very good degree of accuracy and a quite simple lexicon.

I can't tell, I only collected a few opinions.

Some of the useful knowledge I have about text indexing dates from when I was documentation manager for the SPIRIT software at Technologies GID (http://www.technologies-gid.com, France), based on research done by Christian Fluhr (I've met him and also know his wife). SPIRIT was among the first products to perform semantic analysis, which involves a lot of language-specific processing (an awful drawback if you want to support as many languages as possible, as it represents several man-months of work per language).

In fact, SPIRIT only ever worked well with French, English and German. Arabic was part of the first prototype, built in a French university, but was not kept up to date. The last I heard, they were working to add Russian, but the reason (so I was told in a whisper) could have been that Christian Fluhr's wife was born in Russia ;-) I never heard of a customer in Russia. The present development team is really small now, and I don't think they'll make major breakthroughs - in any case, it's a proprietary product.

A general lesson I learned while working there is that, for various reasons, semantic processing is (much) simpler in English than in several other languages. A trivial example is how simply English verbs conjugate (as you live in Dresden, you must have some knowledge of German grammar).

>   I thought at the time that building the required lexicon is
>   something that could be automated using off-the-shelf AI techniques.

For now, I (and the other translators in the project team) would really prefer to avoid any semantic-level processing. Of course, if it is harmless, nobody will object to designing our server in a way that would make it easier to implement later on.

The "trouble" is also that CAT mostly deals with technical documentation, so keeping lexicons up to date could be a real pain and become a source of errors if the relevant features are not well understood.

In my draft document, I try to keep the language-specific bits (building up the list of N-grams from a TU's source segment) in a well-defined place.
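For concreteness, the language-specific step mentioned above could look something like the following sketch, which builds the list of overlapping character N-grams from a source segment. The function name and the choice of character trigrams are illustrative assumptions, not taken from the draft document.

```python
# Sketch only: extracting overlapping character N-grams from a TU's
# source segment, as a language-neutral default for indexing.

def char_ngrams(segment: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a (lowercased) segment."""
    s = segment.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("Index", 3))  # ['ind', 'nde', 'dex']
```

A language-specific module would only need to replace this one function (e.g. with word stems or a different N) without touching the rest of the indexing pipeline.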


Regards,

Henri




