freecats-dev

[Freecats-Dev] Re: Advogato, FreeCATS DB


From: Henri Chorand
Subject: [Freecats-Dev] Re: Advogato, FreeCATS DB
Date: Sun, 26 Jan 2003 15:43:09 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Dear Charles,

> I created the Savannah user name "chalst", which you can name a
> developer.

Done - we're six now.

> You might also like to name yourself Lead Developer on
> http://advogato.org/proj/FreeCATS - just visit this page with
> the cookie saying you have logged in and you can name your
> relationship to the project.

Yes, done. I also inserted a short attempt at a CV in my personal page.

> "true maths expert" ... no I'm not that.  I use maths in my work,
> but my skills are not comparable to those of a real mathematician.

Well, you are still the most skilled in maths (among other things) on the project team, so I would be very happy if you could provide feedback on several subjects (as you presently do) concerning the overall architecture, the database, and the segment or indexing algorithms. For many of the questions we translators raise, your experience and skills let you quickly point out non-optimal options and provide advice. I stopped maths at college.

I'm presently working on a draft document about indexing which assumes we may use something "basic and small but robust" LIKE Berkeley DB (or simpler). I'm trying to make it modular (it provides room for language-specific processing where needed). I'll do my best to publish it ASAP (late this evening?).


> Two questions:
>   - Why do you need a DB server?  Why not just use the filesystem?

Well, why not? Anyway, I'm not coding yet, and the pseudocode I have begun writing should fit this filesystem option.

>   If the answer is performance reasons, isn't this premature
>   optimisation?

I don't know. I don't see the need for RDBMS-level features, so I tried to pick a small-footprint solution, intermediate between a raw filesystem and a full-fledged RDBMS, at least to help me develop a few ideas.

I don't mind if we end up with just a bunch of flat files, cleanly organised. In fact, now that I think of it, a filesystem should be even faster, and we could use directories, so this might be what we need.
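To make the "flat files plus directories" idea concrete, here is a minimal sketch: each translation unit (TU) is stored under a path derived from a hash of its source segment, so no DB server is needed and directories stay small. All the names here (tu_path, store_tu, the ".tu" extension) are illustrative assumptions, not actual FreeCATS code.

```python
# Sketch only: flat-file storage of translation units, keyed by a hash
# of the source segment, with a two-level directory fan-out.
import hashlib
from pathlib import Path

def tu_path(root: Path, source_segment: str) -> Path:
    """Map a source segment to a file path via a two-level hash fan-out."""
    digest = hashlib.sha1(source_segment.encode("utf-8")).hexdigest()
    # e.g. root/ab/cd/abcd....tu -- keeps each directory small.
    return root / digest[:2] / digest[2:4] / f"{digest}.tu"

def store_tu(root: Path, source: str, target: str) -> Path:
    """Write a source/target pair as a small flat file; return its path."""
    path = tu_path(root, source)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(source + "\n" + target, encoding="utf-8")
    return path
```

Lookup is then just recomputing the hash and reading the file, which is roughly what an intermediate key-value store like Berkeley DB would do for us anyway.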

>   The subversion folks (also a Savannah project) built a version
>   controlled file system on top of Berkeley DB, btw.  Demanding a DB
>   server might mean we can't make use of a P2P architecture, which
>   might be useful for some applications we might find interesting down
>   the line.

I can't tell; I don't see P2P in this project yet, which is NOT to say it won't prove useful some day.
Of course, let's avoid being too specific if possible.

>   Also Raph Levien (advogato's founder, and a very nice person)
>   has done some thinking about designing data representations
>   for very large, hierarchically structured objects that allow
>   fast seeks.

I had to try several times to access the People section of the Advogato Web site (probably busy), then I had a quick look at his Web site. You are probably referring to:
http://www.levien.com/athshe/
It's a little technical for me at first sight, but I think I can get the broad picture.
Let me know if you have other important URLs at hand.

>   - Purely string based matching can't do hierarchical structure.
>   Won't we need that to do CATS properly?

I don't really see your point here. Once you have read my draft document, you can probably elaborate (show me what I did and didn't do, and where it's fine, broken or useless).

>   I recall a talk by Paul Hudak (Yale University), who did some
>   rough-and-ready hierarchical parsing on English and managed to
>   figure out the roles (noun, verb, determiner, etc.) of words
>   with a very good degree of accuracy and a quite simple lexicon.

I can't tell, I only collected a few opinions.

Some of the useful knowledge I have about text indexing dates from when I was documentation manager for the SPIRIT software at Technologies GID (http://www.technologies-gid.com, France), based on research done by Christian Fluhr (I've met him and also know his wife). SPIRIT was among the first products to perform semantic analysis, which involves a lot of language-specific processing (an awful drawback if you want to support as many languages as possible, as it represents several man-months of work per language).

In fact, SPIRIT only ever worked well with French, English and German. Arabic was part of the first prototype, built in a French university, but was not kept up to date. The last I heard, they were working to add Russian, but the reason (so I was told in a whisper) could have been that Christian Fluhr's wife was born in Russia ;-) I never heard of a customer in Russia. The present development team is really small now, and I don't think they'll make major breakthroughs - in any case, it's a proprietary product.

A general lesson I learned while working there is that, for various reasons, semantic processing is (much) simpler in English than in several other languages. A trivial example is how simply English verbs conjugate (as you live in Dresden, you must have some knowledge of German grammar).

>   I thought at the time that building the required lexicon is
>   something that could be automated using off-the-shelf AI techniques.

For now, I (and the other translators in the project team) would really prefer to avoid any semantic-level processing. Of course, if it is harmless, nobody will object to designing our server in a way that would make it easier to implement later on.

The "trouble" is also that CAT mostly deals with technical documentation, so keeping lexicons up to date could be a real pain and become a source of errors if the relevant features are not well understood.

In my draft document, I try to keep the language-specific bits (building up the list of N-grams from a TU's source segment) in a well-defined place.
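For concreteness, the language-specific step mentioned above could look something like the following sketch, which builds the list of overlapping character N-grams from a source segment. The function name and the choice of character trigrams are illustrative assumptions, not taken from the draft document.

```python
# Sketch only: extracting overlapping character N-grams from a TU's
# source segment, as a language-neutral default for indexing.

def char_ngrams(segment: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a (lowercased) segment."""
    s = segment.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("Index", 3))  # ['ind', 'nde', 'dex']
```

A language-specific module would only need to replace this one function (e.g. with word stems or a different N) without touching the rest of the indexing pipeline.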


Regards,

Henri




