Re: [Freecats-Dev] Translation Units indexing - a first draft


From: David N. Welton
Subject: Re: [Freecats-Dev] Translation Units indexing - a first draft
Date: 22 Jan 2003 18:42:10 -0800
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

"Thierry Sourbier" <address@hidden> writes:

> 1) Unicode character properties should be used to determine if a
> character is a letter/digit/punctuation mark.

Tcl deals with this out of the box, in theory:

string is punct $foo

        punct     Any Unicode punctuation character.
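
As a rough sketch of how that could look for the indexer, the other
"string is" classes (alpha, digit, space) can be combined with punct to
label each character.  The proc name classify_chars is made up for
illustration, and I'm assuming a Tcl new enough to have -strict:

```tcl
# Label each character of a string using Tcl's Unicode-aware
# "string is" classes.  classify_chars is a hypothetical helper name.
proc classify_chars {text} {
    set result {}
    foreach ch [split $text ""] {
        # -strict makes the test fail on an empty string
        if {[string is alpha -strict $ch]} {
            lappend result letter
        } elseif {[string is digit -strict $ch]} {
            lappend result digit
        } elseif {[string is punct -strict $ch]} {
            lappend result punct
        } elseif {[string is space -strict $ch]} {
            lappend result space
        } else {
            lappend result other
        }
    }
    return $result
}

# \u00e9 is e-acute, to show a non-ASCII letter being classified
puts [classify_chars "a1, \u00e9"]
```

Whether the classes line up with the Unicode properties you need for
every script is something to verify rather than take on faith.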

> 2) Word breaking is a challenging problem (think Japanese, Thai...).

Tcl also has 'string wordend', which returns the index just past the
end of the word containing a given character.
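
A minimal sketch of walking a string with it, keeping only the tokens
made of word characters (letters, digits, underscore).  The proc name
split_words is hypothetical, and this is just Tcl's built-in notion of
a word, not a real word breaker:

```tcl
# Split a string into words using "string wordend".
# split_words is a hypothetical helper name.
proc split_words {text} {
    set words {}
    set i 0
    set len [string length $text]
    while {$i < $len} {
        # Index just past the word (or single non-word char) at $i
        set end [string wordend $text $i]
        if {$end <= $i} {
            # Defensive: always make progress
            set end [expr {$i + 1}]
        }
        set token [string range $text $i [expr {$end - 1}]]
        # Keep only tokens made of word characters
        if {[string is wordchar -strict $token]} {
            lappend words $token
        }
        set i $end
    }
    return $words
}

puts [split_words "Translation units, 2 of them."]
```

Since a "word" here is just a run of alphanumerics or underscores, text
written without spaces between words would come back as one long token.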

I don't know whether these do what they should for Asian character
sets, although my inclination on any open source project is to get it
running first.  Maybe that means starting the project with European
languages and then mixing in others.  The disadvantage is that you
might have to rework things later, but at least you get something
people can use, and then they get interested in your project...

> 4) If you want the TM to scale only store in the index the smallest
> "sub-words" (N-Gram in the literature)

This is not really my department:-)

-- 
David N. Welton
   Consulting: http://www.dedasys.com/
     Personal: http://www.dedasys.com/davidw/
Free Software: http://www.dedasys.com/freesoftware/
   Apache Tcl: http://tcl.apache.org/
