[Freecats-Dev] Translation Units indexing - a first draft (cont.)


From: Henri Chorand
Subject: [Freecats-Dev] Translation Units indexing - a first draft (cont.)
Date: Thu, 23 Jan 2003 19:17:22 +0100

> 1) Unicode character properties should be used to determine if
> a character is a letter/digit/punctuation mark.

Sure. The chart on http://www.stixfonts.org/charactertable.html gives a
rough idea.
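
As a rough illustration of point 1, here is a minimal sketch in Python that uses the general category from the standard unicodedata module. The function name char_class and the coarse class labels are my own assumptions, not something we have agreed on.

```python
import unicodedata

def char_class(ch: str) -> str:
    """Return a coarse class for one character, based on its Unicode general category."""
    cat = unicodedata.category(ch)  # two-letter code such as 'Lu', 'Nd', 'Po'
    if cat.startswith("L"):
        return "letter"
    if cat.startswith("N"):
        return "number"        # covers Nd (decimal digits), Nl and No
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("Z") or ch.isspace():
        return "space"
    return "other"             # symbols, control characters, etc.

for ch in "aé7,字 ":
    print(repr(ch), char_class(ch))
```

The same test works unchanged for accented letters, kana and Chinese characters, which are all reported as letters.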

> 2) Word breaking is a challenging problem (think Japanese, Thai...).

As I see it, it's not much of a problem with Chinese or Japanese:
kana (hiragana & katakana) are phonetic syllabaries, so they can be dealt
with much like alphabet-based scripts.
See at:
http://kanji.free.fr/tabl_kana.php3?type=hira
http://kanji.free.fr/tabl_kana.php3?type=kata
Well, I know it's in French, but who cares at this stage :-)
Kanji are Chinese characters, possibly adapted and pronounced differently.
Mainland China uses simplified Chinese characters, Taiwan uses traditional
Chinese ones.
Chinese characters (ideograms) are in fact one-"letter" words, so each
ideogram will be indexed without even needing to extract n-grams from it.

Thai is much more of a problem in that, at least traditionally, all words
(which are monosyllabic) are glued together to form a sentence.

> 3) You'll need to apply some kind of normalization such as case folding
> otherwise: "text" and "TEXT" will never fuzzy match. What about accents?

I would say no for accents: let's treat accented letters as what they are in
any character set, i.e. characters distinct from the corresponding
unaccented letters.
And what about accented capital letters? Can anybody confirm that they are
separate characters in Unicode, as they are in ANSI? (See below about case
management.)

Also remember that in French, for instance, an accent is sometimes the ONLY
difference between two words with different meanings, like in:
"hue" ("gee up", to a horse) and "hué" ("jeered")
"rue" (street) and "rué" (kicked)
(sorry if these are the only examples my horse and I can think of).
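
To make the accent question concrete, here is a small Python sketch. It assumes we normalize to NFC and case-fold for the index key; the helper names index_key and strip_accents are hypothetical. As far as I can tell, accented capitals do have their own code points in Unicode (É is U+00C9, é is U+00E9), and case folding maps one to the other without touching the accent.

```python
import unicodedata

def index_key(text: str) -> str:
    """Hypothetical index key: compose characters (NFC), then fold case. Accents are kept."""
    return unicodedata.normalize("NFC", text).casefold()

def strip_accents(text: str) -> str:
    """What we would NOT want here: decompose (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(hex(ord("É")), hex(ord("é")))            # two distinct code points
print(index_key("HUÉ") == index_key("hué"))    # case is folded away
print(index_key("hué") == index_key("hue"))    # the accent still separates the words
print(strip_accents("hué") == "hue")           # stripping accents would merge them
```

The NFC step is there because the same accented letter can also arrive as a base letter followed by a combining accent, and both forms should give the same key.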

Case might be a different matter. We should interpret a case difference as a
"less close" match, but we might also:
- ignore case during all indexing steps (index entries normalized into
all-lower-case strings)
- re-inject the case difference at the final stage (so as to lower match
values).
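
The two steps above could be sketched like this in Python. The penalty value and the names CASE_PENALTY, index_key and adjust_for_case are placeholders I made up for illustration, and the sketch only covers two strings that are identical apart from case.

```python
CASE_PENALTY = 0.05  # hypothetical value, to be tuned

def index_key(text: str) -> str:
    """Step 1: index entries are normalized so that case is ignored."""
    return text.casefold()

def adjust_for_case(base_score: float, query: str, candidate: str) -> float:
    """Step 2: lower the score when the strings differ only in case."""
    same_ignoring_case = index_key(query) == index_key(candidate)
    if same_ignoring_case and query != candidate:
        return base_score * (1.0 - CASE_PENALTY)
    return base_score

print(adjust_for_case(1.0, "text", "text"))  # exact match, no penalty
print(adjust_for_case(1.0, "TEXT", "text"))  # same word, different case
```

A fuller version would probably count how many characters differ in case rather than apply one flat penalty.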

Thierry and others, let me know what you think of it.

> 4) If you want the TM to scale, only store in the index the smallest
> "sub-words" (N-Grams in the literature)
>
> Otherwise for a word of 8 characters that's 15 words and subwords
> that you'll need to store.
> 1 - 8 character "word"
> 2 - 7 characters "sub-words"
> ...
> 5 - 3 characters "sub-words"
>
> Even if memory is cheap we shouldn't waste it :).

Yes, sure. Remember I attempted a first draft :-)
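
To put numbers on this, here is a quick Python sketch comparing "every sub-word" with "smallest N-Grams only". The function names are assumptions of mine. For an 8-character word, the count of 15 corresponds to lengths 8 down to 4; going down to 3 characters gives 21, against only 6 entries if we keep trigrams alone.

```python
def all_substrings(word: str, min_len: int) -> list[str]:
    """Every contiguous sub-word of `word` with at least `min_len` characters."""
    return [word[i:i + n]
            for n in range(len(word), min_len - 1, -1)
            for i in range(len(word) - n + 1)]

def fixed_ngrams(word: str, n: int = 3) -> list[str]:
    """Only the sub-words of exactly `n` characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "indexing"                        # 8 characters
print(len(all_substrings(word, 4)))      # 15
print(len(all_substrings(word, 3)))      # 21
print(fixed_ngrams(word))                # 6 trigrams
```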

> Given 2) and 4) maybe you don't care about words anymore and just
> quickly split the TU into N-Grams. It worked well for us :).

I LOVE this idea.

Yes, in fact, let's look at it this way:
we don't need to state that a full word deserves (in itself) a better score
than one of its N-Grams, simply BECAUSE it will already SCORE HIGHER than any
one of its N-Grams (when compared to the same full word):
word a = (matches) word b
MEANS that
all N-Grams of word a match those of word b,
so the total score is higher.
QED (Quod Erat Demonstrandum)
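
The argument above can be checked with a small Python sketch. The scoring formula is an assumption: I use the Dice coefficient on sets of trigrams purely as an example of an overlap score, and we have not decided on any formula yet. The function names are hypothetical too.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Set of case-folded N-Grams of `text` (words are not treated specially)."""
    text = text.casefold()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_score(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient on the two N-Gram sets: 1.0 when they are identical."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0  # strings shorter than n have no N-Grams in this sketch
    return 2 * len(ga & gb) / (len(ga) + len(gb))

full = overlap_score("translation", "translation")
partial = overlap_score("translation", "translator")
print(full, partial)
```

Identical words share all of their N-Grams, so they get the top score without any special rule for whole words, while a word that only shares some N-Grams scores lower.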

Sorry for shouting, but I'm quite happy with the whole idea. Now, this is
what brainstorming is about.

> Optimization: Not all N-Grams are born equal and some are
> more frequent than others in each language. Based on this some
> optimization methods can be used
> to filter out the less relevant ones e.g. "ing" in English.

Well, this is a language-specific process.

I suggest we check whether it improves things (apart from saving disk
space) ONCE a more basic implementation is done.
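
For when we get there, here is one possible way to find the over-frequent N-Grams, sketched in Python. It is not necessarily the method referred to in the quoted message: I simply count in how many TUs each N-Gram occurs and flag those above a threshold. The 50% threshold and the function names are assumptions.

```python
from collections import Counter

def ngrams(text: str, n: int = 3) -> set[str]:
    """Set of case-folded N-Grams of one translation unit."""
    text = text.casefold()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def frequent_ngrams(units: list[str], n: int = 3, max_share: float = 0.5) -> set[str]:
    """N-Grams occurring in more than `max_share` of the units."""
    doc_freq = Counter()
    for unit in units:
        doc_freq.update(ngrams(unit, n))   # a set, so each unit counts once per N-Gram
    limit = max_share * len(units)
    return {gram for gram, count in doc_freq.items() if count > limit}

units = ["opening file", "closing file", "saving data", "print"]
print(frequent_ngrams(units))
```

Since the counts come from the TM itself, this stays language-independent in code, even though the result differs per language.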

About disk space: how the DBMS stores data & index may be more important
than the way we're able to discard some sub-strings as index entries.


Regards,

Henri




