[Freecats-Dev] Source segments indexing method


From: Charles Stewart
Subject: [Freecats-Dev] Source segments indexing method
Date: Sun, 26 Jan 2003 07:32:01 -0500 (EST)

A general question: how ambitious is FreeCATS going to be?  Are we going
to model the hierarchical structure of language, or do we think this
task is too hard?  If we don't, how are we going to spot common idioms such
as "neither... nor..."?

I'll just restrict my comment to the Source segments indexing (SSI)
method at the moment.  I've attached some excerpts from earlier emails
to the end of this message:

        #1. What components of FreeCATS make use of the SSI method?
        Is it only to be used in building the corpus for the
        Translation Memory server, or do we use it as a preprocessor in
        translating text?
        #2. I agree with David Welton about starting with European
        languages for now, but I think we should make an effort to
        attract someone who knows Asian character sets.  I don't think
        we should figure this stuff out for ourselves, if none of us
        speaks an Asian language.  We shouldn't wait too long: if we
        work only with Indo-European languages, we might have some
        nasty surprises when we find that Korean, say, violates some
        assumptions we thought applied to all language texts;
        #3. Unicode character properties: clearly it is the right thing
        to use these;
        #4. I think it is better to work directly from the source text:
        it might sound like a harder problem to work with raw source
        files without any preprocessing, but:
                - It isn't as hard as it sounds.  Rather than work with,
                e.g., case-folded texts, we work with case-insensitive
                matches.  When case is useful, it is there to use (e.g.
                in German all nouns are capitalised, and can be the best
                token to distinguish otherwise ambiguous words);
                - We will lose potentially valuable information if we
                do things like throw away email headers;
                - We can be smarter without it.  E.g. if we translate an
                apparently English-language email into French, the
                preprocessor is unlikely to be smart enough to spot
                a C program fragment hidden in the body, but the translation
                software can be.  Let's call this the "envelope problem":
                figuring out all the ways in which to-be-translated text
                might be interwoven with to-be-passed-on verbatim text.
        #5. N-grams: easy to do this if we represent the lexicon using
        a state-transition diagram or even a recursive descent parser
        (the best are almost as fast as lexing regexps); see the sketch
        just after this list.
        #6. I'm against using fuzzy matching: if we build up a big
        corpus in a language, then we will have almost all actually
        occurring misspellings in that language.  Exact matching is
        much faster than fuzzy matching, and easier to design around.
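
To make point #5 a little more concrete, here is a minimal sketch of a lexicon
held as a state-transition structure: a character trie built from nested Tcl
dicts, walked one character at a time.  The procedure names (trie-add,
trie-longest-match) and the tiny lexicon are invented for the example; this is
an illustration of the idea, not a proposed implementation.

        # Build a trie (state-transition diagram) from lexicon entries.
        # Each node is a dict mapping the next character to a child node;
        # the key "end" marks an accepting state (end of a lexicon word).
        proc trie-add {trieVar word} {
            upvar 1 $trieVar trie
            set path [split $word ""]
            for {set i 1} {$i <= [llength $path]} {incr i} {
                set prefix [lrange $path 0 [expr {$i - 1}]]
                if {![dict exists $trie {*}$prefix]} {
                    dict set trie {*}$prefix [dict create]
                }
            }
            dict set trie {*}$path end 1
        }

        # Walk the trie over $text starting at $start; return the longest
        # lexicon entry that matches there, or "" if none does.
        proc trie-longest-match {trie text start} {
            set node $trie
            set best ""
            set current ""
            for {set i $start} {$i < [string length $text]} {incr i} {
                set ch [string index $text $i]
                if {![dict exists $node $ch]} { break }
                set node [dict get $node $ch]
                append current $ch
                if {[dict exists $node end]} { set best $current }
            }
            return $best
        }

        set lexicon [dict create]
        foreach entry {neither nor not} { trie-add lexicon $entry }
        puts [trie-longest-match $lexicon "neither rain nor snow" 0]  ;# neither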

Henri Chorand <address@hidden>:
        1) Parsing

        We need to parse the source segment and to split it into a sequence
        (a sorted list) of items (words, separators and tags).
        Definitions:
        Word            sequence of contiguous alphabetic and/or numeric
                        characters
        Separator       sequence of contiguous non-alphabetic and non-numeric
                        characters:
                        space
                        tab
                        punctuation marks . , ; : ! ? and the inverted ¿ ¡
                        ' " < > + - * / = _ ( ) [ ] { } hyphens & similar
                        various symbols (etc.)
                        non-breakable space
        Tag             tag belonging to our list of internal tags (let's
                        consider our file is already converted)
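
To make this parsing step concrete, here is a minimal Tcl sketch of such a
splitter.  It uses Tcl's regexp [[:alnum:]] class for the word/separator
distinction and assumes, purely as a placeholder since the internal tag format
is not settled in this thread, that tags look like {1}, {2}, ...

        # Split a source segment into a flat list of {type text} pairs.
        # Words are runs of alphanumeric characters, separators are runs
        # of anything else, and the tag pattern {N} is only a placeholder
        # for whatever internal tag syntax is finally chosen.
        proc split-segment {segment} {
            set items {}
            set pattern {(\{[0-9]+\})|([[:alnum:]]+)|([^[:alnum:]{}]+|[{}])}
            foreach {all tag word sep} [regexp -all -inline -- $pattern $segment] {
                if {$tag ne ""} {
                    lappend items [list tag $tag]
                } elseif {$word ne ""} {
                    lappend items [list word $word]
                } else {
                    lappend items [list sep $sep]
                }
            }
            return $items
        }

        foreach item [split-segment "Hello, {1}world{2}!"] {
            lassign $item type text
            puts "$type: '$text'"
        }
        # word: 'Hello'
        # sep: ', '
        # tag: '{1}'
        # word: 'world'
        # tag: '{2}'
        # sep: '!'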

"Thierry Sourbier" <address@hidden>:
        1) Unicode character properties should be used to determine if a
        character is a letter/digit/punctuation mark.

        2) Word breaking is a challenging problem (think Japanese, Thai...).

        3) You'll need to apply some kind of normalization such as case
        folding, otherwise "text" and "TEXT" will never fuzzy match. What
        about accents?

        4) If you want the TM to scale, only store in the index the smallest
        "sub-words" (N-grams in the literature).

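As a rough illustration of Thierry's point 4 (only a sketch, with an invented
procedure name and an arbitrary n of 3), sub-word n-grams can be pulled out of
each word and used as index keys like this:

        # Return the list of character n-grams (trigrams by default) of a
        # word; words shorter than n yield the whole word, so short words
        # still get an index entry.
        proc ngrams {word {n 3}} {
            set len [string length $word]
            if {$len <= $n} {
                return [list $word]
            }
            set grams {}
            for {set i 0} {$i + $n <= $len} {incr i} {
                lappend grams [string range $word $i [expr {$i + $n - 1}]]
            }
            return $grams
        }

        # Toy inverted index: n-gram -> segment ids (duplicates not filtered).
        # Keys are case-folded here, following Thierry's point 3.
        proc index-words {indexVar id words} {
            upvar 1 $indexVar index
            foreach w $words {
                foreach g [ngrams [string tolower $w]] {
                    dict lappend index $g $id
                }
            }
        }

        puts [ngrams "indexing"]   ;# ind nde dex exi xin ing
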
address@hidden (David N. Welton):
        > 1) Unicode character properties should be used to determine if a
        > character is a letter/digit/punctuation mark.

        Tcl deals with this out of the box, in theory:

        string is punct $foo

                punct     Any Unicode punctuation character.

        > 2) Word breaking is a challenging problem (think Japanese, Thai...).

        It also has a 'string wordend'.

        Don't know if these do what they should for Asian character
        sets, although my inclination on any open source project is to
        get it running first.  Maybe that means getting the project
        started with European languages, and then mixing in others. 
        This has the disadvantage that you might have to rework things
        later, but at least you get something people can use and
        then they get interested in your project...

        > 4) If you want the TM to scale, only store in the index the smallest
        > "sub-words" (N-grams in the literature).

        This is not really my department:-)
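
To see what Tcl's string is and string wordend give us today, here is a small
sketch that walks a string with string wordend and classifies each chunk with
string is.  Whether this behaves sensibly for Asian scripts is exactly the
open question above; the procedure name is invented for the example.

        # Walk a string chunk by chunk using Tcl's own notion of a word
        # (runs of alphanumerics/underscores, or any single other
        # character) and classify each chunk with "string is".
        proc classify-chunks {text} {
            set i 0
            set len [string length $text]
            while {$i < $len} {
                set end [string wordend $text $i]
                set chunk [string range $text $i [expr {$end - 1}]]
                if {[string is space -strict $chunk]} {
                    set kind space
                } elseif {[string is punct -strict $chunk]} {
                    set kind punct
                } elseif {[string is alnum -strict $chunk]} {
                    set kind word
                } else {
                    set kind other
                }
                puts "$kind: '$chunk'"
                set i $end
            }
        }

        classify-chunks "Hello, world!"
        # word: 'Hello'
        # punct: ','
        # space: ' '
        # word: 'world'
        # punct: '!'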

Henri Chorand again:
        > 2) Word breaking is a challenging problem (think Japanese, Thai...).

        As I see it, it's not too much of a problem with Chinese or
        Japanese: kana (hiragana & katakana) use alphabets, so they
        should be dealt with like other alphabet-based languages.

        See at:
        http://kanji.free.fr/tabl_kana.php3?type=hira
        http://kanji.free.fr/tabl_kana.php3?type=kata

        Well, I know it's in French, but who cares at this stage :-)
        Kanji use (possibly adapted, differently articulated) Chinese
        characters. Mainland China uses simplified Chinese characters,
        Taiwan uses traditional Chinese ones. Chinese characters
        (ideograms) are in fact one-"letter" words, so each ideogram
        will be indexed without even needing to extract n-grams from it.

        Thai is much more of a problem in that, at least traditionally,
        all words (which are monosyllabic) are glued together to form
        a sentence.

        > 3) You'll need to apply some kind of normalization such as case
        > folding, otherwise "text" and "TEXT" will never fuzzy match. What
        > about accents?

        I would say no for accents (let's consider them as they are in
        any character code set, different from the corresponding
        non-accented letters). And what about accented capital letters
        (can anybody confirm they are specific characters in Unicode,
        like in ANSI?) (see below about case management)

        Also remember that in French, for instance, the accent is the
        ONLY difference between otherwise identical words (with
        different meanings), as in:
        "hue" ("gee up", to a horse) and "hué" ("jeered")
        "rue" (street) and "rué" (kicked)
        (sorry if these are the only examples my horse and I can think of).

        Case might be different. We should interpret a case difference
        as a "less close" match, but we might also:

        - ignore case during all indexing steps (index entries
        normalized into all lower-case strings)

        - re-inject this difference at the final stage (so as to lower
        match values).
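
A rough way to combine those two options (only a sketch; the procedure names
and the 0.95 penalty are arbitrary): index under a lower-cased key, keep the
original segment, and re-inject the case difference as a lower match value at
lookup time.

        # Store segments under a lower-cased key, but keep the original
        # text so the case information is still there at match time.
        proc tm-add {tmVar segment} {
            upvar 1 $tmVar tm
            dict lappend tm [string tolower $segment] $segment
        }

        # An identical-case hit scores 1.0; a hit differing only in case
        # scores a little lower (0.95 is an arbitrary penalty).
        proc tm-lookup {tm segment} {
            set key [string tolower $segment]
            if {![dict exists $tm $key]} {
                return {}
            }
            set hits {}
            foreach candidate [dict get $tm $key] {
                if {[string equal $candidate $segment]} {
                    lappend hits [list 1.0 $candidate]
                } else {
                    lappend hits [list 0.95 $candidate]
                }
            }
            return $hits
        }

        set tm [dict create]
        tm-add tm "Gee up!"
        puts [tm-lookup $tm "gee up!"]   ;# {0.95 {Gee up!}}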
        



