Re: [Tetum-translators] Machine Translation - some early considerations


From: Peter Gossner
Subject: Re: [Tetum-translators] Machine Translation - some early considerations
Date: Wed, 24 Mar 2004 16:42:15 +1030

On Sun, 21 Mar 2004 23:10:47 -0800 (PST)  from a terminal far far away
<Lev/>  wrote:
>

Hey there Lev, got this the other day and am still digesting the
ramifications. I generally agree with everything but make two
observations:

1/ The earlier we error-check, the (exponentially) better the payback
- the single word is (still) an appropriate first check, I believe
- phrases after that. (cool)

2/ An interface for "teaching" the "spirit reference system" the sentence
structures and phrase associations may be more important than the user
interface. (chicken and egg moment there though)

i.e. no human being has a long enough life to enter every possible
combination... My time (at least) would be better spent automating the
tagging association engine. (Even if that alone took a whole year it
would still have a huge payback. X number of users.)

I think I have the beginnings of an algorithm forming.

Process view, like so (per sentence):

check spelling -> tag unknowns -> identify sentence -> map syntax /
parts of speech -> tag plurals / *nyms (synonyms, homonyms etc.)
-> [somehow ID known phrase / clause associations, e.g. pipe : water :
send : smoke (n.b. case lead)] -> generate specific case association(s)
(and keep them for later use?) -> map the target language syntax ->
map target phrases -> eliminate bad cases -> write translation

Then offer a reverse translation to see how sensible the result is.
If not sensible: offer alternative 1, etc.
Offer to remember / learn the associations.

I need to clean that up but hopefully you can see the idea.
(after all it's yours :)

Man, that database engine is going to have to be slick!
(and human friendly)

End Thoughts for the day:
Pete

-- 
Todays fortune:
Q:      What is printed on the bottom of beer bottles in Minnesota?
A:      Open other end.
     
< http://www.gnu.org/software/tetum/ >
< http://bigbutton.com.au/~gossner >
< address@hidden >


>I've just picked up a useful (if old) book from a
>local op-shop, Communication and Language (1965). It's
>quite a scholarly overview of the subject.
>
>A real treasure for our purposes is found in the
>Appendix which deals with Machine Translation, which
>derives from early attempts to translate English into
>Russian. Whilst much of it deals with the ancient
>methods of computerised storage (punch cards) and
>character encoding, there is a very useful section on
>programming and semantics.
>
>Using the sentence "The pipe filled with water" as an
>example, the book points out that "pipe" has at least
>three meanings, water pipe, smoking pipe and organ
>pipe which may be entirely different words in a
>foreign language. The verb "to pipe" can be rejected
>due to the previous definite article "the" - this
>emphasizes the importance of translating _phrases_
>rather than just words. 
>
>Furthermore some elements of speech may be idiomatic.
>The nearest foreign equivalent of "smoke the peace
>pipe" might be "eat the peace bread". Likewise, "pipe
>dreams" might be "cloud thoughts" or "piping hot" may
>be "bakestone-hot" etc.
>
>Finally, "filled" has no separate entry of its own in
>English dictionaries and is found under the root word.
>
>Programming on the basis of context is considered
>labour intensive and improbable to implement given
>(remember this is an old book) storage requirements.
>The following pseudo code is derived from the text.
>
>Search preceding and following words in sentence
>[n.b. the sentence is a data structure that must be
>used in machine translation for context-specific
>words.]
>
>If sentence contains "water" OR "buried" OR "metal"
>then use translation  A
>
>Else
>
>If sentence contains "tobacco" OR "smoke" OR "puff"
>then use translation B
>
>Else 
>
>If sentence contains "music" OR "tune" OR "play" then
>use translation C
>
>
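
A minimal Python sketch of the keyword rule above, assuming a plain
lower-cased word match. The sense labels A/B/C come from the pseudo-code;
the table and function names are made up for illustration, and the cue
words are matched as written (so "smoked" would not match "smoke" until
stems are split off).

```python
# Hypothetical cue-word table: the sense labels A/B/C follow the
# pseudo-code above; all names here are illustrative only.
SENSE_CUES = [
    ("A", {"water", "buried", "metal"}),   # water pipe
    ("B", {"tobacco", "smoke", "puff"}),   # smoking pipe
    ("C", {"music", "tune", "play"}),      # organ pipe
]


def choose_pipe_sense(sentence):
    """Return the first sense whose cue words occur in the sentence, else None."""
    # Strip surrounding punctuation and lower-case each word before matching.
    words = {w.strip(".,;:!?\"'").lower() for w in sentence.split()}
    for sense, cues in SENSE_CUES:
        if words & cues:
            return sense
    return None


print(choose_pipe_sense("The pipe filled with water"))  # A
```

The list is tried in order, so ordering the senses by likelihood is just a
matter of reordering the table.
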
>Now this scheme (and, it must be pointed out,
>WordNet) severely underestimates the complexity. For
>example, a water pipe could also carry any form of
>liquid or, in extremis, anything in a liquid state! A
>programme would have to account for lead pipes,
>aluminium pipes, steel pipes, plastic pipes, pipes
>that carry water, gas(es), sewage etc.
>
>Now whilst one could order according to likelihood,
>the size (if not the complexity) of the project is
>obvious - and this is just one word! As the article
>mentions, there are at least 20 words commonly
>associated with pipe. At that rate a good-sized
>dictionary (say 100,000 words) would require at least
>2,000,000 correlations!
>
>[Nota bene: I'm not the world's best _detail_
>programmer, although my broad pseudo-code is good. My
>capacity to do data hack work (like tagging associated
>words) is surprisingly high. Just mentioning this for
>future allocation of resources.]
>
>Idiomatic phrases (e.g., "It is raining cats and
>dogs") can also be sorted in a similar manner (which
>means we need to find a collection of Tetum,
>Portuguese, English and Bahasa idioms). To translate
>"rains cats and dogs" the programme must include
>something like:
>
>If the next word after "cats and" is "dogs"
>and the previous word is "rain"
>then use Idiom Y.
>
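
The idiom check above might look something like this in Python. It is only
a sketch: it assumes the sentence has already been reduced to a list of
stems (so "raining" and "rains" both arrive as "rain"), and the table and
function names are hypothetical. "Idiom Y" is the placeholder from the
pseudo-code, not a real translation.

```python
# Hypothetical idiom table keyed by sequences of stems.
IDIOMS = {
    ("rain", "cats", "and", "dogs"): "Idiom Y",
}


def find_idioms(stems):
    """Return (start_index, idiom_label) for every idiom found in the stems."""
    hits = []
    for start in range(len(stems)):
        for pattern, label in IDIOMS.items():
            # Compare a window of the same length as the idiom pattern.
            if tuple(stems[start:start + len(pattern)]) == pattern:
                hits.append((start, label))
    return hits


print(find_idioms(["it", "is", "rain", "cats", "and", "dogs"]))
# [(2, 'Idiom Y')]
```

Matching on stems means the stem split has to happen first, which bears on
the question of where in the programme the idiom check belongs.
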
>The example leads me to think that the idiom
>translation check may even need to be the first thing
>in the programme, or at least something bundled into a
>"common phrases" component. I leave this question
>moot.
>
>The separation of affixes from stems may have changed
>since these early days of machine translation.
>Nevertheless, I provide pseudo-code for the separation
>process in English.
>
>
>Function (split stems)
>For each word in sentence
>  If word ends in 's' then split off 's', record rest of
>  word as stem
>  else
>  If word ends in 'g', the letter before that is 'n' and
>  the letter before that is 'i' then split off 'ing',
>  record rest of word as stem
>  else
>  If word ends in 'd' and the previous letter is 'e' then
>  split off 'ed', record rest of word as stem
>  else
>  Record word as stem
>  Consider next word
>
>
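
As a rough Python sketch, the three suffix rules above come out as follows.
The function names are illustrative, and the length guard (never strip a
suffix that is the whole word) is an added assumption that is not in the
pseudo-code.

```python
def split_stem(word):
    """Split a word into (stem, affix) using the three suffix rules above."""
    for affix in ("s", "ing", "ed"):  # same order as the pseudo-code
        # Assumption: do not strip a suffix that makes up the entire word.
        if word.endswith(affix) and len(word) > len(affix):
            return word[:-len(affix)], affix
    return word, ""


def split_stems(sentence):
    """Apply split_stem to every whitespace-separated word in the sentence."""
    return [split_stem(w) for w in sentence.lower().split()]


print(split_stems("The pipe filled with water"))
```

The rules are as naive as they look: "is" comes back as ("i", "s") and
"piped" as ("pip", "ed"), so a real version would need an exception list
or a dictionary check on the resulting stem.
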
>A recommended procedure for sentences at this stage
>would be as follows
>
>Input text
>Check idioms
>Split affixes from stems
>Record stems and affixes
>Translate stems
>Translate affixes
>Reorder text
>Output text
>
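
To see how the steps above hang together, here is a toy end-to-end sketch
in Python. Everything in it is an assumption for illustration: the tables
hold angle-bracket placeholders rather than real target-language words, the
names are invented, and the "reorder text" step is left out altogether.

```python
# Toy tables with placeholder targets; no real Tetum vocabulary is implied.
IDIOMS = {"raining cats and dogs": "<IDIOM-Y>"}
STEMS = {"the": "<the>", "pipe": "<pipe>", "fill": "<fill>",
         "with": "<with>", "water": "<water>"}
AFFIXES = {"s": "<plural>", "ing": "<continuous>", "ed": "<past>", "": ""}


def split_stem(word):
    """Split off 's', 'ing' or 'ed' if present; otherwise the word is the stem."""
    for affix in ("s", "ing", "ed"):
        if word.endswith(affix) and len(word) > len(affix):
            return word[:-len(affix)], affix
    return word, ""


def translate(text):
    text = text.lower()                          # input text
    for idiom, target in IDIOMS.items():         # check idioms first
        text = text.replace(idiom, target)
    out = []
    for word in text.split():
        if word.startswith("<"):                 # idiom already translated
            out.append(word)
            continue
        stem, affix = split_stem(word)           # split and record stem/affix
        # Translate stem and affix; unknown stems are flagged, not guessed.
        out.append(STEMS.get(stem, "[" + stem + "?]") + AFFIXES[affix])
    return " ".join(out)                         # reordering is not attempted


print(translate("The pipe filled with water"))
# <the> <pipe> <fill><past> <with> <water>
```

Unknown stems are marked with a question mark so they can be tagged later
rather than silently dropped.
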
>"In Russian there are up to 12 inflected forms for
>nouns, 20 for adjectives and nearly 100 for verbs. To
>put every form in a dictionary as a separate entry
>would be an inefficient use of storage space".
>
>But not any longer, I would suggest! Storing every
>form may be a substantial saving in programming,
>albeit a hefty and onerous addition to data entry.
>
>
>=====
>Lev Lafayette
>address@hidden
>http://au.geocities.com/lev_lafayette
>




