
[Tetum-translators] Machine Translation - some early considerations


From: Lev Lafayette
Subject: [Tetum-translators] Machine Translation - some early considerations
Date: Sun, 21 Mar 2004 23:10:47 -0800 (PST)

I've just picked up a useful (if old) book from a
local op-shop, Communication and Language (1965). It's
quite a scholarly overview of the subject.

A real treasure for our purposes is found in the
Appendix which deals with Machine Translation, which
derives from early attempts to translate English into
Russian. Whilst much of it deals with the ancient
methods of computerised storage (punch cards) and
character encoding, there is a very useful section on
programming and semantics.

Using the sentence "The pipe filled with water" as an
example, the book points out that "pipe" has at least
three meanings, water pipe, smoking pipe and organ
pipe which may be entirely different words in a
foreign language. The verb "to pipe" can be rejected
due to the previous definite article "the" - this
emphasizes the importance of translating _phrases_
rather than just words. 

Furthermore some elements of speech may be idiomatic.
The nearest foreign equivalent of "smoke the peace
pipe" might be "eat the peace bread". Likewise, "pipe
dreams" might be "cloud thoughts" or "piping hot" may
be "bakestone-hot" etc.

Finally, "filled" has no separate entry of its own in
an English dictionary and is found under the root word
"fill".

Programming on the basis of context is considered
labour intensive and impractical to implement, given
the storage requirements (remember, this is an old
book). The following pseudo-code is derived from the
text.

Search preceding and following words in sentence
[n.b. the sentence is a data structure that must be
used in machine translation for context-specific
words.]

If sentence contains "water" OR "buried" OR "metal"
then use translation  A

Else

If sentence contains "tobacco" OR "smoke" OR "puff"
then use translation B

Else 

If sentence contains "music" OR "tune" OR "play" then
use translation C
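
The cue-word test above can be sketched in Python. This
is only a minimal illustration: the table SENSE_CUES and
the function choose_pipe_sense are hypothetical names of
my own, and the labels A, B and C stand in for whatever
the three target-language words turn out to be.

```python
# Hypothetical cue-word table taken from the pseudo-code above.
# Order matters: the first sense with a matching cue word wins.
SENSE_CUES = [
    ("A", {"water", "buried", "metal"}),   # water pipe
    ("B", {"tobacco", "smoke", "puff"}),   # smoking pipe
    ("C", {"music", "tune", "play"}),      # organ pipe
]


def choose_pipe_sense(sentence):
    """Return the label of the first sense whose cue words occur
    in the sentence, or None if no cue word is present."""
    # Lower-case each word and strip surrounding punctuation.
    words = {w.strip('.,;:!?"\'').lower() for w in sentence.split()}
    for label, cues in SENSE_CUES:
        if words & cues:
            return label
    return None


print(choose_pipe_sense("The pipe filled with water"))
```

Note that this matches whole words only, so "smoked" or
"playing" would be missed unless affixes are split off
first.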


Now this scheme (and, it must be pointed out, WordNet)
severely underestimates the complexity. For example, a
water pipe could also carry any other liquid or, in
extremis, anything in a liquid state! A programme would
have to account for lead pipes, aluminium pipes, steel
pipes, plastic pipes, pipes that carry water, gas(es),
sewage etc.

Now whilst one could order according to likelihood,
the size (if not the complexity) of the project is
obvious - and this is just one word! As the appendix
mentions, there are at least 20 words commonly
associated with "pipe". At that rate a good sized
dictionary (say 100,000 words, each with 20 associated
words) would require at least 2,000,000 correlations!

[Nota bene: I'm not the world's best _detail_
programmer, although my broad pseudo-code is good. My
capacity to do data hack work (like tagging associated
words) is surprisingly high. Just mentioning this for
future allocation of resources.]

Idiomatic phrases (e.g., "It is raining cats and
dogs") can also be sorted in a similar manner (which
means we need to find a collection of Tetum,
Portuguese, English and Bahasa idioms). To translate
"rains cats and dogs" the programme must include
something like:

If next word after "cats and" = "dogs" AND
previous word is a form of "rain"
Then use Idiom Y.
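
A minimal Python sketch of such an idiom check, using the
standard re module. The table IDIOMS, the function
replace_idioms and the placeholder <IDIOM_Y> are all
hypothetical; a real table would hold the actual
target-language idiom.

```python
import re

# Hypothetical idiom table: each entry pairs a pattern with a
# placeholder for the target-language idiom ("Idiom Y" above).
IDIOMS = [
    (
        re.compile(r"\brain(?:s|ed|ing)?\s+cats\s+and\s+dogs\b",
                   re.IGNORECASE),
        "<IDIOM_Y>",
    ),
]


def replace_idioms(sentence):
    """Replace every known idiom in the sentence with its placeholder."""
    for pattern, placeholder in IDIOMS:
        sentence = pattern.sub(placeholder, sentence)
    return sentence


print(replace_idioms("It is raining cats and dogs"))
```

The pattern accepts "rain", "rains", "rained" and
"raining", but the sketch throws the tense away; a fuller
version would have to record it so the translated idiom
can be inflected.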

The example leads me to think that the idiom
translation check may even need to be the first thing
in the programme, or at least something bundled into a
"common phrases" component. I leave this question
moot.

The methods for separating affixes from stems may have
changed since these early days of machine translation.
Nevertheless, I provide pseudo-code for the separation
process in English.


Function (split stems)
For each word in sentence
  If word ends in 's' then split off 's', record rest
  of word as stem
  else
  If word ends in 'ing' then split off 'ing', record
  rest of word as stem
  else
  If word ends in 'ed' then split off 'ed', record
  rest of word as stem
  else
  Record word as stem
  Consider next word
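
The same three rules can be written as a short Python
sketch. The function names split_stem and split_stems are
my own, and the rules are deliberately as naive as the
pseudo-code: they are tried in the order 's', 'ing', 'ed'
and the first match wins.

```python
def split_stem(word):
    """Return (stem, affix) using the three suffix rules above.
    The affix is an empty string when no rule applies."""
    for suffix in ("s", "ing", "ed"):
        # Require at least one letter to remain as the stem.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


def split_stems(sentence):
    """Apply split_stem to every word of a lower-cased sentence."""
    return [split_stem(word) for word in sentence.lower().split()]


print(split_stems("The pipe filled with water"))
```

Rules this simple make mistakes: "glass" is split into
"glas" + "s" and "piped" into "pip" + "ed", so a real
version needs an exceptions list or a check of the stem
against the dictionary.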


A recommended procedure for sentences at this stage
would be as follows:

Input text
Check idioms
Split affixes from stems
Record stems and affixes
Translate stems
Translate affixes
Reorder text
Output text
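
The procedure above can be sketched as a chain of small
Python functions. Everything here is a hypothetical
skeleton: the idiom check and the reordering do nothing
yet, and the "translations" are bracketed placeholders
rather than real dictionary lookups, since we have no
dictionary data at this point.

```python
def check_idioms(text):
    # Placeholder: a real step would replace known idioms first.
    return text


def split_affixes(text):
    """Split each word into a (stem, affix) pair with naive suffix rules."""
    pairs = []
    for word in text.lower().split():
        for suffix in ("s", "ing", "ed"):
            if word.endswith(suffix) and len(word) > len(suffix):
                pairs.append((word[: -len(suffix)], suffix))
                break
        else:
            pairs.append((word, ""))
    return pairs


def translate_stem(stem):
    # Placeholder for a dictionary lookup of the stem.
    return f"[{stem}]"


def translate_affix(affix):
    # Placeholder for mapping an affix to a target-language form.
    return f"+{affix}" if affix else ""


def reorder(tokens):
    # Placeholder: word-order rules depend on the target language.
    return tokens


def translate(text):
    """Run the steps in the order listed above."""
    text = check_idioms(text)
    pairs = split_affixes(text)  # stems and affixes are recorded here
    tokens = [translate_stem(s) + translate_affix(a) for s, a in pairs]
    return " ".join(reorder(tokens))


print(translate("The pipe filled"))
```

Keeping each step as its own function means the idiom
check can be moved or bundled into a "common phrases"
component later without touching the rest.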

"In Russian there are up to 12 inflected forms for
nouns, 20 for adjectives and nearly 100 for verbs. To
put every form in a dictionary as a separate entry
would be an inefficient use of storage space".

But not any longer, I would suggest! Storing every
form as a separate entry may be a substantial saving
in programming, albeit a hefty and onerous addition to
data entry.


=====
Lev Lafayette
address@hidden
http://au.geocities.com/lev_lafayette
