From: Matti Katila
Subject: Re: [Grammatica-users] Want to Use the PArser as non-deterministic for Natural Language Processing
Date: Tue, 12 Jul 2005 17:08:06 +0300 (EEST)
Hi Andres,
On Fri, 8 Jul 2005, Andres Hohendahl wrote:
> I am working on a (personal) natural language processing project, to play
> around with syntactic, semantic and morphological processing.
That's a big project, I would say :)
> I also want to parse several part-of-speech segments for NL in order to
> get correct grammar testing using EBNF and C# under the .NET framework.
>
> There are lots of mutually exclusive parts when defining the different
> tokens as words, and
I don't think there is much use for many different "word" tokens. Just slice
the text into sentences, then into words and punctuation marks.
Then all the fun starts.
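(Just to make "slicing" concrete: a rough, stand-alone Java sketch, nothing to
do with Grammatica's own API; the class name and regexes are made up for
illustration.)

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal illustration only: split raw text into sentences, then each
// sentence into word and punctuation tokens.
public class SimpleSlicer {
    // A "sentence" is anything up to and including '.', '!' or '?'.
    private static final Pattern SENTENCE = Pattern.compile("[^.!?]+[.!?]");
    // A "token" is a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    public static List<List<String>> slice(String text) {
        List<List<String>> sentences = new ArrayList<>();
        Matcher s = SENTENCE.matcher(text);
        while (s.find()) {
            List<String> tokens = new ArrayList<>();
            Matcher t = TOKEN.matcher(s.group());
            while (t.find()) {
                tokens.add(t.group());
            }
            sentences.add(tokens);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints: [[A, cat, has, a, hat, .]]
        System.out.println(slice("A cat has a hat."));
    }
}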
> the dictionary is neither feasible nor practical to load
> as EBNF,
True, and you cannot even use Grammatica with too rich a grammar, since the
generated parser might expand beyond 64K, which is the maximum class size
with Java (oh, I wouldn't count on C# still working fine either =)
> also the natural grammar is heavily context- or inter-token
> dependent,
Do you have a good word database?
For example, if there is an input "A cat has a hat.", it would match a Subject
Predicate Object pattern.
(sorry, my knowledge of spoken languages and the right terms is limited)
A,a = noun(alphabet), or adverb
cat = noun
hat = noun
has = verb
The token stream for such a sentence would be:
word(noun, adverb), word(noun), word(verb), word(noun, adverb) and word(noun)
Well, it sounds like you would need a context-sensitive tokenizer where
different possibilities are tried to match against the token stream.
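(To make the "different possibilities" idea concrete, here is a rough,
stand-alone Java sketch, not Grammatica code; the Word/WordClass names and the
hard-coded class sequence are made up for illustration. It recursively tries
each candidate class per word until one assignment matches the expected
sequence.)

import java.util.List;

// Minimal illustration only: each word carries several candidate word
// classes, and a recursive search tries every combination against an
// expected sequence of classes (one crude "production").
public class AmbiguousMatch {
    enum WordClass { NOUN, VERB, ADVERB }

    record Word(String text, List<WordClass> candidates) {}

    // True if some choice of one candidate per word matches the pattern.
    static boolean matches(List<Word> words, List<WordClass> pattern, int i) {
        if (i == pattern.size()) return i == words.size();
        if (i == words.size()) return false;
        for (WordClass c : words.get(i).candidates()) {
            if (c == pattern.get(i) && matches(words, pattern, i + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Word> sentence = List.of(
            new Word("A",   List.of(WordClass.NOUN, WordClass.ADVERB)),
            new Word("cat", List.of(WordClass.NOUN)),
            new Word("has", List.of(WordClass.VERB)),
            new Word("a",   List.of(WordClass.NOUN, WordClass.ADVERB)),
            new Word("hat", List.of(WordClass.NOUN)));
        // Expected class sequence for "A cat has a hat."
        List<WordClass> pattern = List.of(WordClass.NOUN, WordClass.NOUN,
            WordClass.VERB, WordClass.NOUN, WordClass.NOUN);
        System.out.println(matches(sentence, pattern, 0));  // prints: true
    }
}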
> To allow this (I guess) I must make the tokenizer somewhat context-dependent
> and tokenize in several alternate ways using recursive pattern scanning,
> allowing it to explore the combinations of word functions that best fit a
> production.
>
> I think this can be done by adding a structure layer on top of the Token /
> Tokenizer classes, producing a callback or event to allow external classes
> and methods to operate and get the context data for this token, and finally
> there must be a trial-and-error or scoring step to select the most appropriate
> token which fulfills the production(s).
>
> I have already successfully coded several classes which check the
> functions of a word as a set of types, using affix reduction, dictionary
> lookup and intelligent de-stemming.
>
> Any suggestion or clue?
I couldn't follow the idea in the last three paragraphs, but I wish you the
best of luck with your project. Maybe I could understand it if you provided
some examples.
-Matti