Re: bison for nlp

help-bison

From:	r0ller
Subject:	Re: bison for nlp
Date:	Mon, 19 Nov 2018 16:09:14 +0100 (CET)
Hi Akim,

I managed to take the first step and get it running but it wasn't as easy as I 
thought. First, I wanted to take the approach that the 'Simple C++ Example' 
demonstrates in the bison manual. However, I could not figure out what my 
yylex() should return when defining api.value.type variant. In the C parser I 
returned an int, so to avoid turning everything upside down in one big 
step, I stuck to that concept and tried what happens if I define my tokens as 
%token <int> mytoken number, where 'mytoken' is just a token name and 'number' 
is the number assigned to it. As a real example:

%token <int> t_Con 1

The question is, in case of having many tokens like I do, how do I decide which 
shall be returned, as bison now generates a symbol_type make_TOKEN() for each 
token which I shall be able to return in yylex(). Though, I'd rather not put a 
huge switch() in yylex(). Is there any other solution like defining a "dummy" 
token like

%token <int> INT;

whose constructor make_INT(const int&) would simply return the int passed to 
it? Or shall I simply try to cast the integers of my tokens to symbol_type?

The other problem I ran into was related to the non-terminals: wherever I 
wanted to read the value of a symbol in an action via e.g. $1, I got an error 
about type conversion, as it could no longer be converted to an integer as in 
the C parser. For this I have only one guess, namely that each non-terminal 
needs a %type declaration like

%type <int> ENG_Con;

and even the = operator needs to be defined for it, right? So here I got stuck, 
at least with regard to api.value.type variant.

Then I decided to take a step back and not to use complete symbols but split 
symbols for a first try. This I managed to figure out and make it work but with 
a small hack as I declared yylex as:

int yylex(int* yylval);

If I did it like:

int yylex(semantic_type* yylval);

the compiler kept complaining about not knowing semantic_type (nor 
parser::semantic_type, nor yy::parser::semantic_type). So I took clang's hint 
when it said semantic_type* is aka int* and it worked. In the end, to make my 
hack a bit nicer, I added %define api.value.type {int}. Still, the 
question in this case would be, how shall yylex() be correctly declared?
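
A minimal sketch of how the declaration could look, assuming the default 
namespace yy and class name parser. In the generated C++ parser, semantic_type 
is a member typedef of the parser class, so yylex() can only be declared with 
it after the class definition has been seen, for example after including the 
generated header or, in the grammar file, in an unqualified %code block. The 
yy::parser class below is a hand-written stand-in for the generated one, and 
the toy token numbers are made up for illustration:

```cpp
// Stand-in for the class Bison generates into the parser header. In a real
// project this would come from including the generated header instead.
namespace yy {
class parser {
public:
    typedef int semantic_type;  // mirrors %define api.value.type {int}
};
}  // namespace yy

// The declaration has to come after the class definition above, and the
// type has to be fully qualified; otherwise semantic_type is unknown.
int yylex(yy::parser::semantic_type* yylval);

// Toy scanner (illustrative only): returns the token numbers 2, 3 and then
// 0 (end of input), storing a semantic value through the pointer it gets.
int yylex(yy::parser::semantic_type* yylval) {
    static const int tokens[] = {2, 3, 0};
    static int next = 0;
    const int tok = tokens[next];
    if (tok != 0) ++next;   // stay on the end-of-input marker once reached
    *yylval = tok * 10;     // made-up semantic value
    return tok;
}
```

If the compiler still reports semantic_type as unknown, the declaration is 
probably placed somewhere that is emitted before the parser class, such as the 
%{ ... %} prologue.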

I also attached the source I ended up with which was originally generated for a 
C parser but I manually modified and played around with it till I got it 
working.

Thanks for any hints!

Best regards,
r0ller

-------- Original message --------
From: r0ller <address@hidden>
Date: 12 November 2018 11:43:47
Subject: Re: bison for nlp
To: Akim Demaille <address@hidden>
 
Hi Akim,

Sorry for the delay, I had to go through my own code to be able to answer your 
question about the tokens:) But to begin with your first observation, you're 
right: I should wrap that conditional ternary op for logging.

After going through the code, I concluded that currently it's only used to make 
sure that a constant (or unknown word) is mapped to the same symbol in each 
language having the token value 1. But it could be solved in a different way 
e.g. that each language can define its own symbol for constants in a newly 
introduced (customizing) db table. As you can guess, currently if the program 
bumps into a token with value 1, then it assumes that it's an unknown 
word/morpheme/constant. It seemed OK 8 years ago, but turning back the wheel 
now has its price. However, I think I'll do it even though it seems that 
individual numbering does not cause any problem as there are only two numbers 
to avoid conflicts (0 and 256). The other place where it'd be used is the 
symbol prediction (where I need to remap a token to a symbol) in case of an 
error but that method is currently not called at all as it does not yet work 
well and now I just return the bison error message about the expected symbols. 
Mine would have the additional functionality on top that it'd not only tell 
what's syntactically expected but would return a subset of those symbols which 
are semantically expected.

Concerning the C++ bison wrapper, what I mentioned is simply that in 2010, when 
I started the project, I read an article somewhere which made that statement, 
and I never validated it. But now I'm pretty curious about bison's C++ 
features, so I'll definitely go through the documentation you sent and try 
to turn mine into a C++ parser:)

Best regards,
r0ller
-------- Original message --------
From: Akim Demaille <address@hidden>
Date: 9 November 2018 06:13:29
Subject: Re: bison for nlp
To: r0ller <address@hidden>
 
Hi!
> On 7 Nov 2018, at 10:09, r0ller <address@hidden> wrote:
>
> Hi Akim,
>
> The file hi_nongen.y is just left there as the last version that I wrote 
> manually:) If you check out any other hi.y files in the platform specific 
> directories (e.g. the one for the online demo is 
> https://github.com/r0ller/alice/blob/master/hi_js/hi.y but you can have a 
> look in hi_android or hi_desktop as well) you’ll see what they look like 
> nowadays.
You have tons of
logger::singleton()==NULL?(void)0:logger::singleton()->log(2,"vm is NULL!");
you could introduce logger::log, or whatever free function,
that does that for you instead of having to deal with that
in every call site.
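
A minimal sketch of such a free function. Only logger::singleton() and 
log(level, message) are taken from the line quoted above; the rest of the 
logger class, including install() and the entries member, is a made-up 
stand-in so that the example is self-contained, and log_message is a 
hypothetical name:

```cpp
#include <string>
#include <vector>

// Made-up stand-in for the project's logger class.
class logger {
public:
    static logger* singleton() { return instance_; }
    static void install(logger* l) { instance_ = l; }  // assumed helper
    void log(int level, const std::string& message) {
        entries.push_back(std::to_string(level) + ": " + message);
    }
    std::vector<std::string> entries;  // collected messages, for inspection

private:
    static logger* instance_;
};
logger* logger::instance_ = nullptr;

// Free function that does the null check once, so call sites do not need
// the ternary expression any more.
inline void log_message(int level, const std::string& message) {
    if (logger* l = logger::singleton()) {
        l->log(level, message);
    }
}
```

A call site would then shrink to log_message(2, "vm is NULL!"); and silently 
do nothing when no logger is installed.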
> Numbering tokens was introduced in the very beginning, and I have asked 
> myself quite a few times whether it's still needed. I didn’t try hard 
> to get rid of it mainly due to one reason: I want to have an error handling 
> that tells in case of an error which symbols could be accepted instead of the 
> erroneous one just as bison itself does it but in a structured way (as bison 
> returns that info in an error message string).
Where are these numbers used?
> Though, I could not come up with any better idea when it comes to remapping a 
> token to a symbol. As far as I know bison uses internally the tokens and not 
> the symbols for the terminals and it's not possible to get back a symbol 
> belonging to a certain token. That's it roughly but I'd be glad to get rid of 
> it. However, if it's not possible and poses no problems then I can live with 
> it. By the way, are there any number ranges or specific numbers that are 
> reserved?
Some numbers are reserved, yes: 0 for eof and 256 for error (per POSIX). For 
error, Bison can accommodate if you use 256. EOF must be 0.
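
As a small illustration, a grammar generator that numbers tokens itself could 
guard against the two values named above. This is only a sketch; the constant 
and function names are hypothetical:

```cpp
// Token numbers to avoid when numbering tokens by hand, per the reply above:
// 0 is end of input and 256 is the error token.
constexpr int kEndOfInput = 0;
constexpr int kErrorToken = 256;

// Returns true if n is one of the two reserved numbers.
constexpr bool is_reserved_token_number(int n) {
    return n == kEndOfInput || n == kErrorToken;
}
```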
> Not using the C++ features of bison has historical reasons: I started writing 
> the project in C and even back then I used yacc which I later replaced with 
> bison. When I started to shift the project to C++ I was glad that it still 
> worked with the generated C parser and since then I never had time to make 
> such an excursion but it'd be great. I also must admit that I wasn't really 
> aware of it. The only thing I read somewhere was that bison has a C++ wrapper 
> but have never taken any steps into that direction.
I don’t know what you mean here: this is bison itself, there’s
no need for a wrapper, and the deterministic parser itself is
genuine C++, not C++ wrapping C. The GLR parser in C++ though _is_
a wrapper for the C GLR parser.
> Now I think I'll find some time for it -at least to check it out:) Could you 
> give me any links pointing to any tutorial or something like that? It’d be 
> very kind if you could help me in taking the first steps, thanks!
I would very much like to have your opinion on the open section of the
documentation about C++. It’s recent, and it probably needs polishing.
https://www.gnu.org/software/bison/manual/bison.html#A-Simple-C_002b_002b-Example
hi.y
Description: Binary data