RE: unicode support in flex

help-flex

From:	Mark Weaver
Subject:	RE: unicode support in flex
Date:	Thu, 24 Jan 2002 14:19:53 -0000
OK, I posted recently about having integrated the unicode patch into a
re-entrant flex.  I have tested this only loosely: I built some simple
Unicode scanners, and they seem to work fine.  It would be helpful to have a
number of older scanners to test this on.  There are a few issues to be
worked out:

1) Flex currently uses tables on the stack that are sized by the number
of characters in the largest supported character set.  With the unicode
patch, this goes up from 256 to 64K characters, giving you a much larger
stack requirement (it certainly exceeds 2MB, and I'm guessing about 4MB).
This causes a default Win32 build (which doesn't seem to be of interest to
people here anyway) to blow up, as the default stack is 1MB (this is easily
fixed).  I don't know what issues this will cause on other platforms.  It
might make more sense, for consistency, to allocate the tables
dynamically, so that you don't suffer the memory penalty when using a
modified flex to compile 7- or 8-bit scanners.
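
As a rough illustration of the dynamic-allocation idea, here is a minimal
sketch.  The helper names alloc_char_table() and char_table_bytes() are
hypothetical, and the choice of int entries is an assumption; the real flex
tables have their own names and types.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of flex: allocate a zero-filled table with
 * one int entry per character of the scanner's character set, so that an
 * 8-bit scanner pays for 256 entries and only a unicode scanner pays for
 * 65536. The caller releases the table with free(). */
int *alloc_char_table(size_t csize)
{
    if (csize == 0)
        return NULL;
    /* calloc() zero-fills and checks the size multiplication for overflow. */
    return calloc(csize, sizeof(int));
}

/* Bytes that one such table occupies for a given character-set size. */
size_t char_table_bytes(size_t csize)
{
    return csize * sizeof(int);
}
```

On a platform where int is 4 bytes, char_table_bytes(256) is 1 KiB and
char_table_bytes(65536) is 256 KiB, so sizing by the actual csize keeps the
cost of the larger character set away from 7- and 8-bit scanners.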

2) The unicode flex defines "YY_CHAR" as the character type used when
compiling the scanner.  The re-entrant flex defines a new state structure
for each scanner (PREFIX_flex_state).  To make the two work together, I've
had to define the character type PREFIX_CHAR in the generated header file
for a re-entrant scanner.  There is an icky ordering issue in the way the
re-entrant header is included: YY_CHAR is needed before YY_FLEX_STATE is
included, because YY_BUFFER_STATE needs to be defined and the
YY_BUFFER_STATE structure itself uses YY_CHAR.  To get around this, I
protect the definition of PREFIX_CHAR with the macro PREFIX_CHAR_DEFINED,
and reproduce the definition of YY_CHAR in the scanner file.  It isn't
pretty, but it does work, and it only applies when compiling the generated
scanner; the unpretty bit is the extra macro.  I'd be interested to know if
anyone can think of a way around this.
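
The guard-macro arrangement might look roughly like the sketch below.  The
exact placement of each piece, the use of wchar_t as the underlying type,
and the demo_buffer_state structure are assumptions made for illustration;
only the names PREFIX_CHAR, PREFIX_CHAR_DEFINED and YY_CHAR come from the
description above.

```c
#include <stddef.h>   /* wchar_t, size_t, NULL */

/* Sketch of what the generated re-entrant header might contain. */
#ifndef PREFIX_CHAR_DEFINED
#define PREFIX_CHAR_DEFINED
typedef wchar_t PREFIX_CHAR;
#endif

/* Sketch of what the scanner file might repeat before it needs the type.
 * The guard makes this a no-op if the header was already seen, and makes
 * the header's copy a no-op if this block comes first. */
#ifndef PREFIX_CHAR_DEFINED
#define PREFIX_CHAR_DEFINED
typedef wchar_t PREFIX_CHAR;
#endif

/* YY_CHAR is then just another name for the same type. */
typedef PREFIX_CHAR YY_CHAR;

/* Hypothetical stand-in for a buffer structure that needs YY_CHAR early. */
struct demo_buffer_state {
    YY_CHAR *buf;
    size_t   size;
};
```

Whichever copy of the guarded block the compiler reaches first wins, so the
include order no longer matters for the character type.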

3) As to whether this is the correct method of supporting unicode: possibly
not.  It works fine for 16-bit unicode character sets, but wouldn't work for
the (rarely used) 32-bit character set.  That isn't in common use, however
(cf. the xml.apache project, which supports only UTF-16).  I'm not familiar
with the flex internals, but the "super-fast" optimisation (sorry, I forgot
the name) is not allowed to be used with this patch.  Presumably this is
to do with the table sizes generated with 64K characters?
I would be grateful if someone more knowledgeable would comment on this.
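
To put a number on the table-size worry, here is a small sketch.  It
assumes an uncompressed transition table with one entry per (state,
character) pair; that layout, the function name full_table_bytes() and the
example figures are assumptions for illustration, not a description of the
tables flex actually emits.

```c
#include <stddef.h>

/* Size in bytes of a hypothetical uncompressed transition table that
 * stores one entry per (state, character) pair. */
size_t full_table_bytes(size_t num_states, size_t csize, size_t entry_size)
{
    return num_states * csize * entry_size;
}
```

Under those assumptions, a 100-state scanner with 2-byte entries needs
51,200 bytes for 256 characters but 13,107,200 bytes (12.5 MiB) for 65536
characters, a 256-fold increase.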

4) There is also an issue related to the above, which is that flex itself
reads 8-bit files as input.  Someone developing a real unicode scanner
would probably find it helpful to be able to give flex a unicode
scanner definition file, which they could then edit directly to
include whatever characters they need.  This shouldn't be too hard, since
the patched flex can itself generate a scanner that understands unicode.
I would consider this a secondary priority, however, as the same thing can
be done using character escapes; this is just an extra feature.
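
For the character-escape route, the sketch below shows how an escape of up
to four hex digits can be decoded into a value wider than one byte.  The
function parse_hex_escape() is hypothetical and is not flex's myesc(); it
only illustrates why the return type has to be wider than a char.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not flex's myesc(): decode an escape of the form
 * \xH, \xHH, \xHHH or \xHHHH and return its value as an int, or -1 if the
 * text is not such an escape. Returning int rather than a one-byte type
 * is what lets values above 255 come back intact. */
int parse_hex_escape(const char *s)
{
    int value = 0;
    size_t i;

    if (s == NULL || s[0] != '\\' || s[1] != 'x')
        return -1;

    /* Consume at most four hex digits after the "\x" prefix. */
    for (i = 2; i < 6 && isxdigit((unsigned char)s[i]); i++) {
        int c = tolower((unsigned char)s[i]);
        int digit = isdigit(c) ? c - '0' : c - 'a' + 10;
        value = value * 16 + digit;
    }

    /* Require at least one hex digit and nothing after the escape. */
    if (i == 2 || s[i] != '\0')
        return -1;
    return value;
}
```

For example, parse_hex_escape("\\x20AC") yields 0x20AC, a value that would
not fit in an 8-bit character type.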

5) As to the scope of the modifications, they are limited but reasonably
spread out.  The change notes from the patch are:

- ccl.c
  - ccladd()
    - changed call to reallocate_Character_array to reallocate_wchar_array

- ecs.c
  - mkeccl()
    - changed type of ccls from Char to wchar_t

- flex.1
  - added description of -U option
  - added extra qualifier to -Ca option regarding usage with -U
  - modified -C, -Cf, and -CF options regarding usage with -U

- flex.skl
  - changed all references of char (except error messages) to YY_CHAR
  - added new insertion point for defining YY_CHAR and YY_SC_TO_UI()
  - yy_scan_bytes()
    - renamed to yy_scan_chars to avoid confusion with 2-byte chars
    - renamed param bytes to chars
  - ECHO
    - redefined C version as fwrite(yytext, sizeof(YY_CHAR), yyleng, yyout)
  - YY_INPUT
    - removed char* cast on param buf of C++ version
  - yyFlexLexer::LexerInput()
    - changed get() call to read((unsigned char *) buf, sizeof(YY_CHAR))
    - changed read() call to read((unsigned char *) buf, max_size *
      sizeof(YY_CHAR))
    - changed gcount() call to gcount() / sizeof(YY_CHAR)
  - yyFlexLexer::LexerOutput()
    - changed write() call to write((unsigned char *) buf, size *
      sizeof(YY_CHAR))
  - yy_get_next_buffer()
    - yy_flex_realloc() call
      - changed param b->yy_buf_size + 2 to
        (b->yy_buf_size + 2) * sizeof(YY_CHAR)
  - input() and yyFlexLexer::yyinput()
    - changed line c = *(unsigned char *) yy_c_buf_p; to
      c = YY_SC_TO_UI(*yy_c_buf_p);

- flexdef.h
  - defined CSIZE as 65536
  - changed myesc() proto to return int
  - changed type of ccltbl from Char * to wchar_t *
  - added allocate_wchar_array() and reallocate_wchar_array() macros
  - changed mkeccl() proto's first param to wchar_t[]
  - changed cshell() proto's first param to wchar_t[]


- FlexLexer.h
  - changed all references of char (except error messages) to YY_CHAR
  - added description about typedef'ing YY_CHAR before inclusion

- gen.c
  - changed appropriate references of char in output strings to YY_CHAR
  - added C_uchar_decl and C_ushort_decl for 16-bit yy_ec type bump-down
  - genecs()
    - added code for 16-bit yy_ec type bump; -Ca bumps type to long
  - make_tables()
    - YY_INPUT
      - read redefined as read(..., max_size * sizeof(YY_CHAR))
      - added code to use getwc() and WEOF for 16-bit interactive
      - fread redefined as fread(buf, sizeof(YY_CHAR), max_size, yyin)

- main.c
  - changed type of ccltbl from Char * to wchar_t *
  - check_options()
    - changed default csize from CSIZE to 256 due to redef of CSIZE
    - added code to check for options incompatible with -U
    - added code to output wchar.h include and typedef of YY_CHAR to
      skeleton, plus extra call to skelout() to get down to original
      insertion point
  - flexend()
    - added code to print "U" when printing stats
  - flexinit()
    - added code to set csize for option -U
    - changed assignment of csize in option -8 from CSIZE to 256 due to
      redef of CSIZE
  - readin()
    - changed appropriate references of char in output strings to YY_CHAR
    - removed output of YY_CHAR typedef; now located in check_options()
  - usage()
    - added fprintf for -U usage

- misc.c
  - check_char()
    - added code to distinguish chars needing -8 and -U flags
  - cshell()
    - changed type of v from Char to wchar_t
    - changed type of k from Char to wchar_t
  - myesc()
    - now returns an int to handle 16-bit escape sequences
    - changed esc_char from Char to unsigned int as per htoi() and otoi()

- scan.l
  - changed ESCSEQ to accept 6 digit octal escapes and 4 digit hex escapes
  - removed myesc() and ndlookup() protos
  - added option "16bit"

- tblcmp.c
  - mktemplate()
    - changed type of transset from Char to wchar_t
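
Several of the flex.skl and gen.c notes above come down to the same idea:
I/O calls count characters, so the element size becomes sizeof(YY_CHAR)
instead of 1.  The sketch below illustrates that with plain stdio.  The
wrapper functions echo_chars() and input_chars() are hypothetical, and the
definition shown for YY_SC_TO_UI() is an assumption, since the notes name
the macro without giving its body.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumption: a unicode scanner typedefs YY_CHAR to wchar_t. Note that the
 * width of wchar_t is platform-dependent. */
typedef wchar_t YY_CHAR;

/* Assumed shape of the macro; the change notes do not show its body. */
#define YY_SC_TO_UI(c) ((unsigned int)(c))

/* ECHO-style output: leng counts characters, so the element size passed
 * to fwrite() is sizeof(YY_CHAR). Returns the number of characters written. */
size_t echo_chars(const YY_CHAR *text, size_t leng, FILE *out)
{
    return fwrite(text, sizeof(YY_CHAR), leng, out);
}

/* YY_INPUT-style input: read up to max_size characters, not bytes.
 * Returns the number of characters actually read. */
size_t input_chars(YY_CHAR *buf, size_t max_size, FILE *in)
{
    return fread(buf, sizeof(YY_CHAR), max_size, in);
}
```

Because both functions return counts in characters, callers can keep
treating lengths the way an 8-bit scanner does, without dividing byte
counts by sizeof(YY_CHAR) themselves.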

Thanks,

Mark

> -----Original Message-----
> From: address@hidden [mailto:address@hidden Behalf
> Of John W. Millaway
> Sent: 24 January 2002 13:36
> To: flex list
> Cc: Guillaume Morin
> Subject: Re: unicode support in flex
>
>
> > I would like to know if there are any plans to integrate the famous flex
> > unicode patch. If not, I'd like to know why since the patch does not
> > seem intrusive and is very useful.
>
> Yes, it is now on the table. However, the unicode patch is very intrusive
> and widespread, actually, as compared to, say, the yylineno/REJECT fix we
> talked about last week. So it needs thorough testing. As far as I know, it
> has undergone none. The reentrant stuff is done, now, so the patch can't be
> *directly* applied to the source anyway. But it shouldn't be a problem to
> redo the patch, in spirit, since it appears to be nicely documented and we
> now have a battery of regression tests. Also, we should verify that the
> methods used in the patch (~10 years ago?) are still the correct methods
> for supporting unicode, and more importantly, that the tables don't
> explode.