help-flex

RE: Multi-byte support in 'flex'


From: Hack, Peter
Subject: RE: Multi-byte support in 'flex'
Date: Wed, 23 May 2001 13:50:44 -0400

        I've added some answers interspersed below.

> -----Original Message-----
> From: Kingpin [mailto:address@hidden]
> Sent: Wednesday, May 23, 2001 9:03 AM
> To: address@hidden
> Cc: address@hidden
> Subject: Re: Multi-byte support in 'flex'
> 
> 
> >     I've added support for multibyte characters to 'flex-2.5.4a'.  These
> 
> Please explain.  What is your definition of "multibyte characters"?
> Japanese?  Chinese?  Hex digit pairs???  Arabic?  What encoding?
> Unicode?

The change I made is not specific to a particular encoding.  It should
work for any multibyte encoding.

> 
> What is your definition of "allow"?

My changes eliminate erroneous pattern matches in cases I describe below.

> 
> What is it about plain flex that does NOT "allow" multibyte characters
> in the input?

The biggest problem with multibyte input and 'flex' has been dealing
with SJIS (Shift-JIS).  SJIS has some characters whose second byte has
the same code as a 7-bit ASCII character.  This causes patterns
containing those 7-bit ASCII characters (like '\', 0x5C) to match in the
middle of a multibyte character, where they shouldn't.  EUC and UTF-8
don't have this unfortunate characteristic.
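
To make the SJIS problem concrete, here is a minimal C sketch.  It is
not taken from 'flex' or 'flex2'; the function names are hypothetical.
It assumes the usual Shift-JIS lead-byte ranges (0x81-0x9F and
0xE0-0xFC) and uses the katakana character encoded as the bytes
0x83 0x5C, whose second byte equals ASCII '\'.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Byte-at-a-time search, the way a byte-oriented scanner sees input.
 * Returns the index of the first byte equal to target, or -1. */
long find_byte_naive(const unsigned char *s, size_t n, unsigned char target)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] == target)
            return (long)i;
    }
    return -1;
}

/* Character-aware search: a lead byte and its trail byte are skipped
 * as one unit, so a trail byte can never be mistaken for ASCII.
 * Returns the index of the first single-byte match, or -1. */
long find_byte_sjis(const unsigned char *s, size_t n, unsigned char target)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < n) {
        if (sjis_is_lead_byte(s[i]) && i + 1 < n) {
            i += 2;              /* skip the whole two-byte character */
            continue;
        }
        if (s[i] == target)
            return (long)i;
        i++;
    }
    return -1;
}
```

Given the three input bytes 0x83 0x5C 'a', find_byte_naive() reports a
backslash at index 1, while find_byte_sjis() reports none.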

> 
> You don't need to change flex to be able to scan Unicode, you just
> need to write your own patterns to match everything.  (Yes, I have
> done this for UTF-8 encoding.)

I'll briefly describe what I did.  You may not be interested in this
kind of solution because of its restrictions.  I'll call the modified
version of 'flex', 'flex2'.  The changes were fairly simple.

'flex2' generates scanner code that no longer uses the type 'char' to
hold characters from the input to be scanned.  Instead, 'FLEX_CHAR_T'
is used and is #define'd to be either 'char' or 'wchar_t' depending
on a compile-time flag.  If 'char' is chosen, 'flex2' scanners function
just the same as 'flex' scanners.  If 'wchar_t' is chosen, the
following changes are made:

1)  YY_INPUT() must now convert the input stream from multibyte to wide
        characters before placing them in the input buffer.
2)  All internal scanner pattern matching is done using 'wchar_t's.
3)  All 'wchar_t's outside of the 8-bit range are mapped to a single
        8-bit character when calculating state transitions.  By default,
        this is '^A' (0x01) but can be defined by the '.l' author to be
        any 8-bit character.
4)  Tokens are assembled in a new buffer, 'yywtext'.  The contents of
        this buffer are converted back to multibyte characters as they
        are copied to 'yytext' just before lex actions are done (in
        YY_DO_BEFORE_ACTION).  Thus, lex actions don't have to be changed
        to deal with 'wchar_t'.  Actions that want 'wchar_t' token
        strings can use 'yywtext' instead.
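
The type switch and item 3 can be sketched in C roughly as follows.
This is an illustration under stated assumptions, not the actual
'flex2' source: the flag name FLEX2_WIDE, the macro WIDE_CHAR_CLASS and
the function transition_index() are hypothetical names chosen for the
example.  Only 'FLEX_CHAR_T' and the default value 0x01 come from the
description above.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical compile-time flag; it would normally come from the
 * compiler command line rather than being defined here. */
#define FLEX2_WIDE 1

#ifdef FLEX2_WIDE
#define FLEX_CHAR_T wchar_t
#else
#define FLEX_CHAR_T char
#endif

/* Hypothetical macro: the single 8-bit stand-in for every wide
 * character outside the 8-bit range.  Defaults to ^A (0x01) and can
 * be overridden before this point. */
#ifndef WIDE_CHAR_CLASS
#define WIDE_CHAR_CLASS 0x01
#endif

/* Map one input character to the 8-bit value used to index the
 * state-transition tables. */
unsigned char transition_index(FLEX_CHAR_T c)
{
#ifdef FLEX2_WIDE
    /* Going through an unsigned type makes a negative wchar_t (where
     * wchar_t is signed) fall outside the 8-bit range as well. */
    unsigned long u = (unsigned long)c;
    return (u <= 0xFFul) ? (unsigned char)u
                         : (unsigned char)WIDE_CHAR_CLASS;
#else
    return (unsigned char)c;
#endif
}

/* Map a whole buffer of input characters to transition indices. */
void map_buffer(const FLEX_CHAR_T *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = transition_index(in[i]);
}
```

Under this scheme the generated tables stay 8 bits wide, and every wide
character outside that range is treated by the patterns exactly as the
one stand-in byte would be.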

                                Peter

> 
> -- 
>  - - Martin "Kingpin" Thurn                    address@hidden
>      Research Software Engineer           (703) 793-3700 x2651
>      The Information Refinery              http://tir.tasc.com
>      TASC, Inc.                            http://www.tasc.com
> 
> Size matters not.  Look at me! -- Yoda, The Empire Strikes Back
> 

Original message:
>>      I've added support for multibyte characters to 'flex-2.5.4a'.  These
>> changes do NOT allow multibyte characters to appear in patterns but allow the
>> input to the generated scanner to contain multibyte characters.  I did not add
>> support to all 'flex' modes (ex. '-+' to generate a C++ scanner) primarily
>> because I don't have need for those modes but partly because I don't have time
>> to develop tests for all of them.
>>      Is this of any value to GNU?


