RE: UTF-8 doc scanning

help-flex

From:	Thurn, Martin
Subject:	RE: UTF-8 doc scanning
Date:	Thu, 7 Oct 2004 08:20:14 -0400

> Theoretically, is an 8-bit scanner suited to matching UTF-8
> regular expressions?

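  That depends on what exactly you mean by "UTF-8 regexen".  An 8-bit scanner
can match UTF-8 if you write the patterns at the byte level: start by reading
the UTF-8 spec and create one pattern per sequence length.  I did this years
ago, and my patterns looked like the ones below (each match is ONE Unicode
character).  The standard may have changed since then: RFC 3629 (November 2003)
limits UTF-8 to sequences of at most four bytes, so the five- and six-byte
rules are obsolete under it.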

UB     [\200-\277]
%%
[\300-\337]{UB}             { UNICODE }
[\340-\357]{UB}{2}          { UNICODE }
[\360-\367]{UB}{3}          { UNICODE }
[\370-\373]{UB}{4}          { UNICODE }
[\374-\375]{UB}{5}          { UNICODE }
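
For comparison, here is a minimal sketch of the same byte-level idea restricted
to the ranges RFC 3629 allows.  It is written in Python rather than flex only
so that it can be run without generating a scanner.  UB is the definition from
the patterns above; UTF8_CHAR and split_utf8 are hypothetical names made up for
this illustration, and the stricter ranges are an assumption taken from
RFC 3629, not from the original patterns.

```python
import re

# Continuation byte, same as the UB definition above:
# octal \200-\277 is hex 0x80-0xBF.
UB = rb"[\x80-\xbf]"

# Assumption: strict RFC 3629 ranges (not part of the original patterns).
# Lead bytes 0xC0, 0xC1 and 0xF5-0xFF never appear; 0xE0, 0xED, 0xF0 and
# 0xF4 narrow the range of the first continuation byte to exclude overlong
# forms, surrogates, and code points above U+10FFFF.
ALTERNATIVES = [
    rb"[\x00-\x7f]",                        # 1 byte: ASCII
    rb"[\xc2-\xdf]" + UB,                   # 2 bytes
    rb"\xe0[\xa0-\xbf]" + UB,               # 3 bytes, no overlong forms
    rb"[\xe1-\xec\xee\xef]" + UB + rb"{2}", # 3 bytes, general case
    rb"\xed[\x80-\x9f]" + UB,               # 3 bytes, no surrogates
    rb"\xf0[\x90-\xbf]" + UB + rb"{2}",     # 4 bytes, no overlong forms
    rb"[\xf1-\xf3]" + UB + rb"{3}",         # 4 bytes, general case
    rb"\xf4[\x80-\x8f]" + UB + rb"{2}",     # 4 bytes, up to U+10FFFF
]
UTF8_CHAR = re.compile(b"|".join(ALTERNATIVES))


def split_utf8(data):
    """Return a list with one bytes object per encoded character.

    Raises ValueError at the first byte offset where no pattern matches.
    """
    chars = []
    pos = 0
    while pos < len(data):
        m = UTF8_CHAR.match(data, pos)
        if m is None:
            raise ValueError("invalid UTF-8 at byte offset %d" % pos)
        chars.append(m.group())
        pos = m.end()
    return chars
```

As in the flex rules, each match consumes exactly one encoded character, so
split_utf8("a\u00e9".encode("utf-8")) yields two items, the second of which is
two bytes long.  Whether a scanner should reject overlong or surrogate
sequences, or pass them through as the original patterns do, depends on the
application.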

 - - Martin

Current Thread

UTF-8 doc scanning, Raman Muthukrishnan, 2004/10/06
- RE: UTF-8 doc scanning, Thurn, Martin <=
- Re: UTF-8 doc scanning, Hans Aberg, 2004/10/28