RE: UTF-8 doc scanning
From: Thurn, Martin
Subject: RE: UTF-8 doc scanning
Date: Thu, 7 Oct 2004 08:20:14 -0400
> Theoretically, is an 8-bit scanner suited to matching UTF-8
> regular expressions?
That depends on what exactly you mean by "UTF-8 regexen". Start by reading the
UTF-8 spec and writing patterns from its byte ranges. I did this years ago, and
my patterns looked like this (each rule matches exactly ONE multi-byte Unicode
character). The standard may have changed since then.
UB      [\200-\277]
%%
[\300-\337]{UB}       { UNICODE }
[\340-\357]{UB}{2}    { UNICODE }
[\360-\367]{UB}{3}    { UNICODE }
[\370-\373]{UB}{4}    { UNICODE }
[\374-\375]{UB}{5}    { UNICODE }
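For comparison with the current definition: RFC 3629 (November 2003) restricts UTF-8 to sequences of at most four bytes, so the five- and six-byte rules above no longer match valid input, and the lead bytes \300, \301 and \365-\367 are likewise invalid. Below is a minimal C sketch, not part of the original patterns, that applies the RFC 3629 byte ranges to decide how many bytes the character at the start of a buffer occupies. The function name utf8_seq_len and its signature are hypothetical, chosen for illustration only.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post).
 * Returns the length in bytes (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes are not well-formed under RFC 3629.
 * n is the number of bytes available in the buffer. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] <= 0x7F) {
        return 1;                        /* plain ASCII */
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;                         /* 0xC0 and 0xC1 would be overlong */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* reject overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;     /* reject UTF-16 surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* reject overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation byte or 0xF5-0xFF */
    }

    if (n < len)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF)  /* same range as UB, [\200-\277] */
            return 0;
    }
    return len;
}
```

The function returns 0 for anything a strict decoder would reject, for example the overlong pair \300\200 or a five-byte sequence starting with \370, both of which the older patterns accept.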
-- Martin