Re: UTF-8 doc scanning
From: Hans Aberg
Subject: Re: UTF-8 doc scanning
Date: Thu, 28 Oct 2004 19:21:13 +0200
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

At 11:43 -0700 2004/10/06, Raman Muthukrishnan wrote:
>Does anyone have experience with scanning a UTF-8 doc
>with UTF-8 regular expressions?
>Theoretically is a 8-bit scanner suited to match UTF-8
>regular expressions?
One idea is perhaps to turn Flex into a UTF-8 scanner in the future. The
advantage of this approach is that the index ranges of the scanner tables do
not become larger: the input is still scanned one byte at a time, so the
tables stay indexed over 2^8. There is a patch for 16-bit characters, made in
the days when Unicode was expected to fit into 16 bits:
ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
But then the scanner tables become very large, indexed over 2^16.
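
To get a feeling for the growth, here is a rough back-of-the-envelope sketch
in Python. It assumes a fully uncompressed transition table with one entry per
(state, input symbol) pair; the state count of 1000 and the 2-byte entry width
are made-up illustration values, not figures taken from Flex or from the patch.

```python
# Rough size of an uncompressed DFA transition table: one entry per
# (state, input symbol) pair. num_states and bytes_per_entry are
# hypothetical illustration values, not measurements from Flex.
def table_bytes(num_states, alphabet_bits, bytes_per_entry=2):
    return num_states * (1 << alphabet_bits) * bytes_per_entry

for bits in (8, 16):
    size = table_bytes(1000, bits)
    print("%2d-bit index: %d bytes" % (bits, size))
```

Under these assumptions the 16-bit table is 256 times the size of the 8-bit
one, whatever the number of states.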
Unicode now needs 21 bits; if code points of 21 bits or more ("UTF-21+") are
to be used directly as table indices, one needs to write table compression
algorithms.
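
As an illustration of why the UTF-8 route keeps the index range small, here is
a minimal Python sketch (not Flex code; the function name utf8_bytes is my own)
that encodes a code point by hand. Every code point up to U+10FFFF becomes one
to four bytes, each below 256, so a byte-driven scanner never needs a table
column beyond 255.

```python
def utf8_bytes(cp):
    """Encode one Unicode code point as a list of UTF-8 byte values."""
    # Reject values outside Unicode and the surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),          # 4 bytes: 11110xxx + 3 continuations
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
    seq = utf8_bytes(cp)
    # Cross-check against Python's built-in encoder.
    assert bytes(seq) == chr(cp).encode("utf-8")
    print("U+%04X -> %s" % (cp, " ".join("%02X" % b for b in seq)))
```

The price is that a single character in a regular expression turns into a
sequence of up to four byte transitions, so the number of states grows while
the width of each table row stays the same.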
--
Hans Aberg