Re: UTF-8 doc scanning

help-flex

From:	Hans Aberg
Subject:	Re: UTF-8 doc scanning
Date:	Thu, 28 Oct 2004 19:21:13 +0200
User-agent:	Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

At 11:43 -0700 2004/10/06, Raman Muthukrishnan wrote:
>Does anyone have experience with scanning a UTF-8 doc
>with UTF-8 regular expressions?
>Theoretically, is an 8-bit scanner suited to matching UTF-8
>regular expressions?

One idea is perhaps to make Flex a UTF-8 scanner in the future. The
advantage of this approach is that the index ranges of the scanner tables do
not become larger: the input is still scanned one byte at a time, so the
tables stay indexed over 2^8 values. There is a patch for 16-bit characters,
made in the days when Unicode would fit into 16 bits:
  ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
But then the scanner tables become very large, since they are indexed over
2^16 values.
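
To illustrate the byte-level approach, here is a minimal sketch in Python
(not Flex code, and not part of any Flex release) of a pattern that matches
one well-formed UTF-8 character using only byte classes. The names UTF8_CHAR
and split_utf8 are made up for this example; the byte ranges follow the
well-formed UTF-8 byte sequence table in the Unicode Standard.

```python
import re

# One well-formed UTF-8 encoded character, written purely in terms of
# byte classes, so every input symbol lies in the range 0..255.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7F]"                          # 1 byte:  U+0000..U+007F
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes: U+0080..U+07FF
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, overlong forms excluded
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, surrogates excluded
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, overlong forms excluded
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
)


def split_utf8(data):
    """Split a bytes object into its UTF-8 encoded characters.

    Raises ValueError at the first byte offset where no well-formed
    UTF-8 sequence starts.
    """
    chars = []
    pos = 0
    while pos < len(data):
        m = UTF8_CHAR.match(data, pos)
        if m is None:
            raise ValueError("ill-formed UTF-8 at byte offset %d" % pos)
        chars.append(m.group())
        pos = m.end()
    return chars
```

Calling split_utf8 on the UTF-8 encoding of a string returns one bytes
object per character, each one to four bytes long. Since each alternative
is an ordinary sequence of byte ranges, similar alternatives could
presumably be written as patterns for an 8-bit scanner, at the cost of more
states rather than wider tables.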

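Unicode code points now require 21 bits; if input units of 21 bits or more
("UTF-21+") are to be admitted, the scanner tables would be indexed over 2^21
values, so one would need to write table compression algorithms.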
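
As one example of what such compression could look like, here is a minimal
sketch in Python of a two-stage (paged) table. It is only an illustration of
a common technique, not something taken from Flex or from the patch above.
The block size of 256 and the classifier toy_class are assumptions made up
for the example; a real scanner would supply its own character classes.

```python
MAX_CODE_POINT = 0x10FFFF        # largest Unicode code point (fits in 21 bits)
BLOCK_BITS = 8                   # assumption: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS


def toy_class(cp):
    """Hypothetical character classes, for illustration only."""
    if cp < 0x80:
        return 1 if chr(cp).isalnum() else 0
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs block
        return 2
    return 3


def build_two_stage(class_of):
    """Compress a code point -> class mapping into (index, blocks).

    Identical blocks of BLOCK_SIZE consecutive code points are stored
    only once; `index` records which stored block each range uses.
    """
    index = []    # stage 1: one entry per block of code points
    blocks = []   # stage 2: the distinct blocks
    seen = {}     # block contents -> position in `blocks`
    for start in range(0, MAX_CODE_POINT + 1, BLOCK_SIZE):
        block = tuple(class_of(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Return the class of code point cp using the two-stage table."""
    return blocks[index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]
```

With index, blocks = build_two_stage(toy_class), a lookup costs two array
accesses instead of one. The saving depends entirely on how many blocks
repeat: large stretches of code points that share one class collapse into a
single stored block, while a mapping with little repetition would compress
poorly.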

-- 
  Hans Aberg
