From: Alex Shinn
Subject: Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Date: Fri, 20 Jul 2012 22:44:28 +0900

On Fri, Jul 20, 2012 at 8:56 PM, Дмитрий <address@hidden> wrote:
>
> As for the character classes, they can be generated quite easily from the
> UnicodeData.txt[1] file. We can get the general category[2] of each entry
> with something like (string->symbol (caddr (string-split line ";"))); then we
> just need to map the categories onto the appropriate character classes (e.g.
> Lu belongs to upper, alpha, alphanum, graph, etc.) and merge characters of
> the same category into ranges when they have adjacent codes.
> It's quite easy to do. If I'm not lazy I'll do this this weekend.
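
For reference, the generation step described in the quote could be sketched
roughly as below. This is only an illustration in plain Scheme, not how the
utf8 egg or irregex does it. It assumes semicolon-separated UnicodeData.txt
lines sorted by code point and case-sensitive symbols (as in CHICKEN and
R7RS). The names split-fields, line->entry, merge-ranges, category-classes
and classes-of are made up for this example, and only the Lu row of the
class table comes from the quoted message; the Ll and Nd rows are assumed.

```scheme
;; Split STR on the character SEP, keeping empty fields.
(define (split-fields str sep)
  (let loop ((i (- (string-length str) 1))
             (end (string-length str))
             (fields '()))
    (cond ((< i 0) (cons (substring str 0 end) fields))
          ((char=? (string-ref str i) sep)
           (loop (- i 1) i (cons (substring str (+ i 1) end) fields)))
          (else (loop (- i 1) end fields)))))

;; "0041;LATIN CAPITAL LETTER A;Lu;..." => (65 . Lu)
(define (line->entry line)
  (let ((fields (split-fields line #\;)))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Collapse entries sorted by code point into (category lo hi) ranges,
;; extending the previous range when the category matches and the
;; code point is adjacent.
(define (merge-ranges entries)
  (let loop ((entries entries) (ranges '()))
    (if (null? entries)
        (reverse ranges)
        (let ((code (caar entries))
              (cat (cdar entries)))
          (if (and (pair? ranges)
                   (eq? cat (car (car ranges)))
                   (= code (+ 1 (caddr (car ranges)))))
              (loop (cdr entries)
                    (cons (list cat (cadr (car ranges)) code)
                          (cdr ranges)))
              (loop (cdr entries)
                    (cons (list cat code code) ranges)))))))

;; Category -> character classes. Only the Lu row is taken from the
;; quoted message; the other rows are assumptions for illustration.
(define category-classes
  '((Lu upper alpha alphanum graph)
    (Ll lower alpha alphanum graph)
    (Nd digit alphanum graph)))

(define (classes-of cat)
  (let ((entry (assq cat category-classes)))
    (if entry (cdr entry) '())))

(define sample-lines
  '("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;"
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"))

;; (merge-ranges (map line->entry sample-lines))
;;   => ((Lu 65 66) (Ll 97 97))
```

Each resulting range can then be added to every class returned by
(classes-of category).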

Full Unicode character classes and case handling
are already in the utf8 egg.

These are not yet integrated with irregex because
irregex is written to be portable across any Scheme,
and so it uses its own char-set implementation. When
R7RS is released I'll re-package irregex accordingly.

Unfortunately, while the utf8 char-sets are very
compact, the DFA built from large, sparse Unicode
char-sets is quite large. Eventually I'd like to write
a non-backtracking NFA regex matcher that only
compiles to a DFA when you really need the speed.
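
To make the idea concrete, here is a toy sketch of non-backtracking
matching: the matcher tracks the set of NFA states reached so far, and
caches each (state-set, char) transition the first time it is computed,
so DFA states only come into existence on demand. It is not irregex code.
The NFA is hand-built for the pattern (a|b)*abb, has no epsilon
transitions, and matches whole strings only; all names (nfa, set-insert,
step, cached-step, nfa-match?) are made up for the example.

```scheme
;; NFA for (a|b)*abb: each entry is (state (char target ...) ...).
(define nfa
  '((0 (#\a 0 1) (#\b 0))
    (1 (#\b 2))
    (2 (#\b 3))))
(define nfa-accept 3)

;; State sets are sorted, duplicate-free lists so equal? can compare them.
(define (set-insert x set)
  (cond ((null? set) (list x))
        ((= x (car set)) set)
        ((< x (car set)) (cons x set))
        (else (cons (car set) (set-insert x (cdr set))))))

;; Targets reachable from a single STATE on CH.
(define (nfa-next state ch)
  (let* ((edges (assv state nfa))
         (targets (and edges (assv ch (cdr edges)))))
    (if targets (cdr targets) '())))

;; Advance a whole set of states on CH at once (no backtracking).
(define (step states ch)
  (let loop ((states states) (result '()))
    (if (null? states)
        result
        (loop (cdr states)
              (let add ((targets (nfa-next (car states) ch))
                        (acc result))
                (if (null? targets)
                    acc
                    (add (cdr targets)
                         (set-insert (car targets) acc))))))))

;; Lazily built DFA: an alist from (state-set . char) to next state-set.
(define dfa-cache '())

(define (cached-step states ch)
  (let* ((key (cons states ch))
         (hit (assoc key dfa-cache)))
    (if hit
        (cdr hit)
        (let ((next (step states ch)))
          (set! dfa-cache (cons (cons key next) dfa-cache))
          next))))

;; #t if the whole of STR matches (a|b)*abb.
(define (nfa-match? str)
  (let loop ((i 0) (states '(0)))
    (if (= i (string-length str))
        (and (memv nfa-accept states) #t)
        (loop (+ i 1)
              (cached-step states (string-ref str i))))))

;; (nfa-match? "babb") => #t
;; (nfa-match? "ab")   => #f
```

Only the transitions actually visited end up in the cache, which is the
point when the full DFA for sparse Unicode char-sets would be huge. A real
matcher would use a hash table rather than an alist for the cache.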

In the meantime, a fast lookup table for the
script of a character would be nice, and this could
be used to tokenize a string of mixed-language text.
I thought I had this, but I can't seem to find it anywhere...
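
Until the real thing turns up, a rough sketch of what such a table might
look like: a sorted vector of (lo hi script) ranges searched by binary
search, plus a tokenizer that groups consecutive code points by script.
This is not the lost code. The table is a tiny hand-written excerpt (a
real one would presumably be generated from the Unicode Scripts.txt data
file), the names script-table, char-script and tokenize-by-script are
made up, and the input is assumed to be a list of already-decoded code
points (e.g. obtained with the utf8 egg).

```scheme
;; Hypothetical, heavily abbreviated table: sorted, non-overlapping
;; (lo hi script) ranges.
(define script-table
  (vector '(#x0041 #x005A latin)
          '(#x0061 #x007A latin)
          '(#x03B1 #x03C9 greek)
          '(#x0410 #x044F cyrillic)
          '(#x3041 #x3096 hiragana)
          '(#x30A1 #x30FA katakana)
          '(#x4E00 #x9FFF han)))

;; Binary search for the range containing CODE; 'unknown if none does.
(define (char-script code)
  (let loop ((lo 0) (hi (- (vector-length script-table) 1)))
    (if (> lo hi)
        'unknown
        (let* ((mid (quotient (+ lo hi) 2))
               (range (vector-ref script-table mid)))
          (cond ((< code (car range)) (loop lo (- mid 1)))
                ((> code (cadr range)) (loop (+ mid 1) hi))
                (else (caddr range)))))))

;; Group a list of code points into runs of the same script.
;; Returns a list of (script code ...) in input order.
(define (tokenize-by-script codes)
  (let loop ((codes codes) (runs '()))
    (cond ((null? codes)
           ;; Runs and their code points were accumulated in reverse.
           (reverse (map (lambda (run)
                           (cons (car run) (reverse (cdr run))))
                         runs)))
          ((and (pair? runs)
                (eq? (char-script (car codes)) (car (car runs))))
           ;; Same script as the current run: extend it.
           (loop (cdr codes)
                 (cons (cons (car (car runs))
                             (cons (car codes) (cdr (car runs))))
                       (cdr runs))))
          (else
           ;; Script changed: start a new run.
           (loop (cdr codes)
                 (cons (list (char-script (car codes)) (car codes))
                       runs))))))

;; (tokenize-by-script '(#x0061 #x0062 #x0434 #x0430 #x65E5))
;;   => ((latin 97 98) (cyrillic 1076 1072) (han 26085))
```

Spaces and punctuation come out as 'unknown runs here; a real tokenizer
would probably want to treat the Common and Inherited scripts specially.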
--
Alex