Re: new modules for Unicode normalization
From: Bruno Haible
Subject: Re: new modules for Unicode normalization
Date: Sat, 21 Feb 2009 16:07:53 +0100
User-agent: KMail/1.9.9
Hi Pádraig, Bo,
On 2008-05-08, when I mentioned the possibility of having a filter program
that reads from standard input and writes the canonicalized output to
standard output, you liked this idea:
<http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00062.html>
<http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00063.html>
I have now added to gnulib a module for Unicode normalization of streams of
Unicode characters. It's called 'uninorm/filter'; the API is declared in
uninorm.h:
-------------------------------------------------------------------------------
/* Normalization of a stream of Unicode characters.

   A "stream of Unicode characters" is essentially a function that accepts an
   ucs4_t argument repeatedly, optionally combined with a function that
   "flushes" the stream.  */

/* Data type of a stream of Unicode characters that normalizes its input
   according to a given normalization form and passes the normalized character
   sequence to the encapsulated stream of Unicode characters.  */
struct uninorm_filter;

/* Create and return a normalization filter for Unicode characters.
   The pair (stream_func, stream_data) is the encapsulated stream.
   stream_func (stream_data, uc) receives the Unicode character uc
   and returns 0 if successful, or -1 with errno set upon failure.
   Return the new filter, or NULL with errno set upon failure.  */
extern struct uninorm_filter *
       uninorm_filter_create (uninorm_t nf,
                              int (*stream_func) (void *stream_data, ucs4_t uc),
                              void *stream_data);

/* Stuff a Unicode character into a normalizing filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream.
   Return 0 if successful, or -1 with errno set upon failure.
   Note! If after calling this function, additional characters are written
   into the filter, the resulting character sequence in the encapsulated stream
   will not necessarily be normalized.  */
extern int
       uninorm_filter_flush (struct uninorm_filter *filter);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream, then close and free the filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_free (struct uninorm_filter *filter);
-------------------------------------------------------------------------------
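To make the "(stream_func, stream_data)" pair concrete, here is a minimal sketch of an encapsulated stream that just collects the characters it receives into an array. It deliberately does not depend on gnulib: the ucs4_t typedef is a stand-in for the real type, and the names ucs4_sink and ucs4_sink_put are made up for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for gnulib's ucs4_t (assumption: a 32-bit unsigned integer).
   Real code would get this type by including uninorm.h instead.  */
typedef uint32_t ucs4_t;

/* Hypothetical stream_data: a fixed-capacity array of code points.  */
struct ucs4_sink
{
  ucs4_t *items;    /* caller-provided storage */
  size_t count;     /* number of characters stored so far */
  size_t capacity;  /* number of slots in items */
};

/* Has the shape expected for stream_func: receives one Unicode character,
   returns 0 if successful, or -1 with errno set upon failure.  */
static int
ucs4_sink_put (void *stream_data, ucs4_t uc)
{
  struct ucs4_sink *sink = stream_data;

  if (sink->count == sink->capacity)
    {
      errno = ENOMEM;
      return -1;
    }
  sink->items[sink->count++] = uc;
  return 0;
}
```

A function of this shape, together with a pointer to its state, is what would be passed as the last two arguments of uninorm_filter_create.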
With this, you can easily create a program that reads UTF-8 from stdin and
outputs it as canonicalized UTF-8 on stdout:
  - Create a "stream" that takes a Unicode character and outputs it to
    stdout. (Gnulib module 'unistr/u8-uctomb'.)
  - Wrap a Unicode normalizing filter around it. (Gnulib module
    'uninorm/filter'.)
  - Feed it with Unicode characters from standard input. (Gnulib module
    'unistr/u8-mbtouc'.)
I would love to see such a program in coreutils. But I am not a coreutils
maintainer.
Bruno