
Re: new modules for Unicode normalization


From: Bruno Haible
Subject: Re: new modules for Unicode normalization
Date: Sat, 21 Feb 2009 16:07:53 +0100
User-agent: KMail/1.9.9

Hi Pádraig, Bo,

On 2008-05-08, when I mentioned the possibility of having a filter program
that reads from standard input and writes the canonicalized output to
standard output, you liked this idea:
  <http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00062.html>
  <http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00063.html>

I have now added to gnulib a module for Unicode normalization of streams of
Unicode characters. It's called 'uninorm/filter'; the API is declared in
uninorm.h:

-------------------------------------------------------------------------------
/* Normalization of a stream of Unicode characters.

   A "stream of Unicode characters" is essentially a function that accepts an
   ucs4_t argument repeatedly, optionally combined with a function that
   "flushes" the stream.  */

/* Data type of a stream of Unicode characters that normalizes its input
   according to a given normalization form and passes the normalized character
   sequence to the encapsulated stream of Unicode characters.  */
struct uninorm_filter;

/* Create and return a normalization filter for Unicode characters.
   The pair (stream_func, stream_data) is the encapsulated stream.
   stream_func (stream_data, uc) receives the Unicode character uc
   and returns 0 if successful, or -1 with errno set upon failure.
   Return the new filter, or NULL with errno set upon failure.  */
extern struct uninorm_filter *
       uninorm_filter_create (uninorm_t nf,
                              int (*stream_func) (void *stream_data, ucs4_t uc),
                              void *stream_data);

/* Stuff a Unicode character into a normalizing filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream.
   Return 0 if successful, or -1 with errno set upon failure.
   Note! If after calling this function, additional characters are written
   into the filter, the resulting character sequence in the encapsulated stream
   will not necessarily be normalized.  */
extern int
       uninorm_filter_flush (struct uninorm_filter *filter);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream, then close and free the filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_free (struct uninorm_filter *filter);
-------------------------------------------------------------------------------

With this, you can easily create a program that reads UTF-8 from stdin and
outputs it as canonicalized UTF-8 on stdout:
  - Create a "stream" that takes a Unicode character and outputs it to
    stdout. (Gnulib module 'unistr/u8-uctomb'.)
  - Wrap a Unicode normalizing filter around it. (Gnulib module
    'uninorm/filter'.)
  - Feed it with Unicode characters from standard input. (Gnulib module
    'unistr/u8-mbtouc'.)

I would love to see such a program in coreutils. But I am not a coreutils
maintainer.

Bruno
