bug-gnulib

unicode string functions


From: Bruno Haible
Subject: unicode string functions
Date: Tue, 2 Jan 2007 22:39:18 +0100
User-agent: KMail/1.9.1

Hi,

Since 2000 the need for elementary functions on Unicode strings has been
apparent and increasing:
  - some utility functions exist in GNOME's glib,
  - clisp, gettext, emacs, python, ... need programmatic access to the
    character name tables,
  - gettext's linebreak module relies on several utility functions for
    Unicode strings,
  - any program printing line + column numbers of characters in a file
    needs to consider the width of the character, e.g. libiconv does this,
  - clisp would like to have Unicode regular expressions that work even
    when the locale is in ISO-8859-1 encoding,
  ...

Since 2001 I've been working on a library covering such topics. But two issues
kept me from releasing this library:
  - It should be more lightweight than IBM's ICU library. It should contain
    many functions, and support all 3 kinds of in-memory representation (UTF-8,
    UTF-16 and UTF-32), but without installing a multi-megabyte library.
    Someone wanting 2 or 3 Unicode string functions does not want to link
    with a megabyte-sized library.
  - The basic character type, ucs4_t, is an alias of uint32_t, but
    one could not assume that <stdint.h> exists.

Gnulib solves both issues: 1. by providing an infrastructure for a source-code
library, 2. by providing a package-independent <stdint.h>.
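
To illustrate the second point, here is a minimal sketch of what the
character type looks like once <stdint.h> can be relied upon. Only ucs4_t
and uint32_t come from the text above; the helper is_valid_code_point is a
hypothetical name used for illustration and is not one of the announced
functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* The basic character type: one Unicode code point.  */
typedef uint32_t ucs4_t;

/* Hypothetical helper (an assumption, not part of the announced modules):
   return true if UC is a Unicode scalar value, i.e. at most U+10FFFF
   and not in the surrogate range U+D800..U+DFFF.  */
static bool
is_valid_code_point (ucs4_t uc)
{
  return uc < 0x110000 && !(uc >= 0xD800 && uc <= 0xDFFF);
}
```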

These data types are actually suitable for gnulib, since they are basic
and project independent.

I'll therefore add a set of modules for Unicode text handling.

The choice of the in-memory representation (UTF-8, UTF-16 or UTF-32) is up to
the application; libunistring supports all three equally.
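
As a rough illustration of what the three representations mean in memory,
the sketch below encodes a single code point in each of them. The function
names (sketch_encode_utf8 and so on) are hypothetical and chosen for this
example only; they are not the interface of the modules.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;

/* Hypothetical helpers, for illustration only.  Each stores the encoding
   of UC in BUF and returns the number of code units written, or 0 if UC
   is not a Unicode scalar value.  */

static int
sketch_is_scalar (ucs4_t uc)
{
  return uc < 0x110000 && !(uc >= 0xD800 && uc <= 0xDFFF);
}

/* UTF-8: 1 to 4 units of 8 bits.  */
static size_t
sketch_encode_utf8 (uint8_t buf[4], ucs4_t uc)
{
  if (!sketch_is_scalar (uc))
    return 0;
  if (uc < 0x80)
    {
      buf[0] = (uint8_t) uc;
      return 1;
    }
  if (uc < 0x800)
    {
      buf[0] = (uint8_t) (0xC0 | (uc >> 6));
      buf[1] = (uint8_t) (0x80 | (uc & 0x3F));
      return 2;
    }
  if (uc < 0x10000)
    {
      buf[0] = (uint8_t) (0xE0 | (uc >> 12));
      buf[1] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
      buf[2] = (uint8_t) (0x80 | (uc & 0x3F));
      return 3;
    }
  buf[0] = (uint8_t) (0xF0 | (uc >> 18));
  buf[1] = (uint8_t) (0x80 | ((uc >> 12) & 0x3F));
  buf[2] = (uint8_t) (0x80 | ((uc >> 6) & 0x3F));
  buf[3] = (uint8_t) (0x80 | (uc & 0x3F));
  return 4;
}

/* UTF-16: 1 unit of 16 bits, or a surrogate pair above U+FFFF.  */
static size_t
sketch_encode_utf16 (uint16_t buf[2], ucs4_t uc)
{
  if (!sketch_is_scalar (uc))
    return 0;
  if (uc < 0x10000)
    {
      buf[0] = (uint16_t) uc;
      return 1;
    }
  uc -= 0x10000;
  buf[0] = (uint16_t) (0xD800 | (uc >> 10));    /* high surrogate */
  buf[1] = (uint16_t) (0xDC00 | (uc & 0x3FF));  /* low surrogate */
  return 2;
}

/* UTF-32: always exactly 1 unit of 32 bits.  */
static size_t
sketch_encode_utf32 (uint32_t buf[1], ucs4_t uc)
{
  if (!sketch_is_scalar (uc))
    return 0;
  buf[0] = uc;
  return 1;
}
```

The trade-off is visible in the return values: UTF-8 and UTF-16 use a
variable number of units per character, while UTF-32 always uses one unit
at the cost of four bytes per character.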

The modules are organized in the following directories:

  unistr     elementary string functions
  uniconv    conversion from/to legacy encodings
  unistdio   formatted output to strings
  uniname    character names
  uniwidth   string width when using nonproportional fonts
  unilbrk    line breaking algorithm
  unictype   character classification and properties
  --
  unicase    case folding
  unicomp    composition and decomposition
  uniregex   regular expressions
  unibidi    bidirectional reordering (use FriBidi in the meantime)

The last four are planned, not yet implemented.

Copyright is FSF and LGPL, as usual.

Bruno



