bug-gnulib

Re: [Bug-gnulib] addition: linebreak.h, linebreak.c


From: Bruno Haible
Subject: Re: [Bug-gnulib] addition: linebreak.h, linebreak.c
Date: Mon, 7 Apr 2003 17:19:13 +0200 (CEST)

Paul Eggert writes:

> > /* Determine number of column positions required for UC. */
> > extern int uc_width (unsigned int uc, const char *encoding);
> 
> Is UC a Unicode code position?  Perhaps it should be typedefed, both for
> clarity and for efficiency on weird hosts?  E.g.:
> 
>   typedef int_fast32_t unicode_char;
>   int uc_width (unicode_char uc, const char *encoding);

'uc' means a Unicode code position (a.k.a. "Unicode character").
I have it typedefed in the libunistring package which is under
development. Until it is ready for release, I wish to keep linebreak.h
as slim as possible.

I generally assume that 'unsigned int' serves the same purpose as
'uint32_t'. Do you know a platform where 'unsigned int' isn't usable?
(Excluding 16-bit platforms!)

> Won't it be faster if we add an extra function that converts ENCODING
> to a small integer or a pointer that represents the encoding, and pass
> that small integer or pointer to uc_width instead of passing ENCODING?

If the possible encodings were a constant, stable set, I would agree.
But the crux with internationalization is that every few months,
someone adds a new encoding. And I don't want to change the header
file (-> and have everyone recompile their code) once a new encoding has
to be added.
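
To illustrate the point, here is a minimal sketch of how an
implementation file might interpret the ENCODING string. The function
name encoding_is_cjk and the list of names are assumptions made for
this example, not part of linebreak.h. Supporting a new encoding only
touches the table in the .c file; the header and its callers stay
unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of encoding names that get special treatment.
   The entries are illustrative; extending it changes no header.  */
static const char *const cjk_encodings[] =
  {
    "EUC-JP", "GB2312", "GBK", "EUC-TW", "BIG5", "EUC-KR", "CP949", "JOHAB"
  };

/* Return 1 if ENCODING names one of the table entries, 0 otherwise.
   (Hypothetical helper, exact case-sensitive match only.)  */
int
encoding_is_cjk (const char *encoding)
{
  size_t i;

  for (i = 0; i < sizeof cjk_encodings / sizeof cjk_encodings[0]; i++)
    if (strcmp (encoding, cjk_encodings[i]) == 0)
      return 1;
  return 0;
}
```

The cost is a handful of string comparisons per call, which is the
price paid for keeping the interface stable.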

> > /* Determine number of column positions required for first N units
> >    (or fewer if S ends before this) in S.  */
> > extern int u8_width (const unsigned char *s, size_t n, const char *encoding);
> > extern int u16_width (const unsigned short *s, size_t n, const char *encoding);
> > extern int u32_width (const unsigned int *s, size_t n, const char *encoding);
> 
> I was confused by the prefixes u8, u16, and u32.  At first I thought
> they meant "unsigned integer of width 8 bits", etc.

Yes, it's a designator for the element type of the string 's'. The
libunistring documentation will explain it better.

> How about changing the prefixes to utf8, utf16, and ucs4, respectively?

It'd be possible, but what's the gain?

> Also, how about replacing
> 
> unsigned char  -> utf8_int
> unsigned short -> utf16_int
> unsigned int   -> ucs4_int
> 
> where we have:
> 
> typedef uint_least8_t utf8_int;
> typedef uint_least16_t utf16_int;
> typedef uint_least32_t ucs4_int;

I do assume that 'unsigned char' has at least 8 bits, 'unsigned short'
has at least 16 bits, and 'unsigned int' has at least 32 bits. Again,
I'm not aiming at 16-bit platforms of the 1980s.
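
These assumptions can be checked at compile time. A minimal sketch,
assuming a C11 compiler (static_assert from <assert.h>); the helper
largest_code_point is hypothetical and only shows that every Unicode
code position fits in 'unsigned int' once the checks pass:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT, USHRT_MAX, UINT_MAX */

/* Fail the build on platforms that violate the stated assumptions.  */
static_assert (CHAR_BIT >= 8,
               "unsigned char must have at least 8 bits");
static_assert (USHRT_MAX >= 0xFFFFu,
               "unsigned short must have at least 16 bits");
static_assert (UINT_MAX >= 0xFFFFFFFFu,
               "unsigned int must have at least 32 bits");

/* Hypothetical helper: the largest Unicode code position, which needs
   21 bits and therefore fits in a 32-bit 'unsigned int'.  */
unsigned int
largest_code_point (void)
{
  return 0x10FFFFu;
}
```

On a 16-bit 'int' platform the third check stops the compilation
instead of letting values be silently truncated.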

> What do the functions do if the input is invalid, e.g. an octet sequence
> that is not valid UTF-8?

In the internal processing, the same as the u8_mbtouc routine does:
Substitute 0xfffd for the malformed octet [sequence] and continue
processing.
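
A minimal sketch of this substitution policy follows. It is not the
u8_mbtouc code; the name sketch_u8_decode, its signature, and the
choice to consume exactly one byte on error are assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one character from the N > 0 bytes at S.  Store the code
   position in *PUC and return the number of bytes consumed.  For a
   malformed or truncated sequence, store 0xfffd and consume one byte
   (an assumption of this sketch), so the caller can keep going.  */
int
sketch_u8_decode (unsigned int *puc, const unsigned char *s, size_t n)
{
  unsigned char c = s[0];
  unsigned int uc = 0, min = 0;
  size_t len = 0, i;

  assert (n > 0);
  if (c < 0x80)
    {
      *puc = c;
      return 1;
    }
  else if (c >= 0xc2 && c < 0xe0)   /* 2-byte lead; 0xc0, 0xc1 are overlong */
    { len = 2; uc = c & 0x1f; min = 0x80; }
  else if ((c & 0xf0) == 0xe0)      /* 3-byte lead */
    { len = 3; uc = c & 0x0f; min = 0x800; }
  else if (c >= 0xf0 && c < 0xf5)   /* 4-byte lead */
    { len = 4; uc = c & 0x07; min = 0x10000; }

  if (len != 0 && len <= n)
    {
      for (i = 1; i < len; i++)
        {
          if ((s[i] & 0xc0) != 0x80)   /* not a continuation byte */
            break;
          uc = (uc << 6) | (s[i] & 0x3f);
        }
      /* Accept only complete, shortest-form, non-surrogate values
         within the Unicode range.  */
      if (i == len && uc >= min && uc <= 0x10ffff
          && !(uc >= 0xd800 && uc <= 0xdfff))
        {
          *puc = uc;
          return (int) len;
        }
    }

  *puc = 0xfffd;   /* replacement character for the malformed octet */
  return 1;
}
```

A caller loops over the string, advancing by the return value, and
never has to stop at invalid input.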

For the mbs_width_linebreaks function, the result will be that line
breaks may be inserted between consecutive bytes of a malformed octet
sequence. But the byte sequence of the multibyte string itself is not
changed. (The multibyte -> Unicode conversion is not followed by the
reverse conversion Unicode -> multibyte. Therefore the 0xfffd
characters will not be perceived as such.)

Bruno



