[Pika-dev] string foo


From: Tom Lord
Subject: [Pika-dev] string foo
Date: Mon, 17 May 2004 12:38:14 -0700 (PDT)

I've been reviewing the ustr patches and starting work on them.

Um... hmm... I've basically rewritten ustr.c (see
hackerlab--devo--1.0--patch-69).  It was very helpful having it
roughed out but I've reimplemented everything and tweaked it a bit.

* an api note on ustr_length_in_encoding

I changed the meaning of the return value of ustr_length_in_encoding
and friends.   They return -1 for a string truncated in the middle of a
character and -2 for a string that contains characters which cannot
be represented in the target encoding.
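
A minimal sketch of how the two failure codes could be told apart.
The function below is hypothetical and is not the hackerlab signature:
it assumes a native-endian UTF-16 source and an 8-bit Latin-1 target.
Only the -1 / -2 convention is taken from the note above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ustr_length_in_encoding.
 *
 * Returns the length, in Latin-1 coding values, of a native-endian
 * UTF-16 string of `len` coding values, or:
 *   -1  the string is truncated in the middle of a character
 *   -2  some character cannot be represented in Latin-1
 */
static long
sketch_length_in_latin1 (const uint16_t * str, size_t len)
{
  size_t pos = 0;
  long count = 0;

  assert (str != NULL || len == 0);

  while (pos < len)
    {
      uint16_t c = str[pos++];

      if (c >= 0xD800 && c <= 0xDBFF)
        {
          /* High surrogate: the low half of the pair must follow. */
          if (pos == len)
            return -1;
          /* A complete pair is a code point above U+FFFF, which
           * Latin-1 cannot hold.  Malformed pairs are lumped in
           * here for brevity. */
          return -2;
        }

      if (c > 0xFF)
        return -2;

      ++count;
    }

  return count;
}
```

A caller would check for a negative result before using the value as
a buffer size.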


* an api note on ustr_fill_*

I haven't worked on the file containing the fill functions yet, but I
did have to tweak them due to an API change in ustr.c.   That is how I
noticed the ustr_fill* functions.

Those can't possibly be right.  A conceptual `fill' operation could
change the length of a unicode string in several of the encoding
forms.  But ustr_* functions can't change the length of their
argument.   I think these have to go. 
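
To make the length problem concrete, here is a small sketch using
UTF-8 as the example encoding.  The helper names are hypothetical and
are not part of ustr.c; the code only computes how many bytes a fill
would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed to encode one code point. */
static size_t
sketch_utf8_width (uint32_t cp)
{
  assert (cp <= 0x10FFFF);

  if (cp < 0x80)
    return 1;
  if (cp < 0x800)
    return 2;
  if (cp < 0x10000)
    return 3;
  return 4;
}

/* Bytes needed to hold `n_chars` copies of `cp` in UTF-8. */
static size_t
sketch_utf8_fill_size (uint32_t cp, size_t n_chars)
{
  return n_chars * sketch_utf8_width (cp);
}
```

A three-character ASCII string occupies 3 bytes in UTF-8, but filling
those same three character positions with U+00E9 needs 6 bytes, so the
fill cannot be done without changing the length of the string.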


* an api note on the unit of `length' parameters

There is some awkwardness in the very low level routines (the
uni_*_iscan and uni_*_scan routines, for example).   They come 
in two flavors where the two flavors treat string lengths
differently.   For example:

        uni_utf16_scan (...)

                --- scan one character from a UTF16 string, 
                    incrementing a string index value

                    The string index is measured in coding values
                    (16-bit integers)



        uni_utf16be_iscan (...)

                --- scan one character from a (possibly unaligned)
                    UTF16be string, incrementing a string index value

                    The string index is measured in bytes.


The inconsistency actually makes sense in the low-level routines.
The iscan and iput routines are there for when you're dealing with raw
data (as in an I/O buffer) for which alignment isn't assured and most
of the surrounding code is just seeing a stream of bytes.   The scan
and put routines are for aligned unicode strings which surrounding
code is more likely to see as an array of coding values.
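
The two flavors can be sketched side by side.  These are hypothetical
routines with made-up signatures, not the real uni_utf16_scan and
uni_utf16be_iscan; they only illustrate the difference in how the
string index is counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* scan flavor: `str` is an aligned array of native-endian 16-bit
 * coding values; *pos and len count coding values.
 * Returns the code point, or -1 on a truncated or malformed sequence. */
static long
sketch_utf16_scan (const uint16_t * str, size_t * pos, size_t len)
{
  uint32_t hi, lo;

  assert (str && pos && *pos < len);

  hi = str[(*pos)++];
  if (hi < 0xD800 || hi > 0xDFFF)
    return (long) hi;
  if (hi > 0xDBFF || *pos >= len)
    return -1;
  lo = str[*pos];
  if (lo < 0xDC00 || lo > 0xDFFF)
    return -1;
  ++*pos;
  return (long) (0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

/* iscan flavor: `buf` is raw, possibly unaligned big-endian data;
 * *pos and len count bytes, and the data is only read bytewise. */
static long
sketch_utf16be_iscan (const uint8_t * buf, size_t * pos, size_t len)
{
  uint32_t hi, lo;

  assert (buf && pos && *pos < len);

  if (len - *pos < 2)
    return -1;
  hi = ((uint32_t) buf[*pos] << 8) | buf[*pos + 1];
  *pos += 2;
  if (hi < 0xD800 || hi > 0xDFFF)
    return (long) hi;
  if (hi > 0xDBFF || len - *pos < 2)
    return -1;
  lo = ((uint32_t) buf[*pos] << 8) | buf[*pos + 1];
  if (lo < 0xDC00 || lo > 0xDFFF)
    return -1;
  *pos += 2;
  return (long) (0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}
```

Scanning a surrogate pair advances the index by 2 in the first routine
and by 4 in the second, although both consume the same character.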

In the ustr_* functions, I want to:

        ~ still support explicit-endian encoding forms and
          unaligned strings

        ~ nevertheless, emphasize the string view rather than the
          byte view

Accordingly, I'm changing the ustr api so that the *_n forms of
functions accept lengths expressed in coding values rather than bytes.

That creates a slight impedance mismatch between ustr_* and the
low-level uni_* functions --- there's a need to sometimes convert
between byte-lengths and coding value lengths.
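
The conversion itself is simple arithmetic.  A sketch, assuming a
hypothetical encoding enum (the real hackerlab type and names are not
shown in this message):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding tags, for illustration only. */
enum sketch_encoding
{
  sketch_utf8,
  sketch_utf16,
  sketch_utf16be,
  sketch_utf16le,
  sketch_utf32
};

/* Size in bytes of one coding value in the given encoding. */
static size_t
sketch_coding_value_size (enum sketch_encoding enc)
{
  switch (enc)
    {
    case sketch_utf8:
      return 1;
    case sketch_utf16:
    case sketch_utf16be:
    case sketch_utf16le:
      return 2;
    case sketch_utf32:
      return 4;
    }
  assert (0);                   /* not reached for valid tags */
  return 1;
}

/* Coding-value length (the ustr_* view) to byte length (the iscan view). */
static size_t
sketch_values_to_bytes (enum sketch_encoding enc, size_t n_values)
{
  return n_values * sketch_coding_value_size (enc);
}

/* Byte length to coding-value length; the byte count must be a whole
 * number of coding values. */
static size_t
sketch_bytes_to_values (enum sketch_encoding enc, size_t n_bytes)
{
  size_t width = sketch_coding_value_size (enc);

  assert (n_bytes % width == 0);
  return n_bytes / width;
}
```

A ustr-level routine taking a length in coding values would call
something like sketch_values_to_bytes before handing the buffer to a
byte-indexed iscan routine.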


* an implementation note

Jao: use fewer macros.   The main goal of the
macrology-underlying-inline-functions in the low-level functions was
to avoid the need for hairy macrology in the higher level functions.   
In the rewrite of ustr.c I essentially don't use macros at all.


-t
