Re: uniq i18n implementation
From: Pádraig Brady
Subject: Re: uniq i18n implementation
Date: Thu, 10 Aug 2006 22:24:42 +0100
User-agent: Mozilla Thunderbird 1.0.8 (X11/20060502)

Paul Eggert wrote:
>>>>Using strcoll is inefficient anyway
>>>
>>>Don't we know it! If we can avoid it, we'd like to.
>>
>>Well, the mbstowcs+wcscoll solution I presented
>>should be equivalent to strcoll on any platform,
>>and it's much faster in my tests.
>
>
> That's good to know, though I'm puzzled as to why it's true. For a
> single comparison, can't strcoll typically return an answer without
> examining all the input, and wouldn't that be faster than
> mbstowcs+wcscoll?
>
> But if it is true, perhaps we should rewrite memcoll to use the
> mbstowcs+wcscoll combination as well.
I omitted a test case from my performance runs:
lines of the same length with random data
(where strcoll can break out early).
I'll run that and comment further.
I was also using a string length comparison
shortcut on the wide strings. I'm unsure whether
this is valid on all platforms.
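For reference, the mbstowcs+wcscoll combination being discussed can be
sketched roughly as below. This is only an illustration, not the code
that was benchmarked: the names to_wide and wide_coll_cmp are
hypothetical, the byte-order fallback for unconvertible input is an
assumption, and calling mbstowcs with a NULL destination to get the
required length is the POSIX behaviour rather than something ISO C
guarantees.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string to a
   freshly allocated wide string using the current locale.
   Returns NULL on an invalid sequence or on allocation failure.  */
static wchar_t *
to_wide (char const *s)
{
  size_t n = mbstowcs (NULL, s, 0);     /* POSIX: count wide chars needed */
  if (n == (size_t) -1)
    return NULL;
  wchar_t *w = malloc ((n + 1) * sizeof *w);
  if (w && mbstowcs (w, s, n + 1) == (size_t) -1)
    {
      free (w);
      w = NULL;
    }
  return w;
}

/* Hypothetical comparison: collate with wcscoll when both strings
   convert cleanly, otherwise fall back to plain byte comparison
   (an assumption made for this sketch only).  */
int
wide_coll_cmp (char const *a, char const *b)
{
  assert (a && b);
  wchar_t *wa = to_wide (a);
  wchar_t *wb = to_wide (b);
  int r = (wa && wb) ? wcscoll (wa, wb) : strcmp (a, b);
  free (wa);
  free (wb);
  return r;
}
```

The sketch deliberately leaves out the wide-string length shortcut,
since whether that is valid on all platforms is the open question.
A caller would run setlocale (LC_ALL, "") first so that the
conversion and the collation both follow the user's locale.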
>>>>but it probably is possible in ICU?
>>>
>>>Sorry, don't know.
>>
>>I wonder could we add this as a dependency?
>
>
> You mean, ship ICU code? Or depend on it already being installed?
Probably ship it?
> Sorry, I'm not familiar with the ICU code. Is it free software and is
> it well maintained? Where else is it being used, outside ICU itself?
I'm not familiar with it myself, but note that
it's used for various things in Python, Mozilla, OpenOffice, ...
>>Also I don't agree with splitting entities into
>>valid multibyte ranges and "C" for the rest.
>>That is probably not what the user wants the data interpreted as,
>>and I think (at least for uniq which I've thought about),
>>that it's just best to treat the whole entity as "C"
>>if there are invalid multibyte sequences in the entity.
>
>
> We can't adopt this approach in general, since it would mean that our
> comparison operation could return inconsistent answers. Suppose "Y"
> has an invalid byte sequence but "X" and "Z" are valid. Then we might
> have "X" < "Y" < "Z" (using C-locale comparison), but "Z" < "X" (using
> some other locale's comparison). This will lead to inconsistencies,
> which will be hard to document and will confuse users.
Garbage in, garbage out.
As for confusing users, my solution was to print
a warning indicating the invalid input.
> Worse, it can
> even lead to buffer overruns: e.g., qsort has undefined behavior if
> you pass it a comparison function that is not a total order.
Thanks for pointing that out.
I'll look into that.
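To make the total-order concern concrete, here is a small sketch of
the cycle Paul describes. Everything in it is a stand-in chosen so the
result does not depend on any installed locale: "valid" means pure
ASCII, the "locale" comparison is a case-insensitive ASCII compare,
and the names is_valid, fold_cmp and mixed_cmp are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-in for "valid in the locale's encoding": pure ASCII only.  */
static int
is_valid (char const *s)
{
  for (; *s; s++)
    if ((unsigned char) *s >= 0x80)
      return 0;
  return 1;
}

/* Stand-in for a locale's collation: case-insensitive ASCII order.  */
static int
fold_cmp (char const *a, char const *b)
{
  for (;; a++, b++)
    {
      int ca = tolower ((unsigned char) *a);
      int cb = tolower ((unsigned char) *b);
      if (ca != cb || ca == 0)
        return (ca > cb) - (ca < cb);
    }
}

/* The approach under discussion: if either string is invalid, compare
   the whole pair in byte ("C") order; otherwise use the locale order.  */
int
mixed_cmp (char const *a, char const *b)
{
  assert (a && b);
  if (is_valid (a) && is_valid (b))
    return fold_cmp (a, b);
  return strcmp (a, b);
}
```

With X = "B", Y = "C\xff" (invalid under the stand-in rule) and
Z = "a", mixed_cmp orders X before Y and Y before Z by bytes, yet
orders Z before X by the folded comparison. That is a cycle, so the
function is not a total order and is not safe to hand to qsort.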
cheers,
Pádraig.