Re: uniq i18n implementation

bug-coreutils

From:	Pádraig Brady
Subject:	Re: uniq i18n implementation
Date:	Thu, 10 Aug 2006 22:24:42 +0100
User-agent:	Mozilla Thunderbird 1.0.8 (X11/20060502)

Paul Eggert wrote:
>>>>Using strcoll is inefficient anyway
>>>
>>>Don't we know it!  If we can avoid it, we'd like to.
>>
>>Well, the mbstowcs+wcscoll solution I presented
>>should be equivalent to strcoll on any platform,
>>and it's much faster in my tests.
> 
> 
> That's good to know, though I'm puzzled as to why it's true.  For a
> single comparison, can't strcoll typically return an answer without
> examining all the input, and wouldn't that be faster than
> mbstowcs+wcscoll?
> 
> But if it is true, perhaps we should rewrite memcoll to use the
> mbstowcs+wcscoll combination as well.

My performance runs missed one test case:
lines of the same length containing random data
(where strcoll can break out early).
I'll run that and report back.

I was also using a string-length comparison
shortcut on the wide strings. I'm unsure whether
this is valid on all platforms.
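
To make the approach concrete, here is a minimal sketch of an
equality test built on mbstowcs+wcscoll, with the length shortcut
as an option. The helper names (to_wide, lines_equal) and the
length_shortcut flag are illustrative assumptions, not code from
coreutils or from my patch, and the shortcut is exactly the part
whose portability is in doubt.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated multibyte string to a
   freshly allocated wide string using the current locale.  Returns NULL
   on an invalid multibyte sequence or on allocation failure.  */
static wchar_t *to_wide(const char *s, size_t *wlen)
{
    /* Each wide character consumes at least one byte, so the wide
       length never exceeds the byte length.  */
    size_t cap = strlen(s) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (!w)
        return NULL;
    size_t n = mbstowcs(w, s, cap);
    if (n == (size_t)-1) {
        free(w);
        return NULL;
    }
    *wlen = n;
    return w;
}

/* Hypothetical helper: returns 1 if the two lines collate as equal,
   0 if they differ, and -1 if either line could not be converted.
   With length_shortcut set, wide strings of different lengths are
   treated as different without calling wcscoll (an assumption that
   may not hold in every locale).  */
static int lines_equal(const char *a, const char *b, bool length_shortcut)
{
    size_t la = 0, lb = 0;
    wchar_t *wa = to_wide(a, &la);
    wchar_t *wb = to_wide(b, &lb);
    int result = -1;

    if (wa && wb) {
        if (length_shortcut && la != lb)
            result = 0;
        else
            result = wcscoll(wa, wb) == 0;
    }
    free(wa);
    free(wb);
    return result;
}
```

A caller would run setlocale(LC_ALL, "") first so that both
mbstowcs and wcscoll honor the user's locale. Without that call
the program stays in the "C" locale.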

>>>>but it probably is possible in ICU?
>>>
>>>Sorry, don't know.
>>
>>I wonder could we add this as a dependency?
> 
> 
> You mean, ship ICU code?  Or depend on it already being installed?

Probably ship it?

> Sorry, I'm not familiar with the ICU code.  Is it free software and is
> it well maintained?  Where else is it being used, outside ICU itself?

I am not familiar with it myself, but note that
it's used for various things in Python, Mozilla, OpenOffice, ...

>>Also I don't agree with splitting entities into
>>valid multibyte ranges and "C" for the rest.
>>That is probably not what the user wants the data interpreted as,
>>and I think (at least for uniq which I've thought about),
>>that it's just best to treat the whole entity as "C"
>>if there are invalid multibyte sequences in the entity.
> 
> 
> We can't adopt this approach in general, since it would mean that our
> comparison operation could return inconsistent answers.  Suppose "Y"
> has an invalid byte sequence but "X" and "Z" are valid.  Then we might
> have "X" < "Y" < "Z" (using C-locale comparison), but "Z" < "X" (using
> some other locale's comparison).  This will lead to inconsistencies,
> which will be hard to document and will confuse users.

Garbage in, garbage out.
As for confusing users, my solution was to print
a warning about the invalid input.

> Worse, it can
> even lead to buffer overruns: e.g., qsort has undefined behavior if
> you pass it a comparison function that is not a total order.

Thanks for pointing that out.
I'll look into that.
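
For reference, the inconsistency Paul describes can be reproduced
with a small sketch. Everything here is a toy stand-in chosen so it
runs without any particular locale installed: toy_valid treats any
byte >= 0x80 as an invalid sequence, and toy_coll uses ASCII
case-insensitive order in place of a real locale's collation. Only
hybrid_cmp mirrors the idea under discussion (fall back to "C"
comparison when either operand is invalid).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Toy validity check (assumption): any byte >= 0x80 is "invalid".  */
static bool toy_valid(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s >= 0x80)
            return false;
    return true;
}

/* Toy "locale" collation (assumption): ASCII case-insensitive order,
   standing in for a real locale whose order differs from byte order.  */
static int toy_coll(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

/* Hybrid comparison: locale order when both strings are valid,
   byte order (as in the "C" locale) when either one is not.  */
static int hybrid_cmp(const char *a, const char *b)
{
    if (toy_valid(a) && toy_valid(b))
        return toy_coll(a, b);
    return strcmp(a, b);
}
```

With X = "B", Y = "C\xff" and Z = "a", hybrid_cmp reports X < Y and
Y < Z (byte order, because Y is invalid) but also Z < X (toy locale
order, because both are valid). That cycle means the function is
not a total order, which is the property qsort requires.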

cheers,
Pádraig.
