coreutils

Re: multibyte support (round 4)


From: Assaf Gordon
Subject: Re: multibyte support (round 4)
Date: Wed, 16 Aug 2017 22:29:02 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

Hello Sebastián,

On 16/08/17 01:52 AM, Sebastian Kisela wrote:
> Is there a way we could split the work so that I could help you?

Of course - any help would be fantastic.

> Thinking of implementing unfinished multibyte cut -d option might be a good
> start maybe.

Sounds very good.

A few thoughts on "cut -d":

1.
The 'delim' variable:

The current variable 'delim' is "unsigned char".
It likely should not be "wchar_t", as that type can't contain
all possible values on the few OSes that use surrogate pairs
(such as cygwin).
Another possibility is using gnulib's 'ucs4_t', but that assumes
unicode, which we can't assume (the locale could be non-unicode).
Yet another option is a string (char*) - the most versatile but also the
least efficient.

Keeping the LC_ALL=C (unibyte locale) path as efficient as possible is a
priority, so perhaps it's worth having a second variable (e.g. "mb_delim")?

I'm raising this possibility because the current function 'cut_fields()'
uses gnulib's optimized function getndelim2() (which reads input until
one of two delimiters: the field delimiter and/or the line delimiter).
It "just works" for multibyte input with unibyte delimiter -
but won't work for multibyte delimiter.

If a new function is required anyway (e.g. cut_fields_mb()),
perhaps it could just use a different delimiter variable.
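
A rough sketch of the two-variable idea, to make it concrete. Everything
here is hypothetical (the struct, its members and find_delim() are not
taken from cut.c): the unibyte case keeps a single-byte search, and only
a multibyte delimiter takes the slower path.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical representation, not the one used in cut.c.  */
struct delim_spec
{
  unsigned char delim;    /* used when the delimiter is a single byte */
  const char *mb_delim;   /* bytes of a multibyte delimiter, or NULL */
  size_t mb_delim_len;    /* length of mb_delim in bytes (nonzero) */
};

/* Return a pointer to the first delimiter in BUF[0..LEN), or NULL.  */
static const char *
find_delim (const struct delim_spec *d, const char *buf, size_t len)
{
  if (d->mb_delim == NULL)
    return memchr (buf, d->delim, len);   /* unibyte fast path */

  /* Naive byte-wise search.  This is only safe in self-synchronizing
     encodings such as UTF-8; other encodings would need to step
     through the buffer character by character.  */
  for (size_t i = 0; i + d->mb_delim_len <= len; i++)
    if (memcmp (buf + i, d->mb_delim, d->mb_delim_len) == 0)
      return buf + i;
  return NULL;
}
```

With this layout the LC_ALL=C code path never looks at mb_delim beyond
the initial NULL check.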

Lastly, the recent multibyte support in expr(1)
uses gnulib's "mbuiter" module to iterate over multibyte strings.
If 'cut_fields_mb' is written anew, perhaps using that module
could handle multibyte delimiters with ease.
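
To illustrate the kind of character-wise scan I have in mind, here is a
minimal sketch. Note that it uses the standard C mbrtowc() as a stand-in
rather than gnulib's mbuiter, and that mb_find_delim() is a made-up name.
Invalid or truncated sequences are stepped over one byte at a time, so
the input is passed through as-is.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Return the byte offset of the first occurrence of DELIM (one
   character, DELIM_LEN bytes) that starts on a character boundary
   in LINE[0..LEN), or LEN if there is none.  Hypothetical helper.  */
static size_t
mb_find_delim (const char *line, size_t len,
               const char *delim, size_t delim_len)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);

  size_t i = 0;
  while (i < len)
    {
      if (delim_len <= len - i && memcmp (line + i, delim, delim_len) == 0)
        return i;

      size_t n = mbrtowc (NULL, line + i, len - i, &st);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: skip a single byte and
             reset the conversion state.  */
          memset (&st, 0, sizeof st);
          n = 1;
        }
      else if (n == 0)
        n = 1;   /* an embedded NUL occupies one byte */
      i += n;
    }
  return len;
}
```

The result depends on the current locale, so the caller is assumed to
have called setlocale (LC_ALL, "") beforehand.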



2.
Invalid multibyte sequences:
It is important that coreutils programs handle invalid multibyte
sequences without failing (usually passing the input as-is).

However, perhaps it is acceptable to reject invalid multibyte sequences
in the delimiter? That would make processing slightly simpler.

Something like:
    $ LC_ALL=C cut -d $'\x98' -f 2   # Valid in C locale
versus:
    $ LC_ALL=en_CA.UTF-8 cut -d $'\x98' -f 2
    cut: invalid delimiter character in the current locale
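
A possible check for this, again only a sketch under assumptions:
delim_is_valid() is a hypothetical name, and it relies on the standard
mbrtowc() and MB_CUR_MAX rather than on anything in cut.c. In a unibyte
locale any single byte is accepted; in a multibyte locale the argument
must be exactly one complete, valid character.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return true if S[0..LEN) is acceptable as a delimiter in the
   current locale.  Hypothetical helper.  */
static bool
delim_is_valid (const char *s, size_t len)
{
  if (len == 0)
    return false;
  if (MB_CUR_MAX == 1)
    return len == 1;          /* unibyte locale: any one byte is fine */

  mbstate_t st;
  memset (&st, 0, sizeof st);
  size_t n = mbrtowc (NULL, s, len, &st);
  if (n == (size_t) -1 || n == (size_t) -2)
    return false;             /* invalid or incomplete sequence */
  if (n == 0)
    return len == 1;          /* a lone NUL byte */
  return n == len;            /* exactly one character, no trailing bytes */
}
```

In a UTF-8 locale the byte \x98 is a lone continuation byte, so
mbrtowc() reports an invalid sequence and the delimiter would be
rejected, as in the second example above.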



Everyone is very welcome to chime in about these or other issues.

Please do post any updates or patches,
and I'll update the web page with any further progress.

regards,
 - assaf











