From: Bruno Haible
Subject: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Sun, 19 Sep 2010 15:13:38 +0200
User-agent: KMail/1.9.9

Hi Pádraig,

> This is my start at applying robust and efficient multi-byte
> processing to coreutils.

Actually, it is a continuation of the discussion, based on the patch
from March 2009
<http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html>.

What has changed since then?

Ad 1) Which functions to use for case comparison in coreutils?
I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to
ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that
the sorting should obey the LC_COLLATE environment variable, i.e. use locale
dependent collation rules.

Ad 2) How should the case comparison in "sort -f" behave?
I think you need to decide on this before changing 'join', because 'join'
and 'sort' need to behave the same way; otherwise 'join' will err out when
processing the output of 'sort'.

Ad 3) Executable size.
This has now been resolved through the creation of libunistring as a shared
library.


About multibyte processing vs. UTF-8 processing:
> I was wondering about unconditionally converting
> input to UTF-8 before processing. I didn't do this yet as:
>   Currently we support invalid input chars by processing as in the C locale.

Correct. This is one of the major design decisions Paul, Jim, and I agreed upon
in 2001. It is this requirement which forbids converting the input to a wchar_t
stream, doing processing with wchar_t objects, and producing a stream of wchar_t
objects that are finally converted to multibyte representation again.

It is this requirement which also forbids converting the input to UTF-8, doing
processing with Unicode characters, and converting the Unicode character stream
to multibyte representation at the end. This approach is acceptable for a word
processor that can refuse to open a file, or for more general applications.
But for coreutils, where the classical behaviour is to get reasonable processing
of files encoded in UTF-8, EUC-JP, or ISO-8859-2 even in the "C" locale, this
approach is not workable.

For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', and
'mbfile', which provide a "multibyte character" datatype that also accommodates
invalid byte sequences.
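
As a minimal sketch (not code from the patch; the function name is made up),
a scan over a line with mbiter looks roughly like this; an invalid byte is
delivered as a single one-byte character instead of stopping the scan:

  #include <string.h>
  #include "mbiter.h"

  static void
  scan_line (const char *line)
  {
    mbi_iterator_t iter;
    for (mbi_init (iter, line, strlen (line));
         mbi_avail (iter);
         mbi_advance (iter))
      {
        if (mb_isblank (mbi_cur (iter)))
          {
            /* A separator: mbi_cur_ptr (iter) points at its first byte,
               mb_len (mbi_cur (iter)) is its length in bytes.  */
          }
      }
  }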

Emacs handles this requirement by extending UTF-8. But this approach is unique
to Emacs: libunistring and other software support plain UTF-8, not extended
UTF-8.

> wanted to transform the input as little as possible.

Yes, this is a wise guideline.

>   I was unsure how to generally correlate portions of a UTF-8 string with its
>   corresponding source string

The u8_conv_from_encoding function
<http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html>
stores offsets in an array that gives you a mapping from positions in the
source string to positions in the UTF-8 string.
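
For example (a sketch only; the helper name is mine), a conversion step can
record, for each byte position in the source, the offset of the corresponding
character in the UTF-8 result:

  #include <stdlib.h>
  #include "localcharset.h"
  #include "uniconv.h"

  /* Convert SRC (SRCLEN bytes, locale encoding) to UTF-8.  On success,
     (*OFFSETS_P)[i] is the offset in the result of the character that
     starts at SRC[i], or (size_t) -1 if SRC[i] is not the start of a
     character.  Returns NULL (leaving the caller on the original-bytes
     path) if the input cannot be converted.  */
  static uint8_t *
  to_utf8_with_offsets (const char *src, size_t srclen,
                        size_t **offsets_p, size_t *u8len_p)
  {
    size_t *offsets = malloc (srclen * sizeof *offsets);
    if (offsets == NULL)
      return NULL;
    uint8_t *u8 = u8_conv_from_encoding (locale_charset (), iconveh_error,
                                         src, srclen, offsets, NULL, u8len_p);
    if (u8 == NULL)
      {
        free (offsets);
        return NULL;
      }
    *offsets_p = offsets;
    return u8;
  }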

> I may revisit the idea of processing using UTF8 internally if we support
> normalization internally to the utils.

I would advise against normalization that forgets the original string. Of
course, you can associate the original string with the UTF-8 equivalent, if
that exists, and keep both in memory; that becomes merely a speed vs. memory
trade-off.
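
For illustration only, the pairing could be as simple as a hypothetical
record like:

  struct key_text
  {
    char *orig;      /* the bytes as read; always present              */
    size_t origlen;
    uint8_t *u8;     /* UTF-8 (possibly normalized) equivalent, or NULL */
    size_t u8len;    /*   if the bytes could not be converted           */
  };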

> I do try to special case UTF-8 where beneficial

I think this should be done in an encoding-agnostic way. Don't test whether
the locale encoding is "UTF-8". This is bad, because
  - The GB18030 encoding is just as powerful as UTF-8; it's just a different
    encoding.
  - The same handling of the Turkish i should be done in ISO-8859-9 locales
    as in UTF-8 locales. The same handling of the Greek sigma should be done
    in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised
    if moving from a unibyte encoding to UTF-8 changed the semantics of
    processing their text.

> One consequence of linking with libunistring ... printf ...
> incurs a 16% startup overhead

We discussed this in the thread at
<http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00000.html>.

> On a related note, I noticed that bootstrap doesn't update ./configure 
> correctly

See <http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html>.


Now, going through your patch.

- Function xfields. I agree with the use of mbiter for the multibyte case and of
  'char' variables for the unibyte case. But the algorithm is duplicated, once
  for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001, Paul and Jim said that
  they wanted
    - correct handling of multibyte locales, even with invalid input,
    - no speed decrease for unibyte locales,
    - no code duplication,
    - maintainable code.
  At that time I wrote macros which expand either to uses of 'mbiter' or to
  plain 'char' variables, depending on a preprocessor switch. See the attached
  diff; a rough sketch of the idea follows below. At that time, Jim did not
  like the extra file. Two months ago, he told me an extra file would be
  acceptable if we really find no better solution.
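
  Concretely, iteration macros of this kind (an illustration, not the attached
  diff) let the splitting loop be written only once, expanding to mbiter in
  the multibyte case and to plain pointer arithmetic in the unibyte case:

    #if MULTIBYTE                        /* hypothetical switch */
    # include "mbiter.h"
    # define ITER_T              mbi_iterator_t
    # define ITER_INIT(i, s, n)  mbi_init (i, s, n)
    # define ITER_MORE(i, lim)   mbi_avail (i)
    # define ITER_NEXT(i)        mbi_advance (i)
    # define ITER_ISBLANK(i)     mb_isblank (mbi_cur (i))
    # define ITER_PTR(i)         mbi_cur_ptr (i)
    #else
    # include <ctype.h>
    # define ITER_T              char const *
    # define ITER_INIT(i, s, n)  ((i) = (s))
    # define ITER_MORE(i, lim)   ((i) < (lim))
    # define ITER_NEXT(i)        ((i)++)
    # define ITER_ISBLANK(i)     isblank ((unsigned char) *(i))
    # define ITER_PTR(i)         (i)
    #endif

    /* The field-splitting loop is then written only once:
         ITER_T it;
         for (ITER_INIT (it, buf, len); ITER_MORE (it, buf + len);
              ITER_NEXT (it))
           if (ITER_ISBLANK (it))
             ...                                                       */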

- Function xfields. The "FIXME: ... when splitting on non blanks, Surrounding
  chars might be needed for context". I would say in this case, context should
  be ignored. If someone asks to split at Greek sigmas, he certainly wants to
  consider all Greek sigmas, not just some of them depending on their context.
  So I would turn this "FIXME" into a note.

- Function keycmp. For speed, it would be useful to determine
  uc_locale_language () only once; see the sketch below.
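
  For instance (hypothetical helper name), it can be computed on first use and
  reused afterwards, which is safe because the locale does not change while
  join runs:

    #include "unicase.h"

    static const char *
    cached_locale_language (void)
    {
      static const char *lang;
      if (lang == NULL)
        lang = uc_locale_language ();
      return lang;
    }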

- Function main, handling of 't' argument. Let's be clear about what POSIX
  says <http://www.opengroup.org/onlinepubs/9699919799/utilities/join.html>:
  The 't' option takes a single character as argument, and the LC_CTYPE category
  of the current locale determines the boundaries of characters.
  You _can_ allow "combined characters" as 't' arguments, but that is an
  extension to POSIX, so one should make sure that
    1. this extension does not cause regression to the POSIX part of the
       functionality,
    2. it is turned off when the environment variable POSIXLY_CORRECT is set.

+#if HAVE_LIBUNISTRING
+            else
+              {
+                if (!(u8tab = u8_str_from_locale (optarg)))
+                  error (EXIT_FAILURE, errno, _("error converting tab %s"),
+                         quote (optarg));
+              }
+#endif

  IMO, if the conversion cannot be performed or libunistring is not present,
  you should use mbrtowc to test whether optarg contains only one character.
  Or equivalently, test    mbslen (optarg) > 1.
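
  Something along these lines would do (the helper name is mine); the
  combining-sequence extension discussed above could additionally be disabled
  when getenv ("POSIXLY_CORRECT") is set:

    #include <stdbool.h>
    #include <string.h>
    #include <wchar.h>

    /* Return true if ARG consists of exactly one character according to
       the LC_CTYPE category of the current locale.  A lone invalid byte
       also counts as one character.  */
    static bool
    is_single_mb_char (const char *arg)
    {
      size_t len = strlen (arg);
      mbstate_t state;
      memset (&state, 0, sizeof state);
      size_t n = mbrtowc (NULL, arg, len, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return len == 1;
      return n == len;
    }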

+#if HAVE_LIBUNISTRING
+                    /* We support only single characters to support existing
+                       documentation, and to restrict possible future character
+                       combinations, like the current "\0".  */
+                    if (u8_base_chars (u8tab) > 1)
+#endif
+                      error (EXIT_FAILURE, 0, _("multi-character tab %s"),
+                             quote (optarg));

  Putting an if condition inside #if is confusing. I would avoid that style.
  And, as said above, it would be better to consider POSIXLY_CORRECT here.

- Comments in function u8_base_chars: The POSIX notion of "character" is a
  multibyte character. When we convert multibyte characters to Unicode
  characters, it is possible to map one POSIX character to 2 Unicode characters.
  Your function u8_base_chars returns the number of so-called
  "complex characters"
  <http://opengroup.org/onlinepubs/007908775/xcurses/intov.html>.
  It appears correct to use UC_PROPERTY_COMBINING here (as opposed to just the
  combining class).
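
  For reference, a counting function in that spirit (an illustration, not
  Pádraig's u8_base_chars) would skip every code point that has the combining
  property:

    #include "unistr.h"
    #include "unictype.h"

    static size_t
    count_base_chars (const uint8_t *s, size_t n)
    {
      size_t count = 0;
      for (size_t i = 0; i < n; )
        {
          ucs4_t uc;
          int bytes = u8_mbtouc (&uc, s + i, n - i);
          if (!uc_is_property_combining (uc))
            count++;
          i += bytes;
        }
      return count;
    }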

About the unit test: I would also add some tests with the EUC-JP encoding.
It is so easy to assume that every 'const char *' is UTF-8 encoded, especially
for future generations of programmers who did not grow up with ISO-8859-1, and
especially when libunistring is on the programmers' minds.

Bruno

Attachment: join-bruno2001.diff
Description: Text Data

