bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#24603: [RFC 10/18] Implement Turkic dotless and dotted i handling when casing strings
Date:	Tue, 04 Oct 2016 10:12:37 +0300

> From: Michal Nazarewicz <mina86@mina86.com>
> Date: Tue,  4 Oct 2016 03:10:33 +0200
> 
> +
> +  /* FIXME: Is current-iso639-language the best source of that information?  */
> +  lang = Vcurrent_iso639_language;
> +  tr = intern_c_string ("tr");
> +  az = intern_c_string ("az");
> +  if (SYMBOLP (lang))
> +    {
> +      l = lang;
> +      goto check_language;
> +    }
> +  while (CONSP (lang))
> +    {
> +      l = XCAR (lang);
> +      lang = XCDR (lang);
> +    check_language:
> +      if (EQ (l, tr) || EQ (l, az))
> +       {
> +         ctx->treat_turkic_i = true;
> +         break;
> +       }
> +    }

I'm not sure I like this mechanism.  AFAIU, current-iso639-language is
a read-only variable that conveys the outside locale's language.  So
the above would limit this feature to users in the corresponding
locales, which is against Emacs's design as a multilingual system.  We
should allow Lisp applications and users in _any_ locale to take
advantage of this feature.

So I suggest a separate variable which, when non-nil, will cause these
conversions to take effect.  Lisp applications could then bind that
variable when they want these special conversions.  (With an eye
towards future developments, as hinted by the rest of Unicode's
SpecialCasing.txt file, perhaps don't make the variable's name mention
a specific language, but instead make its value a language symbol,
such as 'tr or 'az.)  We could make it a defcustom, if we think users
will want to turn this on as their default.
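
To illustrate the suggestion, here is a minimal sketch in standalone C,
not written against the Emacs internals.  The names
special_casing_language and use_turkic_i_rules are hypothetical; a real
patch would presumably hold a Lisp symbol in a variable defined in C,
which is modelled here as a string, with NULL standing in for nil.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the proposed Lisp variable: NULL models nil
   (no special casing); otherwise a language code such as "tr" or "az".  */
static const char *special_casing_language = NULL;

/* Return true if the Turkic dotted/dotless i rules should apply.
   The decision depends only on the variable, not on the locale.  */
static bool
use_turkic_i_rules (void)
{
  const char *lang = special_casing_language;
  return lang != NULL
	 && (strcmp (lang, "tr") == 0 || strcmp (lang, "az") == 0);
}

/* Model of a dynamic binding: set the variable for the duration of one
   query, then restore the previous value.  */
static bool
use_turkic_i_rules_with_language (const char *lang)
{
  const char *saved = special_casing_language;
  special_casing_language = lang;
  bool result = use_turkic_i_rules ();
  special_casing_language = saved;
  return result;
}
```

The point of the sketch is that a caller in any locale can turn the
rules on for one operation and leave the default untouched afterwards.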

> +/* Normalise CFG->flag and return CASE_UP, CASE_DOWN, CASE_CAPITALIZE or
      ^^^^^^^^^
A nit: we use US English spelling, so "Normalize".

> +static enum case_action
> +normalise_flag (struct casing_context *ctx)
   ^^^^^^^^^
Likewise.

> +{
> +  /* Normalise flag so its one of CASE_UP, CASE_DOWN or CASE_CAPITALIZE. */

This comment repeats what was already said above.

>  /* In Greek, lower case sigma has two forms: one when used in the middle and one
> @@ -152,6 +192,13 @@ case_character_impl (struct casing_str_buf *buf,
>  #define CAPITAL_SIGMA     0x03A3
>  #define SMALL_SIGMA       0x03C3
>  #define SMALL_FINAL_SIGMA 0x03C2
> +
> +/* Azeri and Turkish have dotless and dotted i.  An upper case of i is
> +   İ while lower case of I is ı. */
> +
> +#define CAPITAL_DOTTED_I    0x130
> +#define SMALL_DOTLESS_I     0x131
> +#define COMBINING_DOT_ABOVE 0x307

How about deriving these rules from SpecialCasing.txt and storing them
in some char-table, instead of hard-coding them in C?  That would
allow us to update these features more easily with each release of the
Unicode Standard.
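
A rough sketch of what a table-driven version could look like, in
standalone C.  It uses a plain array rather than an Emacs char-table,
and the struct and function names are hypothetical.  The entries are
copied by hand from the Turkish and Azeri section of SpecialCasing.txt;
the context conditions (After_I, Not_Before_Dot), the title-case column
and multi-character mappings are left out to keep it short.

```c
#include <stddef.h>
#include <string.h>

/* One simplified SpecialCasing.txt record: code point, its lower- and
   upper-case mappings, and the language the rule is conditioned on.  */
struct special_casing_rule
{
  int code;
  int lower;
  int upper;
  const char *lang;
};

/* Hand-transcribed entries; a real implementation would generate this
   table from the data file at build time.  */
static const struct special_casing_rule special_casing_rules[] = {
  { 0x0130, 0x0069, 0x0130, "tr" },	/* capital dotted I -> i */
  { 0x0130, 0x0069, 0x0130, "az" },
  { 0x0049, 0x0131, 0x0049, "tr" },	/* I -> dotless i (condition ignored) */
  { 0x0049, 0x0131, 0x0049, "az" },
  { 0x0069, 0x0069, 0x0130, "tr" },	/* i -> capital dotted I */
  { 0x0069, 0x0069, 0x0130, "az" },
};

/* Find the rule for CODE in language LANG, or NULL if there is none.  */
static const struct special_casing_rule *
find_special_rule (int code, const char *lang)
{
  size_t n = sizeof special_casing_rules / sizeof special_casing_rules[0];
  for (size_t i = 0; i < n; i++)
    if (special_casing_rules[i].code == code
	&& strcmp (special_casing_rules[i].lang, lang) == 0)
      return &special_casing_rules[i];
  return NULL;
}

/* Return the special upper-case mapping, or -1 when no rule applies and
   the caller should fall back to the regular case tables.  */
static int
special_upcase (int code, const char *lang)
{
  const struct special_casing_rule *r = find_special_rule (code, lang);
  return r ? r->upper : -1;
}

/* Likewise for the lower-case mapping.  */
static int
special_downcase (int code, const char *lang)
{
  const struct special_casing_rule *r = find_special_rule (code, lang);
  return r ? r->lower : -1;
}
```

With the rules kept as data, picking up a new Unicode release means
regenerating the table; the lookup code stays the same.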

> +  if (flag != CASE_NO_ACTION && __builtin_expect(ctx->treat_turkic_i, false))

I don't think we can use __builtin_expect here; AFAIK it's a GCC
builtin, so it isn't portable to compilers that don't provide it.
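
If the branch hint is wanted at all, one common pattern is to hide the
builtin behind a macro that degrades to the plain condition elsewhere.
This is only a sketch: the macro name UNLIKELY and the toy function
pick_path are hypothetical, and whether the hint is worth keeping is a
separate question.

```c
#include <stdbool.h>

/* Hypothetical macro: expands to the branch hint only where the compiler
   advertises GCC compatibility, and to the plain condition elsewhere.  */
#if defined __GNUC__ && __GNUC__ >= 3
# define UNLIKELY(cond) __builtin_expect (!!(cond), 0)
#else
# define UNLIKELY(cond) (!!(cond))
#endif

/* Toy use: the result is the same with or without the hint.  */
static int
pick_path (bool treat_turkic_i)
{
  if (UNLIKELY (treat_turkic_i))
    return 1;	/* rare, special-cased path */
  return 0;	/* common path */
}
```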

> +      if (len_bytes > 0)
> +     src += len_bytes;
> +      size -= len_bytes > 0 ? 2 : 1;

Another nit: please use whitespace consistently in the indentation:
either TABs and spaces throughout, or just spaces.  (I think our
default is the former for now.)

Thanks.
