bug#24603: [PATCHv5 05/11] Support casing characters which map into mult

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#24603: [PATCHv5 05/11] Support casing characters which map into multiple code points (bug#24603)
Date:	Sat, 11 Mar 2017 11:14:53 +0200

> From: Michal Nazarewicz <mina86@mina86.com>
> Date: Thu,  9 Mar 2017 22:51:44 +0100
> 
> Implement unconditional special casing rules defined in Unicode standard.
> 
> Among other things, they deal with cases when a single code point is
> replaced by multiple ones because single character does not exist (e.g.
> ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning
> into SS).
> 
> * admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
> standard distribution.
> * admin/unidata/README: Mention SpecialCasing.txt.
> 
> * admin/unidata/unidata-gen.el (unidata-gen-table-special-casing): New
> function for generating ‘special-casing’ character Unicode property
> built from the SpecialCasing.txt Unicode data file.

This new property is accessible via get-char-code-property, right?  If
so, it should be documented in the Elisp manual, in the "Character
Properties" node.

I think I'd also like to see a few simple tests for this property.
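
A rough sketch of what such tests could look like, using ERT.  The
test names are hypothetical, and since this message does not say what
value the special-casing property holds, the sketch only checks the
upcase behavior documented in the strings.texi hunk of this patch:

```elisp
(require 'ert)

;; Hypothetical test names; the expected values are taken from the
;; examples in the strings.texi hunk of this patch.
(ert-deftest casing-sketch-upcase-ligature-string ()
  "Upcasing a one-character string may yield a longer string."
  (should (equal (upcase "ﬁ") "FI"))
  (should (equal (upcase (string ?ﬁ)) "FI")))

(ert-deftest casing-sketch-upcase-ligature-char ()
  "Upcasing a character returns a character, so ?ﬁ stays unchanged."
  (should (eq (upcase ?ﬁ) ?ﬁ)))
```

Tests for the property itself would additionally call
get-char-code-property with the special-casing symbol, once the format
of its value is documented.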

> diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
> index cf47db4a814..ba1cf2606ce 100644
> --- a/doc/lispref/strings.texi
> +++ b/doc/lispref/strings.texi
> @@ -1166,6 +1166,29 @@ Case Conversion
>  @end example
>  @end defun
>  
> +  Note that case conversion is not a one-to-one mapping and the length
> +of the result may differ from the length of the argument (including
> +being shorter).  Furthermore, because passing a character forces
> +return type to be a character, functions are unable to perform proper
> +substitution and result may differ compared to treating
> +a one-character string.  For example:
> +
> +@example
> +@group
> +(upcase "ﬁ")  ; note: single character, ligature "fi"
> +     @result{} "FI"
> +@end group
> +@group
> +(upcase ?ﬁ)
> +     @result{} 64257  ; i.e. ?ﬁ
> +@end group
> +@end example
> +
> +  To avoid this, a character must first be converted into a string,
> +using @code{string} function, before being passed to one of the casing
> +functions.  Of course, no assumptions on the length of the result may
> +be made.

Once the ELisp manual describes the new special-casing property, the
above text should include a cross-reference to that description.

>  DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
>         doc: /* Convert argument to upper case and return that.
>  The argument may be a character or string.  The result has the same type.
> -The argument object is not altered--the value is a copy.
> +The argument object is not altered--the value is a copy.  If argument
> +is a character, characters which map to multiple code points when
> +cased, e.g. ﬁ, are returned unchanged.
>  See also `capitalize', `downcase' and `upcase-initials'.  */)

This (and other similar doc strings) should mention the special-casing
property as the way to know in advance which characters will be
returned unchanged for this reason.
