bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII

bug-gnu-emacs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII

From:	Michal Nazarewicz
Subject:	bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case
Date:	Tue, 4 Oct 2016 03:10:39 +0200

Use a lookup table to map Unicode general categories to character
categories.  This generalises lowercasep, uppercasep et al. functions.

Furthermore, provide another lookup table for ASCII characters such that
the common case can be optimised and Unicode general category lookup
avoided.

Using lookup table in place of conditions may theoretically improve
performance even though I have not measure it.  Moreover though, having
the lookup table will allow regex engine to be optimised in an upcoming
patch.  Stay tuned. ;)

* src/character.c (alphanumericp, alphabeticp, uppercasep, lowercasep,
graphicp, printablep): Replaced with static inline functions define in
the header file.
(category_char_bits): New lookup table mapping Unicode
general category to character classes.
(ascii_char_bits): New lookup table mapping ASCII characters to
character classes.

* src/character.h (unicode_alphanumericp, unicode_alphabeticp,
unicode_uppercasep, unicode_lowercasep, unicode_graphicp,
unicode_printablep, _ascii_alphanumericp, _ascii_alphabeticp,
_ascii_uppercasep, _ascii_lowercasep, _ascii_graphicp,
_ascii_printablep): New static inline functions which are special cases
of respective unprefixed functions.

* src/regex.c (ISALNUM, ISALPHA): Remove special cases for ASCII
characters since alphanumericp and uppercasep already handle those.
---
 src/character.c | 168 +++++++++++++++++++++++++-------------------------------
 src/character.h |  77 +++++++++++++++++++++++---
 src/regex.c     |  20 ++-----
 3 files changed, 151 insertions(+), 114 deletions(-)

diff --git a/src/character.c b/src/character.c
index 707ae10..63f89d3 100644
--- a/src/character.c
+++ b/src/character.c
@@ -960,104 +960,88 @@ character is not ASCII nor 8-bit character, an error is 
signaled.  */)
   return make_number (c);
 }
 
-static unicode_category_t
+/* Return C’s Unicode general category (or UNICODE_CATEGORY_UNKNOWN). */
+unicode_category_t
 char_unicode_category (int c)
 {
   Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
   return INTEGERP (category) ? XINT (category) : UNICODE_CATEGORY_UNKNOWN;
 }
 
-/* Return true if C is a upper case character.  This does not imply mean it
-   has a lower case form. */
-bool
-uppercasep (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  There are additional characters that should be
-     here, those designated as Other_uppercase; FIXME.  */
-  return gen_cat == UNICODE_CATEGORY_Lu;
-}
-
-/* Return true if C is a lower case character.  This does not imply mean it
-   has an upper case form. */
-bool
-lowercasep (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  There are additional characters that should be
-     here, those designated as Other_lowercase; FIXME.  */
-  return gen_cat == UNICODE_CATEGORY_Ll;
-}
-
-/* Return true if C is an alphabetic character.  */
-bool
-alphabeticp (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  There are additional characters that should be
-     here, those designated as Other_uppercase, Other_lowercase,
-     and Other_alphabetic; FIXME.  */
-  return (gen_cat == UNICODE_CATEGORY_Lu
-         || gen_cat == UNICODE_CATEGORY_Ll
-         || gen_cat == UNICODE_CATEGORY_Lt
-         || gen_cat == UNICODE_CATEGORY_Lm
-         || gen_cat == UNICODE_CATEGORY_Lo
-         || gen_cat == UNICODE_CATEGORY_Mn
-         || gen_cat == UNICODE_CATEGORY_Mc
-         || gen_cat == UNICODE_CATEGORY_Me
-         || gen_cat == UNICODE_CATEGORY_Nl);
-}
-
-/* Return true if C is a alphabetic or decimal-number character.  */
-bool
-alphanumericp (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  Same comment as for alphabeticp applies.  FIXME. */
-  return (gen_cat == UNICODE_CATEGORY_Lu
-         || gen_cat == UNICODE_CATEGORY_Ll
-         || gen_cat == UNICODE_CATEGORY_Lt
-         || gen_cat == UNICODE_CATEGORY_Lm
-         || gen_cat == UNICODE_CATEGORY_Lo
-         || gen_cat == UNICODE_CATEGORY_Mn
-         || gen_cat == UNICODE_CATEGORY_Mc
-         || gen_cat == UNICODE_CATEGORY_Me
-         || gen_cat == UNICODE_CATEGORY_Nl
-         || gen_cat == UNICODE_CATEGORY_Nd);
-}
-
-/* Return true if C is a graphic character.  */
-bool
-graphicp (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  */
-  return (!(gen_cat == UNICODE_CATEGORY_UNKNOWN
-           || gen_cat == UNICODE_CATEGORY_Zs /* space separator */
-           || gen_cat == UNICODE_CATEGORY_Zl /* line separator */
-           || gen_cat == UNICODE_CATEGORY_Zp /* paragraph separator */
-           || gen_cat == UNICODE_CATEGORY_Cc /* control */
-           || gen_cat == UNICODE_CATEGORY_Cs /* surrogate */
-           || gen_cat == UNICODE_CATEGORY_Cn)); /* unassigned */
-}
-
-/* Return true if C is a printable character.  */
-bool
-printablep (int c)
-{
-  unicode_category_t gen_cat = char_unicode_category (c);
-
-  /* See UTS #18.  */
-  return (!(gen_cat == UNICODE_CATEGORY_UNKNOWN
-           || gen_cat == UNICODE_CATEGORY_Cc /* control */
-           || gen_cat == UNICODE_CATEGORY_Cs /* surrogate */
-           || gen_cat == UNICODE_CATEGORY_Cn)); /* unassigned */
-}
+#define CHAR_BIT_ALNUM_ CHAR_BIT_ALNUM | CHAR_BIT_GRAPH | CHAR_BIT_PRINT
+#define CHAR_BIT_ALPHA_ CHAR_BIT_ALPHA | CHAR_BIT_ALNUM_
+
+/* See UTS #18 and DerivedCoreProperties.txt.  alpha, alnum, upper and
+   lower are missing some characters, namely those designated as
+   Other_uppercase, Other_lowercase and Other_alphabetic; FIXME.  */
+
+const unsigned char category_char_bits[] = {
+  [UNICODE_CATEGORY_UNKNOWN] = 0,
+  [UNICODE_CATEGORY_Lu] = CHAR_BIT_ALPHA_ | CHAR_BIT_UPPER,
+  [UNICODE_CATEGORY_Ll] = CHAR_BIT_ALPHA_ | CHAR_BIT_LOWER,
+  [UNICODE_CATEGORY_Lt] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Lm] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Lo] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Mn] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Mc] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Me] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_Nd] = CHAR_BIT_ALNUM_,
+  [UNICODE_CATEGORY_Nl] = CHAR_BIT_ALPHA_,
+  [UNICODE_CATEGORY_No] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Pc] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Pd] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Ps] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Pe] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Pi] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Pf] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Po] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Sm] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Sc] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Sk] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_So] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Zs] = CHAR_BIT_PRINT,
+  [UNICODE_CATEGORY_Zl] = CHAR_BIT_PRINT,
+  [UNICODE_CATEGORY_Zp] = CHAR_BIT_PRINT,
+  [UNICODE_CATEGORY_Cc] = 0,
+  [UNICODE_CATEGORY_Cf] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Cs] = 0,
+  [UNICODE_CATEGORY_Co] = CHAR_BIT_PRINT | CHAR_BIT_GRAPH,
+  [UNICODE_CATEGORY_Cn] = 0,
+};
+
+#undef CHAR_BIT_ALNUM_
+#undef CHAR_BIT_ALPHA_
+
+#define P_ CHAR_BIT_PRINT
+#define G_ CHAR_BIT_GRAPH | P_
+#define N_ CHAR_BIT_ALNUM | G_
+#define U_ CHAR_BIT_UPPER | CHAR_BIT_ALPHA | N_
+#define L_ CHAR_BIT_LOWER | CHAR_BIT_ALPHA | N_
+
+const unsigned char ascii_char_bits[] = {
+/*\0  ...                                                     \17 */
+   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+/*\20  ...                                                    \37 */
+   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+/*' ' '!' '"' '#' '$' '%' '&' '´' '(' ')' '*' '+' ',' '-' '.' '/' */
+  P_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_, G_,
+/*'0' '1' '2' '3' '4' '5' '6' '7' '8' '9' ':' ';' '<' '=' '>' '?' */
+  N_, N_, N_, N_, N_, N_, N_, N_, N_, N_, G_, G_, G_, G_, G_, G_,
+/*'@' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' */
+  G_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_,
+/*'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' '[' '\' ']' '^' '_' */
+  U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, U_, G_, G_, G_, G_, G_,
+/*'`' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' */
+  G_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_,
+/*'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' '{' '|' '}' '~' \177 */
+  L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, L_, G_, G_, G_, G_,  0,
+};
+
+#undef P_
+#undef G_
+#undef N_
+#undef U_
+#undef L_
 
 void
 syms_of_character (void)
diff --git a/src/character.h b/src/character.h
index 5931c5c..6dc95ad 100644
--- a/src/character.h
+++ b/src/character.h
@@ -652,8 +652,78 @@ typedef enum {
   UNICODE_CATEGORY_Cs,
   UNICODE_CATEGORY_Co,
   UNICODE_CATEGORY_Cn
+  /* Don’t forget to extend category_char_bits in character.c when new entries
+     are added here. */
 } unicode_category_t;
 
+extern unicode_category_t char_unicode_category (int);
+
+/* Limited set of character categories which syntax-independent.  Testing of
+ * those characters do not require any run-time data, e.g. do not depend on
+ * syntax table. */
+#define CHAR_BIT_ALNUM        (1 << 0)
+#define CHAR_BIT_ALPHA        (1 << 1)
+#define CHAR_BIT_UPPER        (1 << 2)
+#define CHAR_BIT_LOWER        (1 << 3)
+#define CHAR_BIT_GRAPH        (1 << 4)
+#define CHAR_BIT_PRINT        (1 << 5)
+
+/* Map from Unicode general category to character classes the character is in.
+ *
+ * Only character classes defined by CHAR_BIT_* above are present.
+ *
+ * This is an array of bit fields so for example ‘category_char_bits[gc] &
+ * CHAR_BIT_ALPHA’ tells you whether characters in general category GC are
+ * alphabetic or not. */
+extern const unsigned char category_char_bits[];
+
+/* Map from ASCII character to character classes the character is in.
+ *
+ * Only character classes defined by CHAR_BIT_* above are present.
+ *
+ * This is an array of bit fields so for example ascii_char_bits[ch] &
+ * CHAR_BIT_ALPHA’ tells you whether character CH is alphabetic or not. */
+extern const unsigned char ascii_char_bits[128];
+
+#define DEFINE_CHAR_TEST(name, bit)                                    \
+  static inline bool unicode_ ## name (int c) {                        \
+    return category_char_bits[char_unicode_category(c)] & bit;         \
+  }                                                                    \
+  static inline bool _ascii_ ## name (int c) {                         \
+    return ascii_char_bits[c] & bit;                                   \
+  }                                                                    \
+  static inline bool name (int c) {                                    \
+    return (unsigned)c < (sizeof ascii_char_bits / sizeof *ascii_char_bits) ? \
+      _ascii_ ## name (c) : unicode_ ## name (c);                      \
+  }
+
+/* For TEST being one of:
+     alphanumericp
+     alphabeticp
+     uppercasep
+     lowercasep
+     graphicp
+     printablep
+   define
+     bool TEST (int c);
+     bool unicode_TEST (int c);
+     bool _ascii_TEST (int c);
+   which test whether C has given character property.  TEST works for any
+   character, Unicode or not.  unicode_TEST works for any character as well but
+   is potentially slower for ASCII characters (since it requires Unicode
+   category lookup).  _ascii_TEST works for ASCII characters only and creates
+   naked singularity if non-ASCII character is passed to it. */
+
+DEFINE_CHAR_TEST (alphanumericp, CHAR_BIT_ALNUM)
+DEFINE_CHAR_TEST (alphabeticp, CHAR_BIT_ALPHA)
+DEFINE_CHAR_TEST (uppercasep, CHAR_BIT_UPPER)
+DEFINE_CHAR_TEST (lowercasep, CHAR_BIT_LOWER)
+DEFINE_CHAR_TEST (graphicp, CHAR_BIT_GRAPH)
+DEFINE_CHAR_TEST (printablep, CHAR_BIT_PRINT)
+
+#undef DEFINE_CHAR_TEST
+
+
 extern EMACS_INT char_resolve_modifier_mask (EMACS_INT) ATTRIBUTE_CONST;
 extern int char_string (unsigned, unsigned char *);
 extern int string_char (const unsigned char *,
@@ -676,13 +746,6 @@ extern ptrdiff_t lisp_string_width (Lisp_Object, ptrdiff_t,
 extern Lisp_Object Vchar_unify_table;
 extern Lisp_Object string_escape_byte8 (Lisp_Object);
 
-extern bool uppercasep (int);
-extern bool lowercasep (int);
-extern bool alphabeticp (int);
-extern bool alphanumericp (int);
-extern bool graphicp (int);
-extern bool printablep (int);
-
 /* Return a translation table of id number ID.  */
 #define GET_TRANSLATION_TABLE(id) \
   (XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))
diff --git a/src/regex.c b/src/regex.c
index 1917a84..02da1fb 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -313,6 +313,11 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 
};
 
 /* The rest must handle multibyte characters.  */
 
+# define ISALNUM(c) alphanumericp (c)
+# define ISALPHA(c) alphabeticp (c)
+# define ISUPPER(c) uppercasep (c)
+# define ISLOWER(c) lowercasep (c)
+
 # define ISGRAPH(c) (SINGLE_BYTE_CHAR_P (c)                            \
                     ? (c) > ' ' && !((c) >= 0177 && (c) <= 0240)       \
                     : graphicp (c))
@@ -321,19 +326,6 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 
};
                    ? (c) >= ' ' && !((c) >= 0177 && (c) <= 0237)       \
                     : printablep (c))
 
-# define ISALNUM(c) (IS_REAL_ASCII (c)                 \
-                   ? (((c) >= 'a' && (c) <= 'z')       \
-                      || ((c) >= 'A' && (c) <= 'Z')    \
-                      || ((c) >= '0' && (c) <= '9'))   \
-                   : alphanumericp (c))
-
-# define ISALPHA(c) (IS_REAL_ASCII (c)                 \
-                   ? (((c) >= 'a' && (c) <= 'z')       \
-                      || ((c) >= 'A' && (c) <= 'Z'))   \
-                   : alphabeticp (c))
-
-# define ISLOWER(c) lowercasep (c)
-
 # define ISPUNCT(c) (IS_REAL_ASCII (c)                         \
                    ? ((c) > ' ' && (c) < 0177                  \
                       && !(((c) >= 'a' && (c) <= 'z')          \
@@ -343,8 +335,6 @@ enum syntaxcode { Swhitespace = 0, Sword = 1, Ssymbol = 2 };
 
 # define ISSPACE(c) (SYNTAX (c) == Swhitespace)
 
-# define ISUPPER(c) uppercasep (c)
-
 # define ISWORD(c) (SYNTAX (c) == Sword)
 
 #else /* not emacs */
-- 
2.8.0.rc3.226.g39d4020

[Prev in Thread]

Current Thread

[Next in Thread]

bug#24603: [RFC 01/18] Add tests for casefiddle.c, (continued)
- bug#24603: [RFC 01/18] Add tests for casefiddle.c, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 05/18] Introduce case_character function, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 06/18] Add support for title-casing letters, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 13/18] Add some tricky Unicode characters to regex test, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties, Michal Nazarewicz, 2016/10/03
    - bug#24603: [RFC 15/18] Base lower- and upper-case tests on Unicode properties, Eli Zaretskii, 2016/10/04
  - bug#24603: [RFC 04/18] Split casify_object into multiple functions, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 03/18] Don’t assume character can be either upper- or lower-case when casing, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 12/18] Implement rules for title-casing Dutch ij ‘letter’, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 11/18] Implement casing rules for Lithuanian, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case, Michal Nazarewicz <=
    - bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case, Eli Zaretskii, 2016/10/04
    - bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case, Michal Nazarewicz, 2016/10/17
  - bug#24603: [RFC 09/18] Implement special sigma casing rule, Michal Nazarewicz, 2016/10/03
    - bug#24603: [RFC 09/18] Implement special sigma casing rule, Eli Zaretskii, 2016/10/04
  - bug#24603: [RFC 14/18] Factor out character category lookup to separate function, Michal Nazarewicz, 2016/10/03
  - bug#24603: [RFC 07/18] Split up casify_region function., Michal Nazarewicz, 2016/10/03
    - bug#24603: [RFC 07/18] Split up casify_region function., Eli Zaretskii, 2016/10/04
    - bug#24603: [RFC 07/18] Split up casify_region function., Michal Nazarewicz, 2016/10/17
  - bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data, Michal Nazarewicz, 2016/10/03
    - bug#24603: [RFC 02/18] Generate upcase and downcase tables from Unicode data, Eli Zaretskii, 2016/10/04

Prev by Date: bug#24603: [RFC 11/18] Implement casing rules for Lithuanian
Next by Date: bug#24603: [RFC 09/18] Implement special sigma casing rule
Previous by thread: bug#24603: [RFC 11/18] Implement casing rules for Lithuanian
Next by thread: bug#24603: [RFC 16/18] Refactor character class checking; optimise ASCII case
Index(es):
- Date
- Thread