emacs-devel

Re: commit-msg hook


From: Eli Zaretskii
Subject: Re: commit-msg hook
Date: Tue, 14 Apr 2015 21:01:45 +0300

> Date: Tue, 14 Apr 2015 10:42:53 -0700
> From: Paul Eggert <address@hidden>
> CC: address@hidden
> 
> How about this idea?  Before falling back to the unibyte regular 
> expressions in awk, set LC_ALL='C' in the environment.  This should work 
> well enough, as in practice all environments where the C locale is 
> multibyte have working UTF-8 so they won't need to fall back to unibyte 
> anyway.

You mean, like below?

--- ./.git/hooks/commit-msg.~5~ 2015-04-12 19:11:27.481125000 +0300
+++ ./.git/hooks/commit-msg     2015-04-14 21:01:14.481125000 +0300
@@ -37,6 +37,8 @@
   at_sign=`LC_ALL=en_US.UTF-8 $awk "$print_at_sign" </dev/null 2>/dev/null`
   if test "$at_sign" = @; then
     LC_ALL=en_US.UTF-8; export LC_ALL
+  else
+    LC_ALL=C; export LC_ALL
   fi
 fi
 
@@ -45,10 +47,13 @@
   BEGIN {
     # These regular expressions assume traditional Unix unibyte behavior.
     # They are needed for old or broken versions of awk, e.g.,
-    # mawk 1.3.3 (1996), or gawk on MSYS (2015).
+    # mawk 1.3.3 (1996), or gawk on MSYS (2015), and/or for systems that
+    # cannot use UTF-8 as the codeset for the locale.
     space = "[ \f\n\r\t\v]"
     non_space = "[^ \f\n\r\t\v]"
-    non_print = "[\1-\37\177]"
+    # The non_print below rejects control characters and surrogates
+    # UTF-8 for: 0x01-0x1f 0x7f   0x80-0x9f    0xd800-0xdbff     0xdc00-0xdfff
+    non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
 
     # Prefer POSIX regular expressions if available, as they do a
     # better job of checking.  Similarly, prefer POSIX negated
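
For illustration, the locale selection in the first hunk can be sketched as a standalone function. This is only a sketch under assumptions: the name choose_awk_locale and the length() probe are hypothetical stand-ins, since the hook's own $print_at_sign program is not shown in this message.

```shell
#!/bin/sh
# Hypothetical helper, not part of the hook: print the locale that awk
# should run under.  The optional first argument names the awk to probe.
choose_awk_locale () {
  probe_awk=${1-awk}
  # Assumed probe: in a working UTF-8 locale a multibyte-aware awk counts
  # the two bytes \303\251 (U+00E9) as one character; a unibyte awk, or a
  # missing locale, yields 2 or nothing at all.
  len=`LC_ALL=en_US.UTF-8 $probe_awk 'BEGIN { print length("\303\251") }' </dev/null 2>/dev/null`
  if test "$len" = 1; then
    echo en_US.UTF-8
  else
    # Fall back to the C locale so the unibyte regular expressions
    # see individual bytes.
    echo C
  fi
}
```

A caller would then do something like LC_ALL=`choose_awk_locale`; export LC_ALL before running the awk script.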

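As a rough sanity check of the new unibyte non_print expression, one could run it under LC_ALL=C against a few byte sequences. The function has_non_print below is a hypothetical test helper, not something in the hook, and it assumes an awk that treats bytes above 0177 as ordinary characters in the C locale.

```shell
#!/bin/sh
# Hypothetical test helper: print "reject" if the line contains a control
# character or a UTF-8-encoded surrogate, "ok" otherwise.  The argument
# is used as a printf format, so octal escapes like \302\205 become raw
# bytes (and it must not contain a percent sign).
has_non_print () {
  printf "$1\n" | LC_ALL=C awk '
    BEGIN {
      # Same expression as in the patch: C0 controls and DEL, then the
      # UTF-8 bytes for U+0080..U+009F and for U+D800..U+DFFF.
      non_print = "[\1-\37\177]|\302[\200-\237]|\355([\240-\257]|[\260-\277])[\200-\277]"
    }
    {
      if ($0 ~ non_print)
        print "reject"
      else
        print "ok"
    }
  '
}
```

For instance, has_non_print '\302\205' feeds the UTF-8 bytes of U+0085, and has_non_print '\355\240\200' feeds those of the surrogate U+D800; ordinary text such as 'caf\303\251' should pass through untouched.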

