[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
normalize.el
From: |
Kenichi Handa |
Subject: |
normalize.el |
Date: |
Thu, 02 Apr 2009 15:40:44 +0900 |
The attached is an Unicode normalization tool contributed by
Kawabata-san. It performs all the Unicode normalization
NFC/NFD/NFKD/NFKC, and provides a coding system utf-8-hfs
that is suitable to be used for Mac OS 8.1's file names.
The assignment paper from Kawabata-san is already arrived at
FSF.
Perhaps committing it to the trunk now is not good
considering that we are already pretesting for 23.1. What
should I do with it? At least, I want Mac users to test it
by setting the default-file-name-coding-system to
`utf-8-hfs'.
---
Kenichi Handa
address@hidden
;;; normalize.el --- Unicode normalization NFC/NFD/NFKD/NFKC
;; Copyright (C) 2009
;; Free Software Foundation, Inc.
;; Author: Taichi Kawabata <address@hidden>
;; Maintainer: Taichi Kawabata <address@hidden>
;; Keywords: unicode, normalization
;; This file is part of GNU Emacs.
;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.
;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;; GNU General Public License for more details.
;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.
;;; Commentary:
;; The following code will normalize the string or text according to
;; the NFC/NFD/NFKD (single step) NFKC (dual step).
;; This program has passed the NormalizationTest-4.1.0.txt except
;; Hangul normalizations.
;; References:
;; http://www.unicode.org/reports/tr15/
;; http://www.unicode.org/review/pr-29.html
;; Furthermore, it supports HFS normalization, for people who don't
;; want to normalize the CJK characters.
;;; Note on HFS Normalization.
;;
;; HFS-Normalization is the way to normalize text, as described by HFS
;; Plus Volue format specification
;; (http://developer.apple.com/technotes/tn/tn1150.html). This is
;; useful espcially for CJK people, since it doesn't normlize CJK
;; compatibility characters.
;;
;; Normalization Exclusion Area:
;; U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc.
;; U+0F900 .. U+0FAFF :: CJK compatibility Ideographs.
;; U+2F800 .. U+2FFFF :: CJK compatibility Ideographs.
;;;
;;; Implementation Notes on NFC/HFS-NFC. (ccc = canonical combining class)
;;
;; A. `re-search-forward' for the following characters.
;; (1) characters that should be decomposed anyway.
;; (1-a) composition-exclusion characters
;; (1-b) singleton characters
;; (1-c) characters whose decomposed first char does not have ccc=0.
;; (2) characters that may appear as second of composed character.
;; (2-a) character whose ccc != 0.
;; (2-b) character whose ccc = 0.
;;
;; B. In case of (1) in step A, translate character, go to the
;; beginning position of translated region, and go back to step A.
;;
;; C. In case of (2) in step A, normalize the block around the
;; searched character. (`normalize-block')
;;
;; (1) start of the block
;; In case of (2-a) in step A, if the previous character
;; exists, then the beginning of the block is the previous
;; character. In case of (2-b), if the previous character's
;; ccc = 0, then the beginning of the block is the previous
;; character. Otherwise, the beginning of the block is the
;; searched character.
;; (2) end of the block
;; from the searched character, the end of the block is
;; searched until the character whose ccc=0.
;;
;; D. Normalize the block
;; (1) decompose the block
;; (2) sort the combining characters
;; (3) compose the block.
;;;
;;; Implementation Notes on NFD/HFS-NFD.
;;
;; A. `re-search-forward' for the combining characters sequence.
;;
;; B. Normalize the block around the searched character sequence.
;; (1) Start of the block is the previous character if it exists.
;; If not, start of the block is the searched character.
;; (2) End of the block is the end of the combining character
;; sequence.
;;
;; C. Normalize the block
;; (1) decompose the block
;; (2) sort the combining characters
;;
;; D. Translate all the rest at once.
;;;
;;; Implementation Notes on NFKC/NFKD
;;
;; NFKC/NFKD is realized as applying NFC/NFD to the text which is
;; firstly translated by NFKD(-unique) decomposition patterns.
;;; Code:
(defconst normalize-composition-exclusions
'(#x0958 #x0959 #x095A #x095B #x095C #x095D #x095E #x095F
#x09DC #x09DD #x09DF #x0A33 #x0A36 #x0A59 #x0A5A #x0A5B
#x0A5E #x0B5C #x0B5D #x0F43 #x0F4D #x0F52 #x0F57 #x0F5C
#x0F69 #x0F76 #x0F78 #x0F93 #x0F9D #x0FA2 #x0FA7 #x0FAC
#x0FB9 #xFB1D #xFB1F #xFB2A #xFB2B #xFB2C #xFB2D #xFB2E
#xFB2F #xFB30 #xFB31 #xFB32 #xFB33 #xFB34 #xFB35 #xFB36
#xFB38 #xFB39 #xFB3A #xFB3B #xFB3C #xFB3E #xFB40 #xFB41
#xFB43 #xFB44 #xFB46 #xFB47 #xFB48 #xFB49 #xFB4A #xFB4B
#xFB4C #xFB4D #xFB4E #x2ADC #x1D15E #x1D15F #x1D160 #x1D161
#x1D162 #x1D163 #x1D164 #x1D1BB #x1D1BC #x1D1BD #x1D1BE
#x1D1BF #x1D1C0)
"Composition Exclusion Table.
The table is taken from
http://www.unicode.org/Public/UNIDATA/CompositionExclusions-4.1.0.txt")
;;
(eval-when-compile
;; generics
(setq combining-chars nil)
(setq decomp-pair-to-comp-hash (make-hash-table :test 'equal :size 2000)) ;
1003 in Unicode 4.1.0 (except hangul)
;; NFC/NFD
(defun nfc-predicate (char) t)
(setq nfc-search-chars nil)
(setq nfc-singletons nil)
(setq nfd-alist nil)
;; HFS-NFC/HFS-NFD
(defun hfs-nfc-predicate (char)
(or (and (>= char 0) (< char #x2000))
(and (>= char #x3000) (< char #xf900))
(and (>= char #xfb00) (< char #x2f800))
(>= char #x30000)))
(setq hfs-nfc-search-chars nil)
(setq hfs-nfc-singletons nil)
(setq hfs-nfd-alist nil)
;; NFKC/NFKD
(setq nfkd-alist nil)
;; macros
(defmacro register-database (predicate
search-chars
singletons
decomposition-alist)
`(when (apply #',predicate (list char))
(let ((decomp-1st-ccc
(get-char-code-property
(car decomposition) 'canonical-combining-class)))
(setq ,decomposition-alist (cons (cons char (vconcat decomposition))
,decomposition-alist))
(unless (memq char normalize-composition-exclusions)
(if (= 1 (length decomposition))
(setq ,singletons (cons char ,singletons))
;; add second char of decomposition for search-chars iff ccc = 0
(if (= 0 (get-char-code-property (cadr decomposition)
'canonical-combining-class))
(add-to-list ',search-chars (cadr decomposition))))
;; For fast normalization, the case of positive integer
;; `decomp-1st-ccc' means that the char should also be
;; considered as something like singleton.
(if (and decomp-1st-ccc (/= 0 decomp-1st-ccc))
(setq ,singletons (cons char ,singletons)))))))
(let ((char 0) ccc decomposition)
(while (< char #x30000)
(setq ccc (get-char-code-property
char 'canonical-combining-class))
(setq decomposition (get-char-code-property
char 'decomposition))
;; add all combining-char to fast search list.
(if (and ccc (/= 0 ccc)) (add-to-list 'combining-chars char))
(when decomposition
(if (symbolp (car decomposition))
;; compatibiliy
(setq nfkd-alist (cons (cons char (vconcat (cdr decomposition)))
nfkd-alist))
;; canonical
(setq nfkd-alist (cons (cons char (vconcat decomposition))
nfkd-alist))
(if (and (= 2 (length decomposition))
(null (memq char normalize-composition-exclusions)))
(puthash decomposition char decomp-pair-to-comp-hash))
;; NFC/NFD
(register-database nfc-predicate
nfc-search-chars
nfc-singletons
nfd-alist)
;; HFS-NFC/HFS-NFD
(register-database hfs-nfc-predicate
hfs-nfc-search-chars
hfs-nfc-singletons
hfs-nfd-alist)))
(setq char (1+ char))))
(defun normalize-repetitively-expand-alist (alist)
(let ((flag t) elem hit)
(while flag
(setq flag nil)
(dolist (entry alist)
(setq elem (mapcar (lambda (x)
(setq hit (assoc x alist))
(if hit (setq flag t))
(if hit (mapcar 'identity (cdr hit)) (list x)))
(cdr entry)))
(if flag (setcdr entry (vconcat (apply 'append elem))))))))
(normalize-repetitively-expand-alist nfd-alist)
(normalize-repetitively-expand-alist hfs-nfd-alist)
(normalize-repetitively-expand-alist nfkd-alist)
)
(defvar normalize-combining-chars-regexp nil
"Regular expression to match sequence of combining characters.")
(setq normalize-combining-chars-regexp
(eval-when-compile (concat (regexp-opt (mapcar 'char-to-string
combining-chars)) "+")))
(defvar normalize-decomp-pair-to-comp-hash nil
"Hashtable of decomposed pair to primary composite.")
(setq normalize-decomp-pair-to-comp-hash
(eval-when-compile decomp-pair-to-comp-hash))
;; NFC/NFD
(defvar normalize-nfc-search-regexp nil)
(setq normalize-nfc-search-regexp
(eval-when-compile (regexp-opt
(append (mapcar 'char-to-string nfc-search-chars)
(mapcar 'char-to-string combining-chars)
(mapcar 'char-to-string
normalize-composition-exclusions)
(mapcar 'char-to-string nfc-singletons)))))
(defvar normalize-nfc-translation-chars nil)
(setq normalize-nfc-translation-chars
(eval-when-compile (append normalize-composition-exclusions nfc-singletons)))
(define-translation-table 'normalize-nfd-table
(eval-when-compile (make-translation-table-from-alist nfd-alist)))
;; HFS-NFC/NFD
(defvar normalize-hfs-nfc-search-regexp nil)
(setq normalize-hfs-nfc-search-regexp
(eval-when-compile (regexp-opt
(append (mapcar 'char-to-string hfs-nfc-search-chars)
(mapcar 'char-to-string combining-chars)
(mapcar 'char-to-string
normalize-composition-exclusions)
(mapcar 'char-to-string hfs-nfc-singletons)))))
(defvar normalize-hfs-nfc-translation-chars nil)
(setq normalize-hfs-nfc-translation-chars
(eval-when-compile (append normalize-composition-exclusions
hfs-nfc-singletons)))
(define-translation-table 'normalize-hfs-nfd-table
(eval-when-compile (make-translation-table-from-alist hfs-nfd-alist)))
;; NFKC/NFKD
(define-translation-table 'normalize-nfkd-table
(eval-when-compile (make-translation-table-from-alist nfkd-alist)))
;;------------------------------------------------------------------------------------------
;; Normalize local region.
(defun normalize-block (from to &optional translation-table compose)
;; block-string = [starter? diacritics+]
(save-restriction
(narrow-to-region from to)
(goto-char (point-min))
;; decompose if needed.
(if translation-table
(translate-region from to translation-table))
(let ((starter (char-after from)) diacritics)
(if (= (get-char-code-property starter 'canonical-combining-class) 0)
(setq diacritics (string-to-list (buffer-substring (1+ from)
(point-max))))
(setq starter nil)
(setq diacritics (string-to-list (buffer-substring from (point-max)))))
;; sort diacritical marks if needed.
(if (< 1 (length diacritics))
(setq diacritics
(sort diacritics
(lambda (ch1 ch2)
(< (get-char-code-property ch1
'canonical-combining-class)
(get-char-code-property ch2
'canonical-combining-class))))))
;; compose if needed (and both starter and diacritics exists).
(if (and compose (and starter diacritics))
(let ((diac diacritics) prev (prev-ccc 0) ccc prim-comp)
(while diac
(setq ccc (get-char-code-property (car diac)
'canonical-combining-class))
(if (and (or (< prev-ccc ccc) (= ccc 0)) ;; e.g. #x09c7 #x09be ->
#x09cb
(setq prim-comp (gethash (list starter (car diac))
normalize-decomp-pair-to-comp-hash)))
(progn
(setq prev-ccc 0)
(setq starter prim-comp)
(if prev (setcdr prev (cdr diac)) (setq diacritics (cdr
diacritics)))
(setq diac diacritics))
(setq prev-ccc ccc
prev diac
diac (cdr diac))))))
;; replace buffer
(delete-region (point-min) (point-max))
(insert (concat (if starter (list starter)) diacritics)))
(point-max))) ;; return the last position.
(defun normalize-composition-region (from to regexp translation-table
translation-chars
&optional compat)
(save-excursion
(save-restriction
(narrow-to-region from to)
(if compat (translate-region from to 'normalize-nfkd-table))
(goto-char (point-min))
(let (start-pos starter)
(while (re-search-forward regexp nil t)
(setq starter (string-to-char (match-string 0)))
(setq start-pos (match-beginning 0))
(if (memq starter translation-chars)
(progn
(translate-region start-pos (match-end 0) translation-table)
(goto-char start-pos))
(goto-char
(normalize-block
;; from
(if (or (= start-pos (point-min))
(and (= 0 (get-char-code-property starter
'canonical-combining-class))
(/= 0 (get-char-code-property (char-after (1-
start-pos))
'canonical-combining-class))))
start-pos (1- start-pos))
;; to
(if (looking-at normalize-combining-chars-regexp)
(match-end 0) (1+ start-pos))
'normalize-nfd-table t))))))))
(defun normalize-decomposition-region (from to translation-table)
(save-excursion
(save-restriction
(narrow-to-region from to)
(goto-char (point-min))
(let (start-pos starter)
(while (re-search-forward normalize-combining-chars-regexp nil t)
(setq starter (string-to-char (match-string 0)))
(setq start-pos (match-beginning 0))
(goto-char
(normalize-block
(if (= start-pos (point-min)) start-pos (1- start-pos))
(match-end 0)
translation-table nil))))
(translate-region (point-min) (point-max) translation-table))))
;;
--------------------------------------------------------------------------------
;;;###autoload
(defun normalize-NFC-region (from to)
"Normalize the current region by the Unicode NFC."
(interactive "r")
(normalize-composition-region from to
normalize-nfc-search-regexp
'normalize-nfd-table
normalize-nfc-translation-chars))
;;;###autoload
(defun normalize-NFC-string (str)
"Normalize the string STR by the Unicode NFC."
(with-temp-buffer
(insert str)
(normalize-NFC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFD-region (from to)
"Normalize the current region by the Unicode NFD."
(interactive "r")
(normalize-decomposition-region from to 'normalize-nfd-table))
;;;###autoload
(defun normalize-NFD-string (str)
"Normalize the string STR by the Unicode NFD."
(with-temp-buffer
(insert str)
(normalize-NFD-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-HFS-NFC-region (from to)
"Normalize the current region by the Unicode NFC and Mac OS's HFS Plus."
(interactive "r")
(normalize-composition-region from to
normalize-hfs-nfc-search-regexp
'normalize-hfs-nfd-table
normalize-hfs-nfc-translation-chars))
;;;###autoload
(defun normalize-HFS-NFC-string (str)
"Normalize the string STR by the Unicode NFC and Mac OS's HFS Plus."
(with-temp-buffer
(insert str)
(normalize-HFS-NFC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-HFS-NFD-region (from to)
"Normalize the current region by the Unicode NFD and Mac OS's HFS Plus."
(interactive "r")
(normalize-decomposition-region from to 'normalize-hfs-nfd-table))
;;;###autoload
(defun normalize-HFS-NFD-string (str)
"Normalize the string STR by the Unicode NFD and Mac OS's HFS Plus."
(with-temp-buffer
(insert str)
(normalize-HFS-NFD-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFKC-region (from to)
"Normalize the current region by the Unicode NFKC."
(interactive "r")
(normalize-composition-region from to
normalize-nfc-search-regexp
'normalize-nfd-table
normalize-nfc-translation-chars t))
;;;###autoload
(defun normalize-NFKC-string (str)
"Normalize the string STR by the Unicode NFKC."
(with-temp-buffer
(insert str)
(normalize-NFKC-region (point-min) (point-max))
(buffer-string)))
;;;###autoload
(defun normalize-NFKD-region (from to)
"Normalize the current region by the Unicode NFKD."
(interactive "r")
(normalize-decomposition-region from to 'normalize-nfkd-table))
;;;###autoload
(defun normalize-NFKD-string (str)
"Normalize the string STR by the Unicode NFKD."
(with-temp-buffer
(insert str)
(normalize-NFKD-region (point-min) (point-max))
(buffer-string)))
;; Post-read-conversion function for `utf-8-hfs'.
(defun normalize-hfs-nfd-post-read-conversion (len)
(save-excursion
(save-restriction
(narrow-to-region (point) (+ (point) len))
(let ((buffer-modified-p (buffer-modified-p)))
(normalize-hfs-NFC-region (point-min) (point-max))
(- (point-max) (point-min))))))
;;; coding-system definition
(define-coding-system 'utf-8-hfs
"UTF-8 base coding system with normalization on decoding.
The characters in HFS normalization exclusion area are not normalized.
On encoding, don't perform normalization."
:coding-type 'utf-8
:mnemonic ?U
:charset-list '(unicode)
;; NFD decoder
;; :decode-translation-table (not necessary)
:post-read-conversion 'normalize-hfs-nfd-post-read-conversion
;; NFD encoder is not necessary because MacOS will automatically do it.
;; :encode-translation-table 'nfd-encode-translation-table
;; :pre-write-conversion 'nfd-encode-function
)
;;; normalize.el ends here
- normalize.el,
Kenichi Handa <=
Re: normalize.el, Richard M Stallman, 2009/04/03