Re: gutenberg-coding.el -- coding system for Project Gutenberg files
From: Kenichi Handa
Subject: Re: gutenberg-coding.el -- coding system for Project Gutenberg files
Date: Tue, 25 Oct 2005 11:11:37 +0900
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/22.0.50 (i686-pc-linux-gnu) MULE/5.0 (SAKAKI)
In article <address@hidden>, Kevin Ryde <address@hidden> writes:
> This is my go at Project Gutenberg ebook/etext coding system detection
> adapted to the emacs cvs.
> The charset names in the texts are slightly free-form and need an
> unhappy amount of massaging. "list.log" below is what I grepped out
> of all the current files (about 23000 of them).
> Some charset names are obvious typos (I reported them), but it doesn't
> hurt to allow them.
I think the code is good enough to be included in Emacs. But
it's not a bug fix, and I think not many people would benefit
from it (how many people read Gutenberg text files?). So
I'd like to ask Richard to decide whether we include it now
or postpone it.
---
Kenichi Handa
address@hidden
> 2005-10-24 Kevin Ryde <address@hidden>
> * international/mule.el (project-gutenberg-auto-coding-function): New
> function.
> (auto-coding-functions): Add it.
> [2 mule.el.gutenberg.diff <text/plain (7bit)>]
> Index: mule.el
> ===================================================================
> RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
> retrieving revision 1.227
> diff -u -c -r1.227 mule.el
> cvs server: conflicting specifications of output style
> *** mule.el 23 Oct 2005 18:24:00 -0000 1.227
> --- mule.el 24 Oct 2005 22:06:19 -0000
> ***************
> *** 1588,1594 ****
> (symbol :tag "Coding system"))))
> ;; See the bottom of this file for built-in auto coding functions.
> ! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
> sgml-html-meta-auto-coding-function)
> "A list of functions which attempt to determine a coding system.
> --- 1588,1595 ----
> (symbol :tag "Coding system"))))
> ;; See the bottom of this file for built-in auto coding functions.
> ! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
> ! sgml-xml-auto-coding-function
> sgml-html-meta-auto-coding-function)
> "A list of functions which attempt to determine a coding system.
> ***************
> *** 2204,2209 ****
> --- 2205,2307 ----
> ;;; Built-in auto-coding-functions:
> +
> + (defun project-gutenberg-auto-coding-function (size)
> + "Determine character encoding of a Project Gutenberg EBook/Etext.
> + This function is designed for use in `auto-coding-functions'.
> +
> + A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
> + subsequent \"Character set encoding:\" line. The latter gives the coding
> + system.
> +
> + Some early non-ASCII texts don't have a \"Character set encoding:\", for
> + those you have to use other Emacs mechanisms (eg. \\[universal-coding-system-argument]).
> +
> + See http://www.gutenberg.org for more about Project Gutenberg."
> +
> + (and (looking-at ".*Project Gutenberg")
> +
> + ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
> + ;; tweaked to avoid trailing spaces and \r in the match-string.
> + ;;
> + ;; Project Gutenberg files are CRLF line endings (usually) so \r is
> + ;; normal; and trailing spaces have been seen in a few files.
> + ;;
> + ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
> + ;; only 38 are non-ASCII).
> + ;;
> + (re-search-forward
> + "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
> + ;; only search first 200 lines
> + (save-excursion (forward-line 200) (point))
> + t)
> +
> + ;; The character set names are slightly free form. They're perfectly
> + ;; understandable to a human, but need some massaging to get
> + ;; something `locale-charset-to-coding-system' can handle. The stuff
> + ;; below was tested on the full set of files in 2005.
> + ;;
> + ;; Some readme.txt files have "MP3" or the like given as the
> + ;; character set, which is bogus: it refers to the existence of .mp3
> + ;; files, while the .txt is plain ascii. We let such cases get the
> + ;; warning message.
> +
> + (let* ((orig-charset (match-string 1))
> + (charset (downcase orig-charset)))
> +
> + ;; "ascii" -> "us-ascii"
> + ;; "iso-646-us (us-ascii)" -> "us-ascii"
> + (if (member charset '("ascii" "iso-646-us (us-ascii)"))
> + (setq charset "us-ascii"))
> +
> + ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
> + ;; "acii, with some iso-8859-1 characters" -> "iso-8859-1"
> + ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
> + (setq charset (replace-regexp-in-string
> + "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
> + "\\1" charset t))
> +
> + ;; "cp-1250" -> "windows-1250"
> + ;; "cp1251" -> "windows-1251"
> + ;; "codepage 1250" -> "windows-1250"
> + ;; "windows codepage 1252" -> "windows-1252"
> + ;; "windows code page 1252" -> "windows-1252"
> + (setq charset (replace-regexp-in-string
> + "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
> + "windows-" charset t t))
> +
> + ;; "unicode" alone -> "utf-8", found in 10752-8.txt
> + (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
> + charset t))
> +
> + ;; "unicode utf-8" -> "utf-8"
> + (setq charset (replace-regexp-in-string "^unicode utf" "utf"
> + charset t t))
> +
> + ;; "unicode (utf-8)" -> "utf-8"
> + (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
> + charset t))
> +
> + ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
> + (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
> +
> + ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
> + (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
> +
> + ;; "iso 8859-1 (latin-1)" -> "latin-1"
> + (setq charset (replace-regexp-in-string
> + "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
> + "\\2" charset t))
> +
> + ;; "iso=8859-1" -> "iso-8859-1"
> + ;; "big 5" -> "big-5"
> + (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
> +
> + (or (locale-charset-to-coding-system charset)
> + (progn
> + (message "Warning: unknown coding system \"%s\""
> + orig-charset)
> + nil)))))
> (defun sgml-xml-auto-coding-function (size)
> "Determine whether the buffer is XML, and if so, its encoding.
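The header search in the patch can be tried on its own. The following is a minimal sketch, assuming the start of a file is available as a string; `my-gutenberg-charset-line` is a hypothetical helper name, not part of the patch or of Emacs. It reuses the patch's regexp and its 200-line limit and returns the raw charset text before any massaging.

```elisp
;; Sketch only: `my-gutenberg-charset-line' is a hypothetical helper,
;; not part of the patch or of Emacs.
(defun my-gutenberg-charset-line (text)
  "Return the raw charset name declared in TEXT, or nil.
TEXT is the beginning of a file, given as a string."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; The first line must mention Project Gutenberg.
    (and (looking-at ".*Project Gutenberg")
         (re-search-forward
          ;; Same regexp as the patch: the match excludes trailing
          ;; spaces, tabs and \r.
          "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
          ;; Only search the first 200 lines, as the patch does.
          (save-excursion (forward-line 200) (point))
          t)
         (match-string 1))))

(my-gutenberg-charset-line
 "The Project Gutenberg EBook of Example\r\n\r\nCharacter set encoding: ISO-8859-1  \r\n")
;; => "ISO-8859-1"
```

The trailing spaces and the carriage return are left out of the match, which is the reason the regexp is more involved than a plain `\\(.*\\)`.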
> [3 list.log <text/plain (7bit)>]
> Character set encoding: ASCII eg. kimrk12.txt
> Character set encoding: ISO8859_1 eg. c1001107.txt
> Character set encoding: ACII, with some ISO-8859-1 characters eg. dvptn10.txt
> Character set encoding: ASCII eg. 10001.txt
> Character set encoding: ASCII eg. oh11v10.txt
> Character set encoding: ASCII, with 2 ISO-8859-1 characters eg. prpsl10.txt
> Character set encoding: ASCII, with a couple of ISO-8859-1 characters eg. jrcl610.txt
> Character set encoding: ASCII, with a few ISO-8859-1 characters eg. cnnet10.txt
> Character set encoding: ASCII (with a few ISO-8859-1 characters) eg. ltlbh10.txt
> Character set encoding: ASCII, with one ISO-8859-1 character eg. srhrl10.txt
> Character set encoding: ASCII, with some ISO-8859-1 characters eg. bough11.txt
> Character set encoding: ASCII, with two ISO-8859-1 characters eg. prphi10.txt
> Character set encoding: Big 5 eg. dxizi10.txt
> Character set encoding: BIG-5 eg. 8dxzj10.txt
> Character set encoding: Big5 eg. wesik10.txt
> Character set encoding: Codepage 1250 eg. sklep10.txt
> Character set encoding: CP-1250 eg. 13083-8.txt
> Character set encoding: CP-1251 eg. 14741-8.txt
> Character set encoding: CP-1252 eg. 12732-8.txt
> Character set encoding: CP1251 eg. 11292-8.txt
> Character set encoding: cp1251 eg. kknta10.txt
> Character set encoding: CP1252 eg. 8ledo10.txt
> Character set encoding: EUC-KR eg. kedct10.txt
> Character set encoding: IDO-8859-1 eg. 10549-8.txt
> Character set encoding: ISO-646-US (US-ASCII) eg. 107.txt
> Character set encoding: ISO-8858-1 eg. 10439-8.txt
> Character set encoding: ISO 8859-1 eg. 8bld410.txt
> Character set encoding: ISO-8859-1 eg. 10002-8.txt
> Character set encoding: iso-8859-1 eg. 10429-8.txt
> Character set encoding: ISO=8859-1 eg. 7fool10.txt
> Character set encoding: ISO 8859-1 (Latin-1) eg. 8adio10.txt
> Character set encoding: iso-8859-15 eg. 8dlrm10.txt
> Character set encoding: ISO-8859-2 eg. rnpz810.txt
> Character set encoding: ISO Latin-1 eg. 10056-8.txt
> Character set encoding: ISO-LATIN-1 eg. 8nggd10.txt
> Character set encoding: ISO-Latin-1 eg. hstrd10.txt
> Character set encoding: iso-Latin-1 eg. 8wpwl10.txt
> Character set encoding: iso-latin-1 eg. 8engl10.txt
> Character set encoding: ISO8859-1 eg. 7bjrn10.txt
> Character set encoding: ISO8859_1 eg. a1001107.txt
> Character set encoding: KOI8-R eg. ktria10.txt
> Character set encoding: Latin 1 eg. divrw10.txt
> Character set encoding: Latin-1 eg. 8dawn10.txt
> Character set encoding: Latin-4 eg. kalev10.txt
> Character set encoding: Latin1 eg. 10347-8.txt
> Character set encoding: MP3 eg. 10348-m-readme.txt
> Character set encoding: MPEG eg. atomi10m-readme.txt
> Character set encoding: MPEG Layer 3 (MP3) eg. 1donq3-readme.txt
> Character set encoding: Unicode eg. 10752-8.txt
> Character set encoding: Unicode (UTF-8) eg. orama10u.txt
> Character set encoding: Unicode UTF-8 eg. 11753-0.txt
> Character set encoding: US-ASCII eg. 10078.txt
> Character set encoding: US-ASCII, MIDI, Lilypond, MP3 and TeX eg. 10535.txt
> Character set encoding: UTF-16 eg. 13083-utf16.txt
> Character set encoding: UTF-7 eg. 8cart10.txt
> Character set encoding: UTF-8 eg. 10140-0.txt
> Character set encoding: utf-8 eg. astrl10.txt
> Character set encoding: UTF8 eg. 8gslt10.txt
> Character set encoding: Windows-1250 eg. 15201-8.txt
> Character set encoding: Windows 1251 eg. olavg10.txt
> Character set encoding: Windows-1252 eg. 8clcn10.txt
> Character set encoding: Windows Code Page 1252 eg. 8tjna10.txt
> Character set encoding: Windows Codepage 1252 eg. 8vepi10.txt
> Character set encoding: Windows1253 eg. orama10.txt
> Chatacter set encoding: ISO-8859-1 eg. 10021-8.txt
> Chatacter set encoding: iso-8859-1 eg. 10026-8.txt
> Chatacter set encoding: MP3 eg. 10137-m-readme.txt
> Chatacter set encoding: Sibelius 3 SIB format and MP3 audio eg. 10344-readme.txt
> Chatacter set encoding: US-ASCII eg. 10021.txt
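To see how the massaging in the patch treats names like those in list.log, the steps can be collected into one table. This is a minimal sketch: `my-gutenberg-normalize-charset` is a hypothetical name, the regexps and their order are copied from the patch, and the final `locale-charset-to-coding-system` lookup is left out because its result depends on the Emacs version.

```elisp
;; Sketch only: `my-gutenberg-normalize-charset' is a hypothetical name;
;; the rules below are copied from the patch, in the same order.
(defun my-gutenberg-normalize-charset (name)
  "Return NAME massaged toward a conventional charset name."
  (let ((charset (downcase name))
        ;; Each rule is (REGEXP REPLACEMENT LITERAL).
        (rules '(("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
                 ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
                  "windows-" t)
                 ("^unicode\r?$" "utf-8" nil)
                 ("^unicode utf" "utf" t)
                 ("^unicode (\\(.*\\))$" "\\1" nil)
                 ("8858" "8859" t)
                 ("^ido-" "iso-" t)
                 ("^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$" "\\2" nil)
                 ("[= ]" "-" t))))
    ;; "ascii" and "iso-646-us (us-ascii)" both mean "us-ascii".
    (if (member charset '("ascii" "iso-646-us (us-ascii)"))
        (setq charset "us-ascii"))
    (dolist (rule rules)
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))
    charset))

(my-gutenberg-normalize-charset "Windows Code Page 1252")
;; => "windows-1252"
(my-gutenberg-normalize-charset "ACII, with some ISO-8859-1 characters")
;; => "iso-8859-1"
```

Names such as "MP3" pass through unchanged, so in the patch they fail the coding system lookup and produce the warning message.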