Re: gutenberg-coding.el -- coding system for Project Gutenberg files

emacs-devel

From:	Kenichi Handa
Subject:	Re: gutenberg-coding.el -- coding system for Project Gutenberg files
Date:	Tue, 25 Oct 2005 11:11:37 +0900
User-agent:	SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/22.0.50 (i686-pc-linux-gnu) MULE/5.0 (SAKAKI)

In article <address@hidden>, Kevin Ryde <address@hidden> writes:

> [1  <text/plain (7bit)>]
> This is my go at Project Gutenberg ebook/etext coding system detection
> adapted to the emacs cvs.

> The charset names in the texts are slightly free-form and need an
> unhappy amount of massaging.  "list.log" below is what I grepped out
> of all the current files (about 23000 of them).

> Some charset names are obvious typos (I reported them), but it doesn't
> hurt to allow them.

I think the code is good enough to be included in Emacs.  But
it's not a bug fix, and I don't think many people would
benefit from it (how many people read Project Gutenberg text
files?).  So I'd like to ask Richard to decide whether we
include it now or postpone it.

---
Kenichi Handa
address@hidden

> 2005-10-24  Kevin Ryde  <address@hidden>

>         * international/mule.el (project-gutenberg-auto-coding-function): New
>         function.
>         (auto-coding-functions): Add it.

> [2 mule.el.gutenberg.diff <text/plain (7bit)>]
> Index: mule.el
> ===================================================================
> RCS file: /cvsroot/emacs/emacs/lisp/international/mule.el,v
> retrieving revision 1.227
> diff -u -c -r1.227 mule.el
> cvs server: conflicting specifications of output style
> *** mule.el   23 Oct 2005 18:24:00 -0000      1.227
> --- mule.el   24 Oct 2005 22:06:19 -0000
> ***************
> *** 1588,1594 ****
>                      (symbol :tag "Coding system"))))
  
>   ;; See the bottom of this file for built-in auto coding functions.
> ! (defcustom auto-coding-functions '(sgml-xml-auto-coding-function
>                                  sgml-html-meta-auto-coding-function)
>     "A list of functions which attempt to determine a coding system.
  
> --- 1588,1595 ----
>                      (symbol :tag "Coding system"))))
  
>   ;; See the bottom of this file for built-in auto coding functions.
> ! (defcustom auto-coding-functions '(project-gutenberg-auto-coding-function
> !                                sgml-xml-auto-coding-function
>                                  sgml-html-meta-auto-coding-function)
>     "A list of functions which attempt to determine a coding system.
  
> ***************
> *** 2204,2209 ****
> --- 2205,2307 ----
  
  
>   ;;; Built-in auto-coding-functions:
> + 
> + (defun project-gutenberg-auto-coding-function (size)
> +   "Determine character encoding of a Project Gutenberg EBook/Etext.
> + This function is designed for use in `auto-coding-functions'.
> + 
> + A Project Gutenberg text has \"Project Gutenberg\" in the first line, and a
> + subsequent \"Character set encoding:\" line.  The latter gives the coding
> + system.
> + 
> + Some early non-ASCII texts don't have a \"Character set encoding:\", for
> + those you have to use other Emacs mechanisms (eg. \\[universal-coding-system-argument]).
> + 
> + See http://www.gutenberg.org for more about Project Gutenberg."
> + 
> +   (and (looking-at ".*Project Gutenberg")
> + 
> +        ;; The regexp here is "^Cha[rt]acter set encoding: *\\(.*\\)", except
> +        ;; tweaked to avoid trailing spaces and \r in the match-string.
> +        ;;
> +        ;; Project Gutenberg files are CRLF line endings (usually) so \r is
> +        ;; normal; and trailing spaces have been seen in a few files.
> +        ;;
> +        ;; "Chatacter" is a typo seen in about 220 files as of 2005 (though
> +        ;; only 38 are non-ASCII).
> +        ;;
> +        (re-search-forward
> +         "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
> +         ;; only search first 200 lines
> +         (save-excursion (forward-line 200) (point))
> +         t)
> + 
> +        ;; The character set names are slightly free form.  They're perfectly
> +        ;; understandable to a human, but need some massaging to get
> +        ;; something `locale-charset-to-coding-system' can handle.  The stuff
> +        ;; below was tested on the full set of files in 2005.
> +        ;;
> +        ;; Some readme.txt files have "MP3" or the like given as the
> +        ;; character set, which is bogus, it refers to the existence of .mp3
> +        ;; files, the .txt is plain ascii.  We let such cases get the warning
> +        ;; message.
> + 
> +        (let* ((orig-charset (match-string 1))
> +               (charset      (downcase orig-charset)))
> + 
> +          ;; "ascii"                 -> "us-ascii"
> +          ;; "iso-646-us (us-ascii)" -> "us-ascii"
> +          (if (member charset '("ascii" "iso-646-us (us-ascii)"))
> +              (setq charset "us-ascii"))
> + 
> +          ;; "ascii, with a few iso-8859-1 characters" etc -> "iso-8859-1"
> +          ;; "acii, with some iso-8859-1 characters"       -> "iso-8859-1"
> +          ;; the "acii" is a typo in dvptn10.txt, easy enough to allow it
> +          (setq charset (replace-regexp-in-string
> +                         "^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*"
> +                         "\\1" charset t))
> + 
> +          ;; "cp-1250"                -> "windows-1250"
> +          ;; "cp1251"                 -> "windows-1251"
> +          ;; "codepage 1250"          -> "windows-1250"
> +          ;; "windows codepage 1252"  -> "windows-1252"
> +          ;; "windows code page 1252" -> "windows-1252"
> +          (setq charset (replace-regexp-in-string
> +                         "^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
> +                         "windows-" charset t t))
> + 
> +          ;; "unicode" alone -> "utf-8", found in 10752-8.txt
> +          (setq charset (replace-regexp-in-string "^unicode\r?$" "utf-8"
> +                                                  charset t))
> + 
> +          ;; "unicode utf-8" -> "utf-8"
> +          (setq charset (replace-regexp-in-string "^unicode utf" "utf"
> +                                                  charset t t))
> + 
> +          ;; "unicode (utf-8)" -> "utf-8"
> +          (setq charset (replace-regexp-in-string "^unicode (\\(.*\\))$" "\\1"
> +                                                  charset t))
> + 
> +          ;; "iso-8858-1" -> "iso-8859-1", typo in 10439-8.txt
> +          (setq charset (replace-regexp-in-string "8858" "8859" charset t t))
> + 
> +          ;; "ido-8859-1" -> "iso-8859-1", typo in 10549-8.txt
> +          (setq charset (replace-regexp-in-string "^ido-" "iso-" charset t t))
> + 
> +          ;; "iso 8859-1 (latin-1)" -> "latin-1"
> +          (setq charset (replace-regexp-in-string
> +                         "^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$"
> +                         "\\2" charset t))
> + 
> +          ;; "iso=8859-1" -> "iso-8859-1"
> +          ;; "big 5"      -> "big-5"
> +          (setq charset (replace-regexp-in-string "[= ]" "-" charset t t))
> + 
> +          (or (locale-charset-to-coding-system charset)
> +              (progn
> +                (message "Warning: unknown coding system \"%s\""
> +                         orig-charset)
> +                nil)))))
  
>   (defun sgml-xml-auto-coding-function (size)
>     "Determine whether the buffer is XML, and if so, its encoding.
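
As a rough illustration of the header search in the patch above, the
sketch below runs the same two regexps over a string in a temporary
buffer.  The function name `pg-demo-find-charset-line' and the sample
text are made up for this example; only the regexps and the 200-line
bound are taken from the patch.

```elisp
;; Hypothetical demo helper, not part of the patch.
(defun pg-demo-find-charset-line (text)
  "Return the raw charset name found in TEXT, or nil.
TEXT is treated like the start of a file: its first line must contain
\"Project Gutenberg\", and a \"Character set encoding:\" line must
appear within the first 200 lines."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (and (looking-at ".*Project Gutenberg")
         (re-search-forward
          ;; Same regexp as the patch: group 1 excludes trailing
          ;; spaces, tabs and \r.
          "^Cha[rt]acter set encoding:[ \t\r]*\\(\\([ \t\r]*[^ \t\r\n]+\\)*\\)"
          (save-excursion (forward-line 200) (point))
          t)
         (match-string 1))))

;; CRLF line endings and trailing spaces are left out of the match.
(pg-demo-find-charset-line
 "The Project Gutenberg EBook of Foo\r\n\r\nCharacter set encoding: ISO 8859-1 (Latin-1)  \r\n")
;; => "ISO 8859-1 (Latin-1)"
```

A text whose first line lacks "Project Gutenberg" gives nil here, which
corresponds to the patch's function declining and leaving the decision
to the next entry in `auto-coding-functions'.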
> [3 list.log <text/plain (7bit)>]
> Character set encoding:  ASCII                  eg. kimrk12.txt
> Character set encoding:  ISO8859_1              eg. c1001107.txt
> Character set encoding: ACII, with some ISO-8859-1 characters
>                                                 eg. dvptn10.txt
> Character set encoding: ASCII                   eg. 10001.txt
> Character set encoding: ASCII                     
>                                                 eg. oh11v10.txt
> Character set encoding: ASCII, with 2 ISO-8859-1 characters
>                                                 eg. prpsl10.txt
> Character set encoding: ASCII, with a couple of ISO-8859-1 characters
>                                                 eg. jrcl610.txt
> Character set encoding: ASCII, with a few ISO-8859-1 characters
>                                                 eg. cnnet10.txt
> Character set encoding: ASCII (with a few ISO-8859-1 characters)
>                                                 eg. ltlbh10.txt
> Character set encoding: ASCII, with one ISO-8859-1 character
>                                                 eg. srhrl10.txt
> Character set encoding: ASCII, with some ISO-8859-1 characters
>                                                 eg. bough11.txt
> Character set encoding: ASCII, with two ISO-8859-1 characters
>                                                 eg. prphi10.txt
> Character set encoding: Big 5                   eg. dxizi10.txt
> Character set encoding: BIG-5                   eg. 8dxzj10.txt
> Character set encoding: Big5                    eg. wesik10.txt
> Character set encoding: Codepage 1250           eg. sklep10.txt
> Character set encoding: CP-1250                 eg. 13083-8.txt
> Character set encoding: CP-1251                 eg. 14741-8.txt
> Character set encoding: CP-1252                 eg. 12732-8.txt
> Character set encoding: CP1251                  eg. 11292-8.txt
> Character set encoding: cp1251                  eg. kknta10.txt
> Character set encoding: CP1252                  eg. 8ledo10.txt
> Character set encoding: EUC-KR                  eg. kedct10.txt
> Character set encoding: IDO-8859-1              eg. 10549-8.txt
> Character set encoding: ISO-646-US (US-ASCII)   eg. 107.txt
> Character set encoding: ISO-8858-1              eg. 10439-8.txt
> Character set encoding: ISO 8859-1              eg. 8bld410.txt
> Character set encoding: ISO-8859-1              eg. 10002-8.txt
> Character set encoding: iso-8859-1              eg. 10429-8.txt
> Character set encoding: ISO=8859-1              eg. 7fool10.txt
> Character set encoding: ISO 8859-1 (Latin-1)    eg. 8adio10.txt
> Character set encoding: iso-8859-15             eg. 8dlrm10.txt
> Character set encoding: ISO-8859-2              eg. rnpz810.txt
> Character set encoding: ISO Latin-1             eg. 10056-8.txt
> Character set encoding: ISO-LATIN-1             eg. 8nggd10.txt
> Character set encoding: ISO-Latin-1             eg. hstrd10.txt
> Character set encoding: iso-Latin-1             eg. 8wpwl10.txt
> Character set encoding: iso-latin-1             eg. 8engl10.txt
> Character set encoding: ISO8859-1               eg. 7bjrn10.txt
> Character set encoding: ISO8859_1               eg. a1001107.txt
> Character set encoding: KOI8-R                  eg. ktria10.txt
> Character set encoding: Latin 1                 eg. divrw10.txt
> Character set encoding: Latin-1                 eg. 8dawn10.txt
> Character set encoding: Latin-4                 eg. kalev10.txt
> Character set encoding: Latin1                  eg. 10347-8.txt
> Character set encoding: MP3                     eg. 10348-m-readme.txt
> Character set encoding: MPEG                    eg. atomi10m-readme.txt
> Character set encoding: MPEG Layer 3 (MP3)      eg. 1donq3-readme.txt
> Character set encoding: Unicode                 eg. 10752-8.txt
> Character set encoding: Unicode (UTF-8)         eg. orama10u.txt
> Character set encoding: Unicode UTF-8           eg. 11753-0.txt
> Character set encoding: US-ASCII                eg. 10078.txt
> Character set encoding: US-ASCII, MIDI, Lilypond, MP3 and TeX
>                                                 eg. 10535.txt
> Character set encoding: UTF-16                  eg. 13083-utf16.txt
> Character set encoding: UTF-7                   eg. 8cart10.txt
> Character set encoding: UTF-8                   eg. 10140-0.txt
> Character set encoding: utf-8                   eg. astrl10.txt
> Character set encoding: UTF8                    eg. 8gslt10.txt
> Character set encoding: Windows-1250            eg. 15201-8.txt
> Character set encoding: Windows 1251            eg. olavg10.txt
> Character set encoding: Windows-1252            eg. 8clcn10.txt
> Character set encoding: Windows Code Page 1252  eg. 8tjna10.txt
> Character set encoding: Windows Codepage 1252   eg. 8vepi10.txt
> Character set encoding: Windows1253             eg. orama10.txt
> Chatacter set encoding: ISO-8859-1              eg. 10021-8.txt
> Chatacter set encoding: iso-8859-1              eg. 10026-8.txt
> Chatacter set encoding: MP3                     eg. 10137-m-readme.txt
> Chatacter set encoding: Sibelius 3 SIB format and MP3 audio
>                                                 eg. 10344-readme.txt
> Chatacter set encoding: US-ASCII                eg. 10021.txt
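
The name massaging in the patch can be tried on its own against
entries from list.log.  The sketch below is an assumption-laden
restatement, not the patch itself: `pg-normalize-charset-name' is a
hypothetical helper that applies the same replacements in the same
order and stops short of calling `locale-charset-to-coding-system', so
it shows only the string that would be handed to that function, not
whether Emacs recognises it.

```elisp
;; Hypothetical helper, not part of the patch.
(defun pg-normalize-charset-name (name)
  "Return NAME lowercased and rewritten with the patch's replacements."
  (let ((charset (downcase name)))
    (if (member charset '("ascii" "iso-646-us (us-ascii)"))
        (setq charset "us-ascii"))
    ;; Each rule is (REGEXP REPLACEMENT LITERAL), in the patch's order.
    (dolist (rule '(("^as?cii[ (,]*with.* \\(iso-8859-[0-9]+\\).*" "\\1" nil)
                    ("^\\(cp\\|codepage\\|windows \\(code ?page\\)?\\)[ -]*"
                     "windows-" t)
                    ("^unicode\r?$" "utf-8" nil)
                    ("^unicode utf" "utf" t)
                    ("^unicode (\\(.*\\))$" "\\1" nil)
                    ("8858" "8859" t)
                    ("^ido-" "iso-" t)
                    ("^iso 8859-\\([0-9]+\\) (\\(latin-\\1\\))$" "\\2" nil)
                    ("[= ]" "-" t)))
      (setq charset (replace-regexp-in-string
                     (nth 0 rule) (nth 1 rule) charset t (nth 2 rule))))
    charset))

(pg-normalize-charset-name "ACII, with some ISO-8859-1 characters")
;; => "iso-8859-1"
(pg-normalize-charset-name "Windows Code Page 1252")
;; => "windows-1252"
(pg-normalize-charset-name "ISO 8859-1 (Latin-1)")
;; => "latin-1"
```

With these rules, "Windows1253" (orama10.txt) comes out as
"windows1253", since the codepage rule only rewrites "windows" when a
space follows it.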

Current Thread

Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Kevin Ryde, 2005/10/24
- Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Kenichi Handa <=
  - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Stefan Monnier, 2005/10/24
    - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Kevin Ryde, 2005/10/25
    - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Richard M. Stallman, 2005/10/26
    - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Kevin Ryde, 2005/10/26
    - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Richard M. Stallman, 2005/10/27
    - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Kevin Ryde, 2005/10/27
  - Re: gutenberg-coding.el -- coding system for Project Gutenberg files, Richard M. Stallman, 2005/10/25
