Re: even more about character encoding names
From: John Darrington
Subject: Re: even more about character encoding names
Date: Wed, 5 Jan 2011 12:02:25 +0000
User-agent: Mutt/1.5.18 (2008-05-17)
This seems to cover everything.
A purist might object to calling windows-1252 a "superset" of iso-8859-1 ...
they are just two different encodings, which happen to have large parts of
their mappings identical.
J'
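An illustrative aside on the "superset" point above (not part of the original message or the patch): the sketch below compares the two mappings byte by byte. It assumes Python's built-in `cp1252` and `latin-1` codecs as stand-ins for windows-1252 and ISO 8859-1; the variable names are made up for the example.

```python
# Decode every single-byte value under both encodings and sort the
# byte values into three groups.
same, different, undefined = [], [], []

for b in range(256):
    raw = bytes([b])
    latin1 = raw.decode("latin-1")      # defined for all 256 byte values
    try:
        cp1252 = raw.decode("cp1252")   # Python leaves a few bytes unmapped
    except UnicodeDecodeError:
        undefined.append(b)
        continue
    if cp1252 == latin1:
        same.append(b)
    else:
        different.append(b)

print("identical mappings:", len(same))
print("different mappings:", [hex(b) for b in different])
print("unmapped in cp1252:", [hex(b) for b in undefined])
```

All of the disagreement falls in the range 0x80 to 0x9F, where ISO 8859-1 has C1 control codes and windows-1252 has printable characters such as the euro sign at 0x80. So the two agree on every printable ISO 8859-1 character, but neither mapping strictly contains the other.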
On Mon, Jan 03, 2011 at 10:45:12AM -0800, Ben Pfaff wrote:
I think you've told me all of this before. It's time to write it
down. Here's what I have as an update to
system-file-format.texi. Can you look it over and verify that it
looks accurate? Also, if you have any system files locally that
have other codepage numbers not already mentioned, please let me
know which ones and I'll add them to the list.
--8<--------------------------cut here-------------------------->8--
From: Ben Pfaff <address@hidden>
Date: Mon, 3 Jan 2011 10:43:21 -0800
Subject: [PATCH] doc: Update description of character encoding information
in system files.
Based on information provided by John Darrington and on system files
obtained freely from the Internet.
---
doc/dev/system-file-format.texi | 66 +++++++++++++++++++++++++++++++++------
1 files changed, 56 insertions(+), 10 deletions(-)
diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index 972b133..bf376b5 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -549,14 +549,46 @@ Compression code. Always set to 1.
Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
@item int32 character_code;
-@anchor{character-code}
-Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
-indicates 8-bit ASCII, 4 indicates DEC Kanji.
-Windows code page numbers are also valid.
-
-Experience has shown that in many files, this field is ignored or incorrect.
-For a more reliable indication of the file's character encoding
-see @ref{Character Encoding Record}.
+@anchor{character-code} Character code. The following values have
+been actually observed in system files:
+
+@table @asis
+@item 2
+7-bit ASCII.
+
+@item 1250
+The @code{windows-1250} code page for Central European and Eastern
+European languages.
+
+@item 1252
+The @code{windows-1252} code page for Western European languages, a
+superset of ISO 8859-1.
+
+@item 28591
+ISO 8859-1.
+
+@item 65001
+UTF-8.
+@end table
+
+The following additional values are known to be defined:
+
+@table @asis
+@item 1
+EBCDIC.
+
+@item 3
+8-bit ``ASCII''.
+
+@item 4
+DEC Kanji.
+@end table
+
+Other Windows code page numbers are known to be generally valid.
+
+Old versions of SPSS always wrote value 2 in this field, regardless of
+the encoding in use. Newer versions also write the character encoding
+as a string (see @ref{Character Encoding Record}).
@end table
@node Machine Floating-Point Info Record
@@ -959,8 +991,22 @@ The name of the character encoding. Normally this will be an official IANA char
See @url{http://www.iana.org/assignments/character-sets}.
@end table
-This record is not present in files generated by older software.
-See also @ref{character-code}.
+This record is not present in files generated by older software. See
+also the @code{character_code} field in the machine integer info
+record (@pxref{character-code}).
+
+When the character encoding record and the machine integer info record
+are both present, all system files observed in practice indicate the
+same character encoding, e.g.@: 1252 as @code{character_code} and
+@code{windows-1252} as @code{encoding}, 65001 and @code{UTF-8}, etc.
+
+If, for testing purposes, a file is crafted with different
+@code{character_code} and @code{encoding}, it seems that
+@code{character_code} controls the encoding for all strings in the
+system file before the dictionary termination record, including
+strings in data (e.g.@: string missing values), and @code{encoding}
+controls the encoding for strings following the dictionary termination
+record.
@node Long String Value Labels Record
@section Long String Value Labels Record
--
1.7.1
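An illustrative aside (not part of the patch): the sketch below shows one way a reader of system files might act on the documentation above. It is a rough sketch under stated assumptions, not PSPP's implementation. The function names, the choice of Python codec names for each value, and the `cpNNNN` fallback for other Windows code pages are all assumptions made for the example. The split at the dictionary termination record follows the behaviour that the patch itself only says "seems" to hold for crafted files.

```python
import codecs

# character_code values the patch lists as observed in real files.
# The Python codec names on the right are this sketch's choice.
OBSERVED = {
    2: "ascii",
    1250: "cp1250",
    1252: "cp1252",
    28591: "latin-1",
    65001: "utf-8",
}

# Values the patch lists as defined but which name no single obvious
# codec: EBCDIC, 8-bit "ASCII", DEC Kanji.
AMBIGUOUS = {1, 3, 4}


def codec_for_character_code(code):
    """Return a Python codec name for a character_code value, or None."""
    if code in OBSERVED:
        return OBSERVED[code]
    if code in AMBIGUOUS:
        return None
    # Assumption: treat any other value as a Windows code page number and
    # accept it only if Python ships a codec under the name cpNNNN.
    try:
        return codecs.lookup("cp%d" % code).name
    except LookupError:
        return None


def codec_for_string(character_code, encoding, after_dict_termination):
    """Pick a codec for one string in a file whose two records may disagree.

    encoding is the name from the character encoding record, or None when
    that record is absent (files written by older software).
    """
    if after_dict_termination and encoding is not None:
        return codecs.lookup(encoding).name
    return codec_for_character_code(character_code)


print(codec_for_string(1252, "UTF-8", after_dict_termination=False))
print(codec_for_string(1252, "UTF-8", after_dict_termination=True))
```

Because old versions of SPSS wrote 2 regardless of the encoding in use, a real reader would probably not want to treat a value of 2 as proof that the file is pure ASCII; this sketch makes no attempt to detect that case.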
--
Ben Pfaff
http://benpfaff.org
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.