Character encoding again.
From: John Darrington
Subject: Character encoding again.
Date: Fri, 29 Oct 2010 09:28:52 +0000
User-agent: Mutt/1.5.18 (2008-05-17)
Based on information received, it seems that we're going to have to
start setting and acting upon the "character_code" value in PSPP sys files,
if we want them to remain compatible with those from SPSS in an
internationalised environment.
Currently, we always set this value to 2 on writing, and ignore it on
reading. However, apparently this causes problems when SPSS reads UTF-8
encoded files. Conceivably, it could also mean that PSPP won't properly
read certain SPSS-generated files.
Although there is another part of the file pertaining to character encoding
(record 7, subtype 20), from what I can make out that affects only the
encoding of the data records, and not the headers (labels etc.).
The character_code is currently documented as:
Character code. 1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
indicates 8-bit ASCII, 4 indicates DEC Kanji. Windows code page
numbers are also valid.
The problem is that we will need a mapping between this integer and
the strings which are recognised by iconv. According to Wikipedia, no
universally accepted mapping exists; every vendor has their own one!
Evidence suggests, however, that SPSS uses Microsoft's mapping, even when
running on non-Microsoft platforms. So the best source of information
seems to be http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
However, as you will see, this table has only 153 entries, whilst "iconv -l"
on my machine generates 1153 encoding names. So the question remains: what
do we do with the 1000 character sets not in Microsoft's table? I suspect
many of the iconv names are synonyms, and we can make educated guesses as
to their meaning. Similarly, a lot of the iconv names are of the form CP%d,
which suggests a mapping to the code page. However, there are still gaps.
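
As a rough sketch of the CP%d idea only: the helper below (the name
codepage_from_cp_name is hypothetical, not an existing PSPP function)
extracts the number from an iconv name of that form. It assumes that such
a name really does denote the Microsoft code page with the same number,
which would still need checking name by name.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: if NAME has the form "CP<digits>"
   (case-insensitive), return the number as a candidate code page.
   Return -1 for any other name.  The upper bound of 65535 is an
   assumption, chosen because code page numbers fit in 16 bits. */
int
codepage_from_cp_name (const char *name)
{
  char *end;
  long n;

  if (name == NULL
      || toupper ((unsigned char) name[0]) != 'C'
      || toupper ((unsigned char) name[1]) != 'P'
      || !isdigit ((unsigned char) name[2]))
    return -1;

  /* Reject trailing junk such as "CP1252x" and out-of-range values. */
  n = strtol (name + 2, &end, 10);
  if (*end != '\0' || n <= 0 || n > 65535)
    return -1;

  return (int) n;
}
```

Names that do not match, such as "UTF-8" or "ISO-8859-1", yield -1 and
would have to be resolved some other way.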
Moreover, I have seen a lot of SPSS data files which have this
"character_code" set to 2, yet contain data which are clearly not 7-bit ASCII.
Has anyone got any sensible suggestions on how to implement the two functions:
int get_codepage_from_encoding_name (const char*); and
const char *get_encoding_from_codepage (int);
???
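
One possible starting point, offered as a sketch and not a worked-out
answer, is a small static table searched linearly in both directions. The
table name, the struct, and the choice of NULL and -1 as "unknown" return
values are all assumptions. The handful of entries shown follow
Microsoft's code page identifiers; the table is deliberately incomplete,
and the legacy values 1 to 4 are not handled.

```c
#include <stddef.h>
#include <strings.h>            /* strcasecmp (POSIX) */

/* Hypothetical table entry pairing a Microsoft code page identifier
   with one iconv name for the same encoding. */
struct codepage_entry
  {
    int codepage;
    const char *encoding;
  };

/* Illustrative subset only; a real table would need many more rows. */
static const struct codepage_entry codepage_table[] =
  {
    { 1250, "CP1250" },
    { 1251, "CP1251" },
    { 1252, "CP1252" },
    { 20127, "ASCII" },
    { 20866, "KOI8-R" },
    { 28591, "ISO-8859-1" },
    { 28592, "ISO-8859-2" },
    { 28605, "ISO-8859-15" },
    { 65001, "UTF-8" },
  };

#define N_CODEPAGES (sizeof codepage_table / sizeof codepage_table[0])

/* Return the iconv name for CODEPAGE, or NULL if it is not in the table. */
const char *
get_encoding_from_codepage (int codepage)
{
  size_t i;

  for (i = 0; i < N_CODEPAGES; i++)
    if (codepage_table[i].codepage == codepage)
      return codepage_table[i].encoding;
  return NULL;
}

/* Return the code page for NAME (compared case-insensitively), or -1 if
   NAME is not in the table.  iconv aliases are not resolved here. */
int
get_codepage_from_encoding_name (const char *name)
{
  size_t i;

  if (name == NULL)
    return -1;
  for (i = 0; i < N_CODEPAGES; i++)
    if (strcasecmp (codepage_table[i].encoding, name) == 0)
      return codepage_table[i].codepage;
  return -1;
}
```

This does nothing about synonyms: an alias such as "LATIN1" would not be
found unless it were added as its own row or normalised beforehand.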
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.