Character encoding again.

pspp-dev

From:	John Darrington
Subject:	Character encoding again.
Date:	Fri, 29 Oct 2010 09:28:52 +0000
User-agent:	Mutt/1.5.18 (2008-05-17)

Based on information received, it seems that we're going to have to
start setting and acting upon the "character_code" value in PSPP system
files if we want them to remain compatible with those from SPSS in an
internationalised environment.

Currently, we always set this value to 2 on writing, and ignore it on
reading.  However, apparently this causes problems reading UTF-8 encoded
files in SPSS.  Conceivably, it could also mean that PSPP won't properly
read certain SPSS-generated files.

Although there is another part of the file pertaining to character
encoding (record 7, subtype 20), from what I can make out it affects
only the encoding of the data records, and not the headers (labels etc.).

The character_code is currently documented as:

  Character code.  1 indicates EBCDIC, 2 indicates 7-bit ASCII, 3
  indicates 8-bit ASCII, 4 indicates DEC Kanji.  Windows code page
  numbers are also valid.


The problem is that we will need a mapping between this integer and
the strings which are recognised by iconv.  According to Wikipedia, no
universally accepted mapping exists: every vendor has its own!
Evidence suggests, however, that SPSS uses Microsoft's mapping, even
when running on non-Microsoft platforms.  So the best source of
information seems to be
http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx

However, as you will see, this table has only 153 entries, whilst
"iconv -l" on my machine generates 1153 encoding names.  So the question
remains: what do we do with the 1000 character sets not in Microsoft's
table?  Many of the iconv names, I suspect, are synonyms, and we can make
educated guesses as to their meaning.  Similarly, a lot of the iconv
names are of the form CP%d, which suggests a mapping to the codepage.
However, there are still gaps.
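
The CP%d observation could serve as a fallback rule for names that are
not in Microsoft's table.  A minimal sketch, assuming the iconv name is
spelled in upper case as "CP" followed only by decimal digits; the helper
name codepage_from_cp_name is hypothetical and not an existing PSPP
function:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the number embedded in an iconv-style
   name of the form "CP<digits>" (e.g. "CP1252"), or -1 if NAME does
   not have exactly that form.  */
static int
codepage_from_cp_name (const char *name)
{
  char *end;
  long value;

  /* Require the literal prefix "CP" followed by at least one digit.  */
  if (strncmp (name, "CP", 2) != 0 || !isdigit ((unsigned char) name[2]))
    return -1;

  value = strtol (name + 2, &end, 10);

  /* Reject trailing characters and values outside a 16-bit range
     (assumed here to be enough for code page numbers).  */
  if (*end != '\0' || value <= 0 || value > 65535)
    return -1;

  return (int) value;
}
```

Whether a number obtained this way really matches the Microsoft code page
of the same number would still need checking case by case.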

Moreover, I have seen a lot of SPSS data files which have this
"character_code" set to 2, yet contain data which are clearly not 7-bit
ASCII.
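
One possible way for a reader to spot such files is to check whether the
strings it has read really are 7-bit before trusting a value of 2.  This
is only a sketch; is_7bit_ascii is a hypothetical helper, and what to do
when the check fails is left open:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if none of the N bytes at DATA
   has its high bit set, i.e. the buffer is plain 7-bit ASCII.  */
static bool
is_7bit_ascii (const unsigned char *data, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (data[i] & 0x80)
      return false;
  return true;
}
```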

Has anyone got any sensible suggestions on how to implement the two functions:

 int get_codepage_from_encoding_name (const char*);  and
 const char *get_encoding_from_codepage (int);

???
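
To make the question concrete, here is one way the two functions might
look: a small hand-written table, with "CP%d" as a fallback in both
directions.  This is a sketch under assumptions, not a proposal for the
final code.  The four table entries are code page identifiers from
Microsoft's list paired with the usual iconv spellings; the struct and
table names are made up, and returning NULL for the legacy values 1 to 4
is simply a placeholder decision.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative table only; a real one would cover far more of
   Microsoft's list.  */
struct codepage_entry
  {
    int codepage;
    const char *encoding;
  };

static const struct codepage_entry codepage_table[] =
  {
    {20127, "US-ASCII"},
    {28591, "ISO-8859-1"},
    {20866, "KOI8-R"},
    {65001, "UTF-8"},
  };

#define N_CODEPAGES (sizeof codepage_table / sizeof codepage_table[0])

/* Returns the code page number for NAME, or -1 if none is known.  */
int
get_codepage_from_encoding_name (const char *name)
{
  size_t i;

  for (i = 0; i < N_CODEPAGES; i++)
    if (strcmp (name, codepage_table[i].encoding) == 0)
      return codepage_table[i].codepage;

  /* Fallback: treat "CP<digits>" as naming the code page directly.  */
  if (strncmp (name, "CP", 2) == 0 && isdigit ((unsigned char) name[2]))
    {
      char *end;
      long value = strtol (name + 2, &end, 10);
      if (*end == '\0' && value > 0 && value <= 65535)
        return (int) value;
    }
  return -1;
}

/* Returns an encoding name for CODEPAGE, or NULL if none can be
   suggested.  The fallback name lives in a static buffer, so this
   sketch is not reentrant.  */
const char *
get_encoding_from_codepage (int codepage)
{
  static char fallback[16];
  size_t i;

  for (i = 0; i < N_CODEPAGES; i++)
    if (codepage_table[i].codepage == codepage)
      return codepage_table[i].encoding;

  /* Values 1 to 4 are the legacy codes (EBCDIC, 7-bit ASCII, 8-bit
     ASCII, DEC Kanji); they are deliberately left unmapped here.  */
  if (codepage <= 4 || codepage > 65535)
    return NULL;

  snprintf (fallback, sizeof fallback, "CP%d", codepage);
  return fallback;
}
```

The fallback name is only a guess: the caller would still have to pass
it to iconv_open() to find out whether the local iconv knows it.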


-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
