Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts
From: Müller, Andre
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Tue, 18 Feb 2014 18:37:37 +0000
> -----Original Message-----
> From: Ben Pfaff [mailto:address@hidden
> Sent: Tuesday, February 18, 2014 18:33
> To: Müller, Andre
> Cc: address@hidden
> Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> umlauts in variable names
>
> On Tue, Feb 18, 2014 at 03:17:26PM +0000, Müller, Andre wrote:
> > Another rather unfair testcase is a failure to identify a source file
> > in DIN_66003 coding, but that really is to be expected -- DIN_66003 is
> > a 7-bit-safe codepage for German, where äöüÄÖÜß§ take the place
> > of us-ascii's {|}[\]~@, respectively. An evil solution for problems
> > long gone. I think it's sane to not try and handle 7-bit non-ascii
> > codings, so that's just to let you know. Really I cannot think of any
> > way of handling them short of looking at oddities in character counts
> > or success rates with matches against dictionaries.
>
> The code that I wrote doesn't really identify encodings at all.
> Instead, it just tries to recode all the strings in the file from each
> of several possible encodings to UTF-8. That means that it's easy to
> add more encodings, including DIN_66003. The encodings that I chose are
> fairly arbitrary: I took them from the list at
> http://encoding.spec.whatwg.org/. I can add DIN_66003; no problem. Are
> there other encodings I should add?
Yes, I found that by "reading" your code... with reading in quotes because of
my utter lack of C knowledge. At least I can read the comments, and they are
quite thorough.
In any case, I did indeed miss one codepage in my first tests: IBM850.
That is the predecessor to windows-1252, also called MS-DOS Latin 1.
To my surprise, it is not listed on the encoding.spec page.
I think that would be a worthwhile addition,
more worthwhile than the really strange and old DIN_66003.
DIN_66003 would show up as a candidate every time the file is actually pure
US-ASCII. But nevertheless, it obviously has been used, so you may want to add it.
I really leave that up to you; it may be opening a can of worms.
DIN_66003 is just the German variant of ISO_646, and there are a whole bunch
of national variants of it: https://en.wikipedia.org/wiki/ISO/IEC_646
That may end up in a list from hell for each dataset coded in plain US-ASCII.
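
To illustrate the concern, here is a minimal Python sketch of the "try every
candidate encoding" approach described in the quoted message. It is not PSPP's
actual C code: the names DIN_66003_TABLE, decode_din_66003, CANDIDATES and
plausible_decodings are made up for this example, and the DIN_66003 decoder is
hand-written because Python ships no codec for it.

```python
# Hypothetical sketch, not PSPP's implementation.
# DIN 66003 differs from US-ASCII in exactly these eight positions.
DIN_66003_TABLE = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(data):
    # A 7-bit encoding: any byte >= 0x80 raises UnicodeDecodeError.
    return data.decode("ascii").translate(DIN_66003_TABLE)


# Assumed candidate list for the example; the real list is longer.
CANDIDATES = {
    "windows-1252": lambda b: b.decode("cp1252"),
    "IBM850": lambda b: b.decode("cp850"),
    "DIN_66003": decode_din_66003,
}


def plausible_decodings(data):
    """Return {encoding name: decoded text} for every candidate that decodes."""
    results = {}
    for name, decode in CANDIDATES.items():
        try:
            results[name] = decode(data)
        except UnicodeDecodeError:
            pass  # this candidate cannot represent the bytes
    return results


# A pure US-ASCII name decodes under every candidate, including DIN_66003.
print(plausible_decodings(b"AGE"))
# An 8-bit byte rules out the 7-bit candidate but leaves two readings.
print(plausible_decodings(b"GR\x99SSE"))
# A backslash is valid ASCII, yet DIN_66003 reads it as a capital O-umlaut.
print(plausible_decodings(b"GR\\SSE"))
```

Every further ISO_646 national variant added to such a candidate list would
likewise succeed on any file that contains only 7-bit bytes, which is where
the long list would come from.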
That's all I know for now, but I am still digging through my pile and expect that
I still missed a few oddities; I will have to run more thorough tests later on.
So, if I identify any more codepages not covered by SYSFILE INFO, I will let
you know.
Vielen Dank,
Andre