Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts
From: Müller, Andre
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Tue, 18 Feb 2014 15:17:26 +0000
> -----Original Message-----
> From: Ben Pfaff [mailto:address@hidden
> Sent: Monday, February 17, 2014 00:07
> To: Müller, Andre
> Cc: address@hidden
> Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> umlauts in variable names
>
> On Mon, Feb 10, 2014 at 04:29:11PM +0000, Müller, Andre wrote:
> > > -----Original Message-----
> > > From: Ben Pfaff [mailto:address@hidden
> > > Sent: Monday, February 10, 2014 17:16
> > > To: Müller, Andre
> > > Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> > > umlauts in variable names
> > >
> > > On Mon, Feb 10, 2014 at 12:26:36PM +0000, Müller, Andre wrote:
> > > > So I learn the .sav-file has no internal markers for the codepage used,
> > > > which in turn explains a lot of the codepage woes I have seen.
> > > > Thus, I will have to add a codepage-heuristic to my export-tool.
> > >
> > > It's only the very old SPSS files that lack an indication of codepage.
> > > This causes problems for a surprising number of PSPP users, so I'm
> > > working to add some codepage analysis to PSPP as well.
> >
> > Oh dear, that's work I'd hate to do for the general case.
> > I do have the advantage of a limited set of failure cases (~2k as my current
> > estimate) and a strong tendency for them to be from western Europe,
> > so I can check the "file -bi" state of the output and check for umlaut
> > presence.
> >
> > Most of the errors will go rather unnoticed, as the non-US-ASCII chars are
> > not in the "functional" parts but only in the labels. There I find
> > non-US-ASCII chars replaced by "?".
> >
> > Nevertheless: That work is much appreciated, and I'm looking forward to
> > being able to throw my lousy heuristics away.
>
> I committed this work, in the form of a new option to SYSFILE INFO that,
> instead of outputting the system file dictionary, outputs an analysis of
> the string data in the dictionary.
>
> This would be better if it were easily accessible through the GUI, but I
> guess that can be added later if necessary.
>
> Example of use:
> SYSFILE INFO FILE='ZA4209.sav' ENCODING='DETECT'.
Hi Ben,
I have tried the new SYSFILE INFO option and it works quite well.
So far I have piped some examples of uncommon codepages through it,
and it does well for SHIFT_JIS and IBM850 (or similar), for example.
The broken files I have that actually contain entries in more than one
codepage are not a valid test, but even for those it suggested at least
some of the codepages they contain.
That's nice.
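As an aside on why the mixed-codepage files are hard: strictly decoding each string against a few candidate encodings already rules some candidates out, but several usually survive. Below is a minimal Python sketch of that filtering step. It is not how PSPP's detection works, and the function name and the candidate list are made up for illustration.

```python
def decodable_as(raw, candidates=("utf-8", "cp1252", "cp850", "shift_jis")):
    """Return {codec: decoded text} for every candidate codec that decodes
    `raw` without error. `candidates` is an illustrative list, not a
    recommendation."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # This codec cannot represent the byte sequence; drop it.
            continue
    return results


if __name__ == "__main__":
    # 0x81 is "ü" in cp850, undefined in cp1252, and invalid on its own in UTF-8.
    for codec, text in decodable_as(b"M\x81ller").items():
        print(codec, text)
```

A candidate that survives is not necessarily right, only not provably wrong, so a real tool would still need some scoring on top of this.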
Another rather unfair test case is a source file in DIN_66003 coding, which
it fails to identify. That really is to be expected: DIN_66003 is a
7-bit-safe codepage for German, where äöüÄÖÜߧ take the place of US-ASCII's
{|}[\]~@, respectively. An evil solution for problems long gone.
I think it's sane not to try to handle 7-bit non-ASCII codings, so that's
just to let you know.
I really cannot think of any way of handling them short of looking at
oddities in character counts or success rates of matches against
dictionaries.
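To make the DIN_66003 case concrete, here is a minimal Python sketch of the character swap described above, plus a very crude "oddity count". Both function names are hypothetical, and the heuristic (a bracket-like character directly next to a letter) is only my assumption of what such a check might look like, not a tested detector. It skips "@" because that is too common in ordinary text.

```python
import re

# DIN 66003 reuses eight US-ASCII code points for German characters.
DIN_66003_TO_UNICODE = str.maketrans("{|}[\\]~@", "äöüÄÖÜß§")

# A replaced character preceded by a letter, or followed by a lowercase letter.
_HINT = re.compile(r"(?<=[A-Za-z])[{|}\[\]\\~]|[{|}\[\]\\~](?=[a-z])")


def decode_din_66003(raw):
    """Decode 7-bit bytes as DIN 66003 text (raises on 8-bit input)."""
    return raw.decode("ascii").translate(DIN_66003_TO_UNICODE)


def count_din_66003_hints(text):
    """Count characters that look like DIN 66003 letters inside words."""
    return len(_HINT.findall(text))


if __name__ == "__main__":
    label = b"M}ller"
    print(count_din_66003_hints(label.decode("ascii")), decode_din_66003(label))
```

Anything with legitimate brackets or braces in its labels will trigger false positives, which is one more reason not to attempt this in the general case.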
I will keep testing datasets with it and can hopefully say goodbye
to my bash hackery:
    # Try every charmap known to the system and log suspicious dictionary lines.
    locale -m | while read -r codepage; do
        echo -e "GET FILE='source.sav' ENCODING='$codepage' \nDISPLAY DICTIONARY" > psy
        echo "Now: $codepage" >> cp_catastrophe
        /usr/bin/pspp -b -O format=csv -O separator=" " -O quote='"' psy |
            grep -e "whatever" -e "are troublemakers" >> cp_catastrophe
    done
Best,
Andre