first draft ZSAV implementation
From: Ben Pfaff
Subject: first draft ZSAV implementation
Date: Tue, 15 Oct 2013 00:16:05 -0700
User-agent: Mutt/1.5.21 (2010-09-15)
I'm working on a ZSAV implementation. Since users seem eager for this,
here's a first draft. It reads all the ZSAV files I've encountered so
far. It needs some tests and probably a writer implementation. Those
will take a few days.
--8<--------------------------cut here-------------------------->8--
From: Ben Pfaff <address@hidden>
Date: Tue, 15 Oct 2013 00:14:01 -0700
Subject: [PATCH] Work on ZSAV implementation.
---
doc/dev/system-file-format.texi | 182 ++++++++++++++++++---
src/data/sys-file-private.h | 14 +-
src/data/sys-file-reader.c | 265 ++++++++++++++++++++++++++++---
src/data/sys-file-reader.h | 10 +-
src/language/dictionary/sys-file-info.c | 7 +-
utilities/pspp-dump-sav.c | 139 ++++++++++++++--
6 files changed, 558 insertions(+), 59 deletions(-)
diff --git a/doc/dev/system-file-format.texi b/doc/dev/system-file-format.texi
index f408ff2..fc9a455 100644
--- a/doc/dev/system-file-format.texi
+++ b/doc/dev/system-file-format.texi
@@ -56,6 +56,18 @@ appears in system files only in missing value ranges, which never
contain SYSMIS.
@end table
+System files may use most character encodings based on an 8-bit unit.
+UTF-16 and UTF-32, based on wider units, appear to be unacceptable.
+@code{rec_type} in the file header record is sufficient to distinguish
+between ASCII and EBCDIC based encodings. The best way to determine
+the specific encoding in use is to consult the character encoding
+record (@pxref{Character Encoding Record}), if present, and failing
+that the @code{character_code} in the machine integer info record
+(@pxref{Machine Integer Info Record}). The same encoding should be
+used for the dictionary and the data in the file, although it is
+possible to artificially synthesize files that use different encodings
+(@pxref{Character Encoding Record}).
+
System files are divided into records, each of which begins with a
4-byte record type, usually regarded as an @code{int32}.
@@ -121,7 +133,7 @@ char rec_type[4];
char prod_name[60];
int32 layout_code;
int32 nominal_case_size;
-int32 compressed;
+int32 compression;
int32 weight_index;
int32 ncases;
flt64 bias;
@@ -133,9 +145,17 @@ char padding[3];
@table @code
@item char rec_type[4];
-Record type code, set to @samp{$FL2}, that is, either @code{24 46 4c
-32} if the file uses an ASCII-based character encoding, or @code{5b c6
-d3 f2} if the file uses an EBCDIC-based character encoding.
+Record type code, either @samp{$FL2} for system files with
+uncompressed data or data compressed with simple bytecode compression,
+or @samp{$FL3} for system files with ZLIB compressed data.
+
+This is truly a character field that uses the same character encoding
+as other strings. Thus, in a file with an ASCII-based character encoding
+this field contains @code{24 46 4c 32} or @code{24 46 4c 33}, and in a
+file with an EBCDIC-based encoding this field contains @code{5b c6 d3
+f2}. (SPSS documentation states that ZLIB-compressed files must be
+encoded in UTF-8, so EBCDIC-based ZLIB-compressed files presumably do
+not exist.)
@item char prod_name[60];
Product identification string. This always begins with the characters
@@ -160,7 +180,10 @@ files written by some systems set this value to -1. In general, it is
unsafe for systems reading system files to rely upon this value.
@item int32 compressed;
-Set to 1 if the data in the file is compressed, 0 otherwise.
+Set to 0 if the data in the file is not compressed, 1 if the data is
+compressed with simple bytecode compression, 2 if the data is ZLIB
+compressed. This field has value 2 if and only if @code{rec_type} is
+@samp{$FL3}.
@item int32 weight_index;
If one of the variables in the data set is used as a weighting
@@ -577,7 +600,8 @@ Floating point representation code. For IEEE 754 systems this is 1.
IBM 370 sets this to 2, and DEC VAX E to 3.
@item int32 compression_code;
-Compression code. Always set to 1.
+Compression code. Always set to 1, regardless of whether or how the
+file is compressed.
@item int32 endianness;
Machine endianness. 1 indicates big-endian, 2 indicates little-endian.
@@ -1434,22 +1458,23 @@ Ignored padding. Should be set to 0.
@node Data Record
@section Data Record
-Data records must follow all other records in the system file. There must
-be at least one data record in every system file.
-
-The format of data records varies depending on whether the data is
-compressed. Regardless, the data is arranged in a series of 8-byte
-elements.
+The data record must follow all other records in the system file.
+Every system file must have a data record that specifies data for at
+least one case. The format of the data record varies depending on the
+value of @code{compression} in the file header record:
-When data is not compressed,
-each element corresponds to
+@table @asis
+@item 0: no compression
+Data is arranged as a series of 8-byte elements.
+Each element corresponds to
the variable declared in the respective variable record (@pxref{Variable
Record}). Numeric values are given in @code{flt64} format; string
values are literal characters string, padded on the right when
necessary to fill out 8-byte units.
-Compressed data is arranged in the following manner: the first 8 bytes
-in the data section is divided into a series of 1-byte command
+@item 1: bytecode compression
+The first 8 bytes
+of the data record is divided into a series of 1-byte command
codes. These codes have meanings as described below:
@table @asis
@@ -1487,8 +1512,125 @@ An 8-byte string value that is all spaces.
The system-missing value.
@end table
-When the end of the an 8-byte group of command bytes is reached, any
-blocks of non-compressible values indicated by code 253 are skipped,
-and the next element of command bytes is read and interpreted, until
-the end of the file or a code with value 252 is reached.
+The end of the 8-byte group of bytecodes is followed by any 8-byte
+blocks of non-compressible values indicated by code 253. After that
+follows another 8-byte group of bytecodes, then those bytecodes'
+non-compressible values. The pattern repeats to the end of the file
+or a code with value 252.
+
+@item 2: ZLIB compression
+The data record consists of the following, in order:
+
+@itemize @bullet
+@item
+ZLIB data header, 24 bytes long.
+
+@item
+One or more variable-length blocks of ZLIB compressed data.
+
+@item
+ZLIB data trailer, with a 24-byte fixed header plus an additional 24
+bytes for each preceding ZLIB compressed data block.
+@end itemize
+
+The ZLIB data header has the following format:
+
+@example
+int64 zheader_ofs;
+int64 ztrailer_ofs;
+int64 ztrailer_len;
+@end example
+
+@table @code
+@item int64 zheader_ofs;
+The offset, in bytes, of the beginning of this structure within the
+system file.
+
+@item int64 ztrailer_ofs;
+The offset, in bytes, of the first byte of the ZLIB data trailer.
+
+@item int64 ztrailer_len;
+The number of bytes in the ZLIB data trailer. This and the previous
+field sum to the size of the system file in bytes.
+@end table
+
+The data header is followed by @code{(ztrailer_len - 24) / 24} ZLIB
+compressed data blocks. Each ZLIB compressed data block begins with a
+ZLIB header as specified in RFC@tie{}1950, e.g.@: hex bytes @code{78
+01} (the only header yet observed in practice). Each block
+decompresses to a fixed number of bytes (in practice only
+@code{0x3ff000}-byte blocks have been observed), except that the last
+block of data may be shorter. The last ZLIB compressed data block
+ends just before offset @code{ztrailer_ofs}.
+
+The result of ZLIB decompression is bytecode compressed data as
+described above for compression format 1.
+
+The ZLIB data trailer begins with the following 24-byte fixed header:
+
+@example
+int64 int_bias;
+int64 zero;
+int32 block_size;
+int32 n_blocks;
+@end example
+
+@table @code
+@item int64 int_bias;
+The compression bias as a negative integer, e.g.@: if @code{bias} in
+the file header record is 100.0, then @code{int_bias} is @minus{}100
+(this is the only value yet observed in practice).
+
+@item int64 zero;
+Always observed to be zero.
+
+@item int32 block_size;
+The number of bytes in each ZLIB compressed data block, except
+possibly the last, following decompression. Only @code{0x3ff000} has
+been observed so far.
+
+@item int32 n_blocks;
+The number of ZLIB compressed data blocks, always exactly
+@code{(ztrailer_len - 24) / 24}.
+@end table
+
+The fixed header is followed by @code{n_blocks} 24-byte ZLIB data
+block descriptors, each of which describes the compressed data block
+corresponding to its offset. Each block descriptor has the following
+format:
+
+@example
+int64 uncompressed_ofs;
+int64 compressed_ofs;
+int32 uncompressed_size;
+int32 compressed_size;
+@end example
+
+@table @code
+@item int64 uncompressed_ofs;
+The offset, in bytes, that this block of data would have in a similar
+system file that uses compression format 1. This is
+@code{zheader_ofs} in the first block descriptor, and in each
+succeeding block descriptor it is the sum of the previous descriptor's
+@code{uncompressed_ofs} and @code{uncompressed_size}.
+
+@item int64 compressed_ofs;
+The offset, in bytes, of the actual beginning of this compressed data
+block. This is @code{zheader_ofs + 24} in the first block descriptor,
+and in each succeeding block descriptor it is the sum of the previous
+descriptor's @code{compressed_ofs} and @code{compressed_size}. The
+final block descriptor's @code{compressed_ofs} and
+@code{compressed_size} sum to @code{ztrailer_ofs}.
+
+@item int32 uncompressed_size;
+The number of bytes in this data block, after decompression. This is
+@code{block_size} in every data block except the last, which may be
+smaller.
+
+@item int32 compressed_size;
+The number of bytes in this data block, as stored compressed in this
+system file.
+@end table
+@end table
+
@setfilename ignored
diff --git a/src/data/sys-file-private.h b/src/data/sys-file-private.h
index 21ff8ad..72f1ae3 100644
--- a/src/data/sys-file-private.h
+++ b/src/data/sys-file-private.h
@@ -1,5 +1,5 @@
/* PSPP - a program for statistical analysis.
- Copyright (C) 2006-2007, 2009-2012 Free Software Foundation, Inc.
+ Copyright (C) 2006-2007, 2009-2013 Free Software Foundation, Inc.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -35,12 +35,14 @@
struct dictionary;
-/* Magic numbers.
+/* ASCII magic numbers. */
+#define ASCII_MAGIC "$FL2" /* For regular files. */
+#define ASCII_ZMAGIC "$FL3" /* For ZLIB compressed files. */
- Both of these are actually $FL2 in the respective character set. The "FL2"
- part is invariant among national variants of each character set, but "$" has
- different encodings, so it is safer to write them as hexadecimal. */
-#define ASCII_MAGIC "\x24\x46\x4c\x32"
+/* EBCDIC magic number, the same as ASCII_MAGIC but encoded in EBCDIC.
+
+ No EBCDIC ZLIB compressed files have been observed, so we do not define
+ EBCDIC_ZMAGIC even though the value is obvious. */
#define EBCDIC_MAGIC "\x5b\xc6\xd3\xf2"
/* A variable in a system file. */
diff --git a/src/data/sys-file-reader.c b/src/data/sys-file-reader.c
index d553b3a..b6f5acf 100644
--- a/src/data/sys-file-reader.c
+++ b/src/data/sys-file-reader.c
@@ -24,6 +24,8 @@
#include <inttypes.h>
#include <setjmp.h>
#include <stdlib.h>
+#include <sys/stat.h>
+#include <zlib.h>
#include "data/attributes.h"
#include "data/case.h"
@@ -57,6 +59,7 @@
#include "gl/minmax.h"
#include "gl/unlocked-io.h"
#include "gl/xalloc.h"
+#include "gl/xalloc-oversized.h"
#include "gl/xsize.h"
#include "gettext.h"
@@ -173,11 +176,21 @@ struct sfm_reader
const char *encoding; /* String encoding. */
/* Decompression. */
- bool compressed; /* File is compressed? */
+ enum sfm_compression compression;
double bias; /* Compression bias, usually 100.0. */
uint8_t opcodes[8]; /* Current block of opcodes. */
size_t opcode_idx; /* Next opcode to interpret, 8 if none left. */
bool corruption_warning; /* Warned about possible corruption? */
+
+ /* ZLIB decompression. */
+ long long int ztrailer_ofs; /* Offset of ZLIB trailer at end of file. */
+#define ZIN_BUF_SIZE 4096
+ uint8_t *zin_buf; /* Inflation input buffer. */
+#define ZOUT_BUF_SIZE 16384
+ uint8_t *zout_buf; /* Inflation output buffer. */
+ unsigned int zout_end; /* Number of bytes of data in zout_buf. */
+ unsigned int zout_pos; /* First unconsumed byte in zout_buf. */
+ z_stream zstream; /* ZLIB inflater. */
};
static const struct casereader_class sys_file_casereader_class;
@@ -200,10 +213,19 @@ static void sys_error (struct sfm_reader *, off_t, const char *, ...)
static void read_bytes (struct sfm_reader *, void *, size_t);
static bool try_read_bytes (struct sfm_reader *, void *, size_t);
static int read_int (struct sfm_reader *);
-static double read_float (struct sfm_reader *);
+static long long int read_int64 (struct sfm_reader *);
static void read_string (struct sfm_reader *, char *, size_t);
static void skip_bytes (struct sfm_reader *, size_t);
+/* ZLIB compressed data handling. */
+static void read_zheader (struct sfm_reader *);
+static void open_zstream (struct sfm_reader *);
+static void close_zstream (struct sfm_reader *);
+static bool read_bytes_zlib (struct sfm_reader *, void *, size_t);
+static void read_compressed_bytes (struct sfm_reader *, void *, size_t);
+static bool try_read_compressed_bytes (struct sfm_reader *, void *, size_t);
+static double read_compressed_float (struct sfm_reader *);
+
static char *fix_line_ends (const char *);
static int parse_int (struct sfm_reader *, const void *data, size_t ofs);
@@ -367,6 +389,7 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding,
r->error = false;
r->opcode_idx = sizeof r->opcodes;
r->corruption_warning = false;
+ r->zin_buf = r->zout_buf = NULL;
info = infop ? infop : xmalloc (sizeof *info);
memset (info, 0, sizeof *info);
@@ -472,6 +495,9 @@ sfm_open_reader (struct file_handle *fh, const char *volatile encoding,
}
}
+ if (r->compression == SFM_COMP_ZLIB)
+ read_zheader (r);
+
/* Now actually parse what we read.
First, figure out the correct character encoding, because this determines
@@ -646,7 +672,9 @@ sfm_detect (FILE *file)
return false;
magic[4] = '\0';
- return !strcmp (ASCII_MAGIC, magic) || !strcmp (EBCDIC_MAGIC, magic);
+ return (!strcmp (ASCII_MAGIC, magic)
+ || !strcmp (ASCII_ZMAGIC, magic)
+ || !strcmp (EBCDIC_MAGIC, magic));
}
/* Reads the global header of the system file. Initializes *HEADER and *INFO,
@@ -658,12 +686,18 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
{
uint8_t raw_layout_code[4];
uint8_t raw_bias[8];
+ int compressed;
+ bool zmagic;
read_string (r, header->magic, sizeof header->magic);
read_string (r, header->eye_catcher, sizeof header->eye_catcher);
- if (strcmp (ASCII_MAGIC, header->magic)
- && strcmp (EBCDIC_MAGIC, header->magic))
+ if (!strcmp (ASCII_MAGIC, header->magic)
+ || !strcmp (EBCDIC_MAGIC, header->magic))
+ zmagic = false;
+ else if (!strcmp (ASCII_ZMAGIC, header->magic))
+ zmagic = true;
+ else
sys_error (r, 0, _("This is not an SPSS system file."));
/* Identify integer format. */
@@ -681,7 +715,25 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
|| header->nominal_case_size > INT_MAX / 16)
header->nominal_case_size = -1;
- r->compressed = read_int (r) != 0;
+ compressed = read_int (r);
+ if (!zmagic)
+ {
+ if (compressed == 0)
+ r->compression = SFM_COMP_NONE;
+ else if (compressed == 1)
+ r->compression = SFM_COMP_SIMPLE;
+ else
+ sys_error (r, 0, "System file header has invalid compression "
+ "value %d.", compressed);
+ }
+ else
+ {
+ if (compressed == 2)
+ r->compression = SFM_COMP_ZLIB;
+ else
+ sys_error (r, 0, "ZLIB-compressed system file header has invalid "
+ "compression value %d.", compressed);
+ }
header->weight_idx = read_int (r);
@@ -723,7 +775,7 @@ read_header (struct sfm_reader *r, struct sfm_read_info *info,
info->integer_format = r->integer_format;
info->float_format = r->float_format;
- info->compressed = r->compressed;
+ info->compression = r->compression;
info->case_cnt = r->case_cnt;
}
@@ -2289,7 +2341,7 @@ read_error (struct casereader *r, const struct sfm_reader *sfm)
static bool
read_case_number (struct sfm_reader *r, double *d)
{
- if (!r->compressed)
+ if (r->compression == SFM_COMP_NONE)
{
uint8_t number[8];
if (!try_read_bytes (r, number, sizeof number))
@@ -2339,13 +2391,13 @@ read_case_string (struct sfm_reader *r, uint8_t *s, size_t length)
static int
read_opcode (struct sfm_reader *r)
{
- assert (r->compressed);
+ assert (r->compression != SFM_COMP_NONE);
for (;;)
{
int opcode;
if (r->opcode_idx >= sizeof r->opcodes)
{
- if (!try_read_bytes (r, r->opcodes, sizeof r->opcodes))
+ if (!try_read_compressed_bytes (r, r->opcodes, sizeof r->opcodes))
return -1;
r->opcode_idx = 0;
}
@@ -2370,7 +2422,7 @@ read_compressed_number (struct sfm_reader *r, double *d)
return false;
case 253:
- *d = read_float (r);
+ *d = read_compressed_float (r);
break;
case 254:
@@ -2411,7 +2463,7 @@ read_compressed_string (struct sfm_reader *r, uint8_t *dst)
return false;
case 253:
- read_bytes (r, dst, 8);
+ read_compressed_bytes (r, dst, 8);
break;
case 254:
@@ -2453,7 +2505,7 @@ static bool
read_whole_strings (struct sfm_reader *r, uint8_t *s, size_t length)
{
assert (length % 8 == 0);
- if (!r->compressed)
+ if (r->compression == SFM_COMP_NONE)
return try_read_bytes (r, s, length);
else
{
@@ -2820,14 +2872,14 @@ read_int (struct sfm_reader *r)
return integer_get (r->integer_format, integer, sizeof integer);
}
-/* Reads a 64-bit floating-point number from R and returns its
- value in host format. */
-static double
-read_float (struct sfm_reader *r)
+/* Reads a 64-bit signed integer from R and returns its value in
+ host format. */
+static long long int
+read_int64 (struct sfm_reader *r)
{
- uint8_t number[8];
- read_bytes (r, number, sizeof number);
- return float_get_double (r->float_format, number);
+ uint8_t integer[8];
+ read_bytes (r, integer, sizeof integer);
+ return integer_get (r->integer_format, integer, sizeof integer);
}
static int
@@ -2894,6 +2946,179 @@ fix_line_ends (const char *s)
return dst;
}
+static void *
+zalloc (voidpf pool_, uInt items, uInt size)
+{
+ struct pool *pool = pool_;
+
+ return (!size || xalloc_oversized (items, size)
+ ? Z_NULL
+ : pool_malloc (pool, items * size));
+}
+
+static void
+zfree (voidpf pool_, voidpf address)
+{
+ struct pool *pool = pool_;
+
+ pool_free (pool, address);
+}
+
+static void
+read_zheader (struct sfm_reader *r)
+{
+ off_t pos = r->pos;
+ long long int zheader_ofs = read_int64 (r);
+ long long int ztrailer_ofs = read_int64 (r);
+ long long int ztrailer_len = read_int64 (r);
+ struct stat s;
+
+ if (zheader_ofs != pos)
+ sys_error (r, pos, _("Wrong ZLIB data header offset 0x%llx."),
+ zheader_ofs);
+
+ if (ztrailer_ofs < r->pos)
+ sys_error (r, pos, _("Impossible ZLIB trailer offset 0x%llx."),
+ ztrailer_ofs);
+
+ if (ztrailer_len < 24 || ztrailer_len % 24)
+ sys_error (r, pos, _("Invalid ZLIB trailer length %lld."), ztrailer_len);
+
+ if (!fstat(fileno(r->file), &s)
+ && ztrailer_ofs + ztrailer_len != s.st_size)
+ sys_warn (r, pos,
+ _("End of ZLIB trailer (0x%llx) is not file size (0x%llx)."),
+ ztrailer_ofs + ztrailer_len, (long long int) s.st_size);
+
+ r->ztrailer_ofs = ztrailer_ofs;
+
+ if (r->zin_buf == NULL)
+ {
+ r->zin_buf = pool_malloc (r->pool, ZIN_BUF_SIZE);
+ r->zout_buf = pool_malloc (r->pool, ZOUT_BUF_SIZE);
+ r->zstream.next_in = NULL;
+ r->zstream.avail_in = 0;
+ }
+
+ r->zstream.zalloc = zalloc;
+ r->zstream.zfree = zfree;
+ r->zstream.opaque = r->pool;
+
+ open_zstream (r);
+}
+
+static void
+open_zstream (struct sfm_reader *r)
+{
+ int error;
+
+ r->zout_pos = r->zout_end = 0;
+ error = inflateInit (&r->zstream);
+ if (error != Z_OK)
+ sys_error (r, r->pos, _("ZLIB initialization failed (%s)."),
+ r->zstream.msg);
+}
+
+static void
+close_zstream (struct sfm_reader *r)
+{
+ int error;
+
+ error = inflateEnd (&r->zstream);
+ if (error != Z_OK)
+ sys_error (r, r->pos, _("Inconsistency at end of ZLIB stream (%s)."),
+ r->zstream.msg);
+}
+
+static bool
+read_bytes_zlib (struct sfm_reader *r, void *buf_, size_t byte_cnt)
+{
+ uint8_t *buf = buf_;
+
+ if (byte_cnt == 0)
+ return true;
+
+ for (;;)
+ {
+ int error;
+
+ /* Use already inflated data if there is any. */
+ if (r->zout_pos < r->zout_end)
+ {
+ unsigned int n = MIN (byte_cnt, r->zout_end - r->zout_pos);
+ memcpy (buf, &r->zout_buf[r->zout_pos], n);
+ r->zout_pos += n;
+ byte_cnt -= n;
+ buf += n;
+
+ if (byte_cnt == 0)
+ return true;
+ }
+
+ /* We need to inflate some more data.
+ Get some more input data if we don't have any. */
+ if (r->zstream.avail_in == 0)
+ {
+ unsigned int n = MIN (ZIN_BUF_SIZE, r->ztrailer_ofs - r->pos);
+ if (n == 0 || !try_read_bytes (r, r->zin_buf, n))
+ return false;
+ r->zstream.avail_in = n;
+ r->zstream.next_in = r->zin_buf;
+ }
+
+ /* Inflate the (remaining) input data. */
+ r->zstream.avail_out = ZOUT_BUF_SIZE;
+ r->zstream.next_out = r->zout_buf;
+ error = inflate (&r->zstream, Z_SYNC_FLUSH);
+ r->zout_pos = 0;
+ r->zout_end = r->zstream.next_out - r->zout_buf;
+ if (r->zout_end == 0)
+ {
+ if (error == Z_STREAM_END)
+ {
+ close_zstream (r);
+ open_zstream (r);
+ }
+ else
+ sys_error (r, r->pos, _("ZLIB stream inconsistency (%s)."),
+ r->zstream.msg);
+ }
+ else
+ {
+ /* Process the output data and ignore 'error' for now. ZLIB will
+ present it to us again on the next inflate() call. */
+ }
+ }
+}
+
+static void
+read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt)
+{
+ if (r->compression == SFM_COMP_SIMPLE)
+ read_bytes (r, buf, byte_cnt);
+ else if (!read_bytes_zlib (r, buf, byte_cnt))
+ sys_error (r, r->pos, _("Unexpected end of ZLIB compressed data."));
+}
+
+static bool
+try_read_compressed_bytes (struct sfm_reader *r, void *buf, size_t byte_cnt)
+{
+ if (r->compression == SFM_COMP_SIMPLE)
+ return try_read_bytes (r, buf, byte_cnt);
+ else
+ return read_bytes_zlib (r, buf, byte_cnt);
+}
+
+/* Reads a 64-bit floating-point number from R and returns its
+ value in host format. */
+static double
+read_compressed_float (struct sfm_reader *r)
+{
+ uint8_t number[8];
+ read_compressed_bytes (r, number, sizeof number);
+ return float_get_double (r->float_format, number);
+}
+
static const struct casereader_class sys_file_casereader_class =
{
sys_file_casereader_read,
diff --git a/src/data/sys-file-reader.h b/src/data/sys-file-reader.h
index 037d33a..52457a0 100644
--- a/src/data/sys-file-reader.h
+++ b/src/data/sys-file-reader.h
@@ -26,6 +26,14 @@
/* Reading system files. */
+/* System file compression format. */
+enum sfm_compression
+ {
+ SFM_COMP_NONE, /* No compression. */
+ SFM_COMP_SIMPLE, /* Bytecode compression of integer values. */
+ SFM_COMP_ZLIB /* ZLIB "deflate" compression. */
+ };
+
/* System file info that doesn't fit in struct dictionary.
The strings in this structure are encoded in UTF-8. (They are normally in
@@ -36,7 +44,7 @@ struct sfm_read_info
char *creation_time; /* "hh:mm:ss". */
enum integer_format integer_format;
enum float_format float_format;
- bool compressed; /* 0=no, 1=yes. */
+ enum sfm_compression compression;
casenumber case_cnt; /* -1 if unknown. */
char *product; /* Product name. */
char *product_ext; /* Extra product info. */
diff --git a/src/language/dictionary/sys-file-info.c b/src/language/dictionary/sys-file-info.c
index 3327a2c..c7f326f 100644
--- a/src/language/dictionary/sys-file-info.c
+++ b/src/language/dictionary/sys-file-info.c
@@ -150,10 +150,11 @@ cmd_sysfile_info (struct lexer *lexer, struct dataset *ds UNUSED)
? var_get_name (weight_var) : _("Not weighted.")));
}
- tab_text (t, 0, r, TAB_LEFT, _("Mode:"));
+ tab_text (t, 0, r, TAB_LEFT, _("Compression:"));
tab_text_format (t, 1, r++, TAB_LEFT,
- _("Compression %s."), info.compressed ? _("on") : _("off"));
-
+ info.compression == SFM_COMP_NONE ? _("None")
+ : info.compression == SFM_COMP_SIMPLE ? "SAV"
+ : "ZSAV");
tab_text (t, 0, r, TAB_LEFT, _("Charset:"));
tab_text (t, 1, r++, TAB_LEFT, dict_get_encoding (d));
diff --git a/utilities/pspp-dump-sav.c b/utilities/pspp-dump-sav.c
index c6b5823..8eaf836 100644
--- a/utilities/pspp-dump-sav.c
+++ b/utilities/pspp-dump-sav.c
@@ -39,6 +39,13 @@
#define ID_MAX_LEN 64
+enum compression
+ {
+ COMP_NONE,
+ COMP_SIMPLE,
+ COMP_ZLIB
+ };
+
struct sfm_reader
{
const char *file_name;
@@ -52,7 +59,7 @@ struct sfm_reader
enum integer_format integer_format;
enum float_format float_format;
- bool compressed;
+ enum compression compression;
double bias;
};
@@ -87,7 +94,8 @@ static void read_long_string_missing_values (struct sfm_reader *r,
size_t size, size_t count);
static void read_unknown_extension (struct sfm_reader *,
size_t size, size_t count);
-static void read_compressed_data (struct sfm_reader *, int max_cases);
+static void read_simple_compressed_data (struct sfm_reader *, int max_cases);
+static void read_zlib_compressed_data (struct sfm_reader *);
static struct text_record *open_text_record (
struct sfm_reader *, size_t size);
@@ -180,7 +188,7 @@ main (int argc, char *argv[])
r.n_var_widths = 0;
r.allocated_var_widths = 0;
r.var_widths = 0;
- r.compressed = false;
+ r.compression = COMP_NONE;
if (argc - optind > 1)
printf ("Reading \"%s\":\n", r.file_name);
@@ -218,8 +226,13 @@ main (int argc, char *argv[])
(long long int) ftello (r.file),
(long long int) ftello (r.file) + 4);
- if (r.compressed && max_cases > 0)
- read_compressed_data (&r, max_cases);
+ if (r.compression == COMP_SIMPLE)
+ {
+ if (max_cases > 0)
+ read_simple_compressed_data (&r, max_cases);
+ }
+ else if (r.compression == COMP_ZLIB)
+ read_zlib_compressed_data (&r);
fclose (r.file);
}
@@ -241,11 +254,16 @@ read_header (struct sfm_reader *r)
char creation_date[10];
char creation_time[9];
char file_label[65];
+ bool zmagic;
read_string (r, rec_type, sizeof rec_type);
read_string (r, eye_catcher, sizeof eye_catcher);
- if (strcmp ("$FL2", rec_type) != 0)
+ if (!strcmp ("$FL2", rec_type))
+ zmagic = false;
+ else if (!strcmp ("$FL3", rec_type))
+ zmagic = true;
+ else
sys_error (r, "This is not an SPSS system file.");
/* Identify integer format. */
@@ -265,7 +283,24 @@ read_header (struct sfm_reader *r)
weight_index = read_int (r);
ncases = read_int (r);
- r->compressed = compressed != 0;
+ if (!zmagic)
+ {
+ if (compressed == 0)
+ r->compression = COMP_NONE;
+ else if (compressed == 1)
+ r->compression = COMP_SIMPLE;
+ else
+ sys_error (r, "SAV file header has invalid compression value "
+ "%"PRId32".", compressed);
+ }
+ else
+ {
+ if (compressed == 2)
+ r->compression = COMP_ZLIB;
+ else
+ sys_error (r, "ZSAV file header has invalid compression value "
+ "%"PRId32".", compressed);
+ }
/* Identify floating-point format and obtain compression bias. */
read_bytes (r, raw_bias, sizeof raw_bias);
@@ -289,7 +324,12 @@ read_header (struct sfm_reader *r)
printf ("File header record:\n");
printf ("\t%17s: %s\n", "Product name", eye_catcher);
printf ("\t%17s: %"PRId32"\n", "Layout code", layout_code);
- printf ("\t%17s: %"PRId32"\n", "Compressed", compressed);
+ printf ("\t%17s: %"PRId32" (%s)\n", "Compressed",
+ compressed,
+ r->compression == COMP_NONE ? "no compression"
+ : r->compression == COMP_SIMPLE ? "simple compression"
+ : r->compression == COMP_ZLIB ? "ZLIB compression"
+ : "<error>");
printf ("\t%17s: %"PRId32"\n", "Weight index", weight_index);
printf ("\t%17s: %"PRId32"\n", "Number of cases", ncases);
printf ("\t%17s: %g\n", "Compression bias", r->bias);
@@ -1170,7 +1210,7 @@ read_variable_attributes (struct sfm_reader *r, size_t size, size_t count)
}
static void
-read_compressed_data (struct sfm_reader *r, int max_cases)
+read_simple_compressed_data (struct sfm_reader *r, int max_cases)
{
enum { N_OPCODES = 8 };
uint8_t opcodes[N_OPCODES];
@@ -1258,6 +1298,87 @@ read_compressed_data (struct sfm_reader *r, int max_cases)
}
}
}
+
+static void
+read_zlib_compressed_data (struct sfm_reader *r)
+{
+ long long int ofs;
+ long long int this_ofs, next_ofs, next_len;
+ long long int bias, zero;
+ long long int running_uncmp_ofs, running_cmp_ofs;
+ unsigned int block_size, n_blocks;
+ unsigned int i;
+
+ read_int (r);
+ ofs = ftello (r->file);
+ printf ("\n%08llx: ZLIB compressed data header:\n", ofs);
+
+ this_ofs = read_int64 (r);
+ next_ofs = read_int64 (r);
+ next_len = read_int64 (r);
+
+ printf ("\tzheader_ofs: 0x%llx\n", this_ofs);
+ if (this_ofs != ofs)
+ printf ("\t\t(Expected 0x%llx.)\n", ofs);
+ printf ("\tztrailer_ofs: 0x%llx\n", next_ofs);
+ printf ("\tztrailer_len: %lld\n", next_len);
+ if (next_len < 24 || next_len % 24)
+ printf ("\t\t(Trailer length is not a positive multiple of 24.)\n");
+
+ printf ("\n%08llx: 0x%llx bytes of ZLIB compressed data\n",
+ ofs + 8 * 3, next_ofs - (ofs + 8 * 3));
+
+ skip_bytes (r, next_ofs - (ofs + 8 * 3));
+
+ printf ("\n%08llx: ZLIB trailer fixed header:\n", next_ofs);
+ bias = read_int64 (r);
+ zero = read_int64 (r);
+ block_size = read_int (r);
+ n_blocks = read_int (r);
+ printf ("\tbias: %lld\n", bias);
+ printf ("\tzero: 0x%llx\n", zero);
+ if (zero != 0)
+ printf ("\t\t(Expected 0.)\n");
+ printf ("\tblock_size: 0x%x\n", block_size);
+ if (block_size != 0x3ff000)
+ printf ("\t\t(Expected 0x3ff000.)\n");
+ printf ("\tn_blocks: %u\n", n_blocks);
+ if (n_blocks != next_len / 24 - 1)
+ printf ("\t\t(Expected %lld.)\n", next_len / 24 - 1);
+
+ running_uncmp_ofs = ofs;
+ running_cmp_ofs = ofs + 24;
+ for (i = 0; i < n_blocks; i++)
+ {
+ long long int blockinfo_ofs = ftello (r->file);
+ unsigned long long int uncompressed_ofs = read_int64 (r);
+ unsigned long long int compressed_ofs = read_int64 (r);
+ unsigned int uncompressed_size = read_int (r);
+ unsigned int compressed_size = read_int (r);
+
+ printf ("\n%08llx: ZLIB block descriptor %u\n", blockinfo_ofs, i + 1);
+
+ printf ("\tuncompressed_ofs: 0x%llx\n", uncompressed_ofs);
+ if (i == 0 && uncompressed_ofs != running_uncmp_ofs)
+ printf ("\t\t(Expected 0x%llx.)\n", ofs);
+
+ printf ("\tcompressed_ofs: 0x%llx\n", compressed_ofs);
+ if (i == 0 && compressed_ofs != running_cmp_ofs)
+ printf ("\t\t(Expected 0x%llx.)\n", ofs + 24);
+
+ printf ("\tuncompressed_size: 0x%x\n", uncompressed_size);
+ if (i < n_blocks - 1 && uncompressed_size != block_size)
+ printf ("\t\t(Expected 0x%x.)\n", block_size);
+
+ printf ("\tcompressed_size: 0x%x\n", compressed_size);
+ if (i == n_blocks - 1 && compressed_ofs + compressed_size != next_ofs)
+ printf ("\t\t(This was expected to be 0x%llx.)\n",
+ next_ofs - compressed_size);
+
+ running_uncmp_ofs += uncompressed_size;
+ running_cmp_ofs += compressed_size;
+ }
+}
/* Helpers for reading records that consist of structured text
strings. */
--
1.7.10.4
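
A reader's sketch, not part of the patch: the consistency rules that the
documentation hunk states for the ZLIB data header and trailer can be
checked in a few lines of C. The names below (struct zheader, struct
zblock, zblocks_consistent) are hypothetical and do not exist in PSPP.
The block count is derived from ztrailer_len, which is the check that
pspp-dump-sav.c performs (next_len / 24 - 1). The sketch assumes the
fields have already been converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copies of the on-disk records, already converted
   to host byte order.  These types are illustrative, not PSPP's. */
struct zheader
  {
    int64_t zheader_ofs;
    int64_t ztrailer_ofs;
    int64_t ztrailer_len;
  };

struct zblock
  {
    int64_t uncompressed_ofs;
    int64_t compressed_ofs;
    int32_t uncompressed_size;
    int32_t compressed_size;
  };

/* Returns true if the N_BLOCKS descriptors in BLOCKS agree with H and
   BLOCK_SIZE according to the documented rules:
   - the trailer is 24 bytes plus 24 bytes per block;
   - offsets start at zheader_ofs (uncompressed) and zheader_ofs + 24
     (compressed) and each descriptor continues where the last ended;
   - every block but the last decompresses to exactly BLOCK_SIZE bytes;
   - the last compressed block ends at ztrailer_ofs. */
static bool
zblocks_consistent (const struct zheader *h, int32_t block_size,
                    const struct zblock *blocks, int64_t n_blocks)
{
  assert (h != NULL && (blocks != NULL || n_blocks == 0));

  if (h->ztrailer_len < 24 || h->ztrailer_len % 24
      || n_blocks != (h->ztrailer_len - 24) / 24)
    return false;

  int64_t uncmp = h->zheader_ofs;       /* Expected uncompressed_ofs. */
  int64_t cmp = h->zheader_ofs + 24;    /* Expected compressed_ofs. */
  for (int64_t i = 0; i < n_blocks; i++)
    {
      const struct zblock *b = &blocks[i];

      if (b->uncompressed_ofs != uncmp || b->compressed_ofs != cmp)
        return false;
      if (i < n_blocks - 1
          ? b->uncompressed_size != block_size
          : b->uncompressed_size > block_size)
        return false;

      uncmp += b->uncompressed_size;
      cmp += b->compressed_size;
    }
  return cmp == h->ztrailer_ofs;
}
```

A reader could run a check like this before seeking to individual
blocks; the draft reader in the patch does not need it because it
inflates the blocks sequentially.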
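
Another reader's sketch, again not part of the patch: since ZLIB blocks
inflate to the same bytecode stream used by compression format 1, it may
help to see that stream decoded in isolation. The function
decode_numbers below is hypothetical. It assumes every variable is
numeric, that literal values are doubles in host format, that codes 1 to
251 stand for (code - bias), that code 0 is padding, and that the
system-missing value is -DBL_MAX; the real reader also handles strings
and foreign float formats.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decodes numeric values from SIZE bytes of bytecode-compressed DATA into
   OUT, which has room for MAX values.  Returns the number of values
   decoded.  Stops at code 252 (end of data), at a truncated group, when
   OUT is full, or at code 254 (8 spaces), which is not a valid number in
   this numeric-only sketch. */
static size_t
decode_numbers (const uint8_t *data, size_t size, double bias,
                double *out, size_t max)
{
  size_t pos = 0;
  size_t n = 0;

  assert (data != NULL && out != NULL);

  while (pos + 8 <= size)
    {
      /* Each group starts with 8 one-byte command codes... */
      const uint8_t *opcodes = &data[pos];
      pos += 8;

      for (int i = 0; i < 8; i++)
        {
          uint8_t op = opcodes[i];

          if (op == 0)
            continue;                   /* Padding, ignored. */
          if (op == 252 || op == 254 || n >= max)
            return n;

          if (op == 253)
            {
              /* ...followed by one 8-byte literal per code 253. */
              if (pos + 8 > size)
                return n;
              memcpy (&out[n++], &data[pos], 8);
              pos += 8;
            }
          else if (op == 255)
            out[n++] = -DBL_MAX;        /* System-missing value. */
          else
            out[n++] = op - bias;       /* Codes 1...251. */
        }
    }
  return n;
}
```

With the usual bias of 100.0, a code byte of 101 decodes to 1.0, so
small integers cost one byte each instead of eight.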