Re: [lmi] Unknown fields in table input text files
From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sat, 20 Feb 2016 14:33:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0
On 2016-02-20 12:57, Vadim Zeitlin wrote:
> On Sat, 20 Feb 2016 04:12:25 +0000 Greg Chicares <address@hidden> wrote:
>
> GC> On 2016-02-20 03:16, Vadim Zeitlin wrote:
> GC> >
> GC> > I decided to extend my tests checking that all tables in qx_ins and qx_cso
> GC> > databases survive the round trip through the new table code to also do the
> GC> > same for the tables in qx_ann and got several failures due to the presence
> GC> > of unknown "fields" in some of the tables here.
[...]
> GC> > One of them looks like a real field as it's present in several files: it's
> GC> > the "Editor: " one. I don't know at all what to do about it as there is no
> GC> > corresponding field in the binary format, so there doesn't seem to be any
> GC> > way to store the value of this field in it.
> GC>
> GC> Please tell me the number of a 'qx_ann' table that has this field so that
> GC> I can examine it. I don't remember ever seeing "Editor:" in these files.
>
> It occurs in the following tables:
>
> 893 894 895 896 897 898 952 953 954 955 956 957 958 959 960 961 962 963 964
> 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983
> 984
Let's examine table 893. Text output:
Table number: 893
Table type: Aggregate
Contributor: Roger Scott Lumsden address@hidden
Source of data: Population mortality (projected and modified for Pensioners)
Unit of observation: Number of lives
...
Comments: These are supposed to represent the expected mortality of pensioners from
the generation born in 1950, updated through 1990-92 census results.
This is from the diskette available with
"The Second Actuarial Study of Mortality in Europe"
Editor: A.S.MacDonald
In the raw data,
0574740 stx nul eot nul } etx nul nul etx nul soh nul A eot nul -
0574760 nul R o g e r sp S c o t t sp L u m
0575000 s d e n sp 7 5 1 4 7 . 2 6 2 0 @
0575020 c o m p u s e r v e . c o m enq nul
0575040 < nul P o p u l a t i o n sp m o r
0575060 t a l i t y sp ( p r o j e c t e
0575100 d sp a n d sp m o d i f i e d sp f
0575120 o r sp P e n s i o n e r s ) bs nul
0575140 si nul N u m b e r sp o f sp l i v e
0575160 s ht nul C soh B a s e l i n e sp c a
we can pick out the records:
[Contributor] eot nul - nul R o g e r sp ...
[Source of data] enq nul < nul P o p u l a t i o n ...
[Unit of observation] bs nul si nul N u m b e r sp o f ...
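Record "titles" like "Contributor" aren't spelled out; presumably the
four bytes preceding the raw record contents identify the record: the
first two give its type, and the next two look like its length (e.g.
'-' = 45, the length of the contributor string; si = 15, the length of
"Number of lives").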
The source code says:
#define DT_contributor 4
#define DT_dataSource 5
#define DT_unitOfObs 8
#define DT_comments 11
and indeed 4 = EOT, 5 = ENQ, and 8 = BS. Record 11 = \013 = VT is:
0576260 vt nul soh soh T h e s e sp a r e sp s u
0576300 p p o s e d sp t o sp r e p r e s
0576320 e n t sp t h e sp e x p e c t e d
0576340 sp m o r t a l i t y sp o f sp p e
0576360 n s i o n e r s sp f r o m nl t h
0576400 e sp g e n e r a t i o n sp b o r
0576420 n sp i n sp 1 9 5 0 , sp u p d a t
0576440 e d sp t h r o u g h sp 1 9 9 0 -
0576460 9 2 sp c e n s u s sp r e s u l t
0576500 s . nl T h i s sp i s sp f r o m sp
0576520 t h e sp d i s k e t t e sp a v a
0576540 i l a b l e sp w i t h nl " T h e
0576560 sp S e c o n d sp A c t u a r i a
0576600 l sp S t u d y sp o f sp M o r t a
0576620 l i t y sp i n sp E u r o p e " nl
0576640 E d i t o r : sp A . S . M a c D
0576660 o n a l d dle nul stx nul ack nul ff nul stx nul soh
So "Comments:" corresponds to VT, and is not spelled out in the file,
while "Editor:", beginning at 0576640, is just a text string contained
in the "Comment" record.
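
To make the framing concrete, here is a minimal sketch of how such a byte stream could be split into records. It assumes, from the dump above only and not from any specification, that each record is a 16-bit little-endian type, a 16-bit little-endian length, and then that many bytes of content. The names 'record', 'read_u16le', and 'split_records' are hypothetical and are not taken from lmi or 'dectable.cpp'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// One record as it appears to be framed in the dump (an assumption).
struct record
{
    std::uint16_t type;    // e.g. 4 = contributor, 11 = comments
    std::string   content; // raw bytes, possibly containing newlines
};

// Read a 16-bit little-endian integer; the caller guarantees the range.
std::uint16_t read_u16le(std::string const& s, std::size_t pos)
{
    assert(pos + 2 <= s.size());
    unsigned int const lo = static_cast<unsigned char>(s[pos]);
    unsigned int const hi = static_cast<unsigned char>(s[pos + 1]);
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Split a byte stream into records. Record boundaries come only from
// the type-and-length prefix, never from the text inside a record.
std::vector<record> split_records(std::string const& s)
{
    std::vector<record> v;
    std::size_t pos = 0;
    while(pos < s.size())
    {
        if(s.size() - pos < 4)
            throw std::runtime_error("truncated record header");
        std::uint16_t const type = read_u16le(s, pos);
        std::uint16_t const len  = read_u16le(s, pos + 2);
        pos += 4;
        if(s.size() - pos < len)
            throw std::runtime_error("truncated record content");
        v.push_back({type, s.substr(pos, len)});
        pos += len;
    }
    return v;
}
```

Read this way, a comment whose last line is "Editor: A.S.MacDonald" stays inside the single type-11 record, because nothing in the content is ever inspected for headers.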
> It is actually part of "Comments:" in the binary files, but it surely
> looks like just another header (similar to e.g. "Contributor") in the text
> format.
Yes, to the naive reader, it would appear that this file has a novel
"Editor" record; but in fact it does not. Therefore...
> GC> I have two suggestions:
[...transposing them...]
> GC> (2) Use a regex like /[A-Za-z0-9]* *[A-Za-z0-9]*:/ on the assumption that
> GC> header names consist of one or two words followed by a colon. Deem any
> GC> colon that occurs later in the line to be content rather than markup.
This cannot work, because record content may itself contain text that
looks like a header. A "Contributor" specified as
"\nSource of data:\nTable number:\nContributor:"
cannot be parsed this way.
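
A small sketch of the failure, assuming the text format writes a record as its title, a colon, a space, and then the raw content (as the "Contributor: ..." line in the text output above suggests). The function names are hypothetical; the pattern is the one proposed in (2), anchored at the start of each line.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Apply the proposed "one or two words, then a colon" pattern to the
// start of every line and collect whatever it takes for a header name.
std::vector<std::string> guess_headers(std::string const& text)
{
    static std::regex const header("^([A-Za-z0-9]* *[A-Za-z0-9]*):");
    std::vector<std::string> names;
    std::istringstream iss(text);
    std::string line;
    while(std::getline(iss, line))
    {
        std::smatch m;
        if(std::regex_search(line, m, header))
            names.push_back(m[1].str());
    }
    return names;
}

// Hypothetical text rendering of ONE "Contributor" record whose content
// is the pathological string given above.
std::string pathological_contributor()
{
    return "Contributor: \nSource of data:\nTable number:\nContributor:\n";
}
```

On that input the pattern reports three headers for what is a single record, and it also fails to recognize "Source of data:" at all, since that title has three words.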
> Yes, I definitely need to do this to avoid at least the obvious false
> positives. The trouble with "Editor:" and "WARNING:" is that they're not
> really obvious, are they.
Actually, we must not do this. And "Editor:" and "WARNING:" are not
record titles and do not begin new records. Records are indicated
by prefixed bytes like EOT and VT. (Therefore, record content must
not include those bytes.)
> GC> (1) Build a whitelist of header names, and reject anything not on the list.
> GC> I imagine that this list will be short; I thought they were enumerated
> GC> in the 1990s code, and perhaps also in the HLP or GID documentation.
The place to start is the macro pseudo-enum in 'dectable.cpp':
#define DT_tableName 1
#define DT_tableNumber 2
...
#define DT_hashValue 18
IIRC, after they published that code, SOA added some new record types,
which are documented nowhere and can be discovered only by the sort of
testing you're doing.
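
One way such testing could report undocumented record types is sketched below. The table holds only the codes quoted in this message; the ones elided above are deliberately left out and would have to be taken from 'dectable.cpp'. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Record-type codes quoted in this message only (an incomplete list).
std::map<std::uint16_t, std::string> const& quoted_record_types()
{
    static std::map<std::uint16_t, std::string> const m =
        {{ 1, "tableName"}
        ,{ 2, "tableNumber"}
        ,{ 4, "contributor"}
        ,{ 5, "dataSource"}
        ,{ 8, "unitOfObs"}
        ,{11, "comments"}
        ,{18, "hashValue"}
        };
    return m;
}

// Given the type codes actually encountered while reading a database,
// return those that are absent from the table above.
std::set<std::uint16_t> unlisted_types(std::vector<std::uint16_t> const& seen)
{
    std::set<std::uint16_t> r;
    for(std::uint16_t t : seen)
    {
        if(0 == quoted_record_types().count(t))
            r.insert(t);
    }
    return r;
}
```

Because the check is on the numeric type code rather than on any text, strings such as "Editor:" or "WARNING:" inside a record's content can never be mistaken for a new record type.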
> Would we include "WARNING" in this whitelist?
No. It's not a record type.