From: Greg Chicares
Subject: Re: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Mon, 12 Mar 2012 19:19:10 +0000
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2

On 2012-03-09 18:27Z, Václav Slavík wrote:
> 
> I have some questions about the format:
> 
> (1) Are the elements under <cell> optional or required? As far as I
> can tell, the reading code is permissive and will use defaults if a
> value is missing, but should that be considered a valid file?

The schema should treat every element under <cell> as required,
even though the C++ code does not enforce such a requirement. Now,
why would two methods of reading xml files representing lmi input
deliberately behave differently in this respect?

Input::AscribeMembers() specifies the elements used in xml files
(and member_names() gives their names as a vector of strings).
Abstractly, each is a field; concretely, each is an xml element
and a C++ data member. From time to time, fields are added,
renamed, or removed, so there's a progression of
  class versions C0, C1, ... Cn, and
  xml   versions X0, X1, ... Xn.
For any lmi production release 'i', the fields in Ci and Xi are
identical (the mapping is a bijection) as long as the xml file
is written by lmi itself. This occurs naturally, because the
code that reads and writes the xml iterates over the list of
fields supplied by class Input.

Backward compatibility is achieved by the Input::redintegrate_*
functions. When we release version 'm', the class is updated,
and C0...Cm-1 no longer exist. Those functions translate
  X0 -> Cm
  X1 -> Cm
  ...
  Xm -> Cm
as well as can be done (but no better--e.g., a default value is
used for a field that didn't exist when the xml file was saved).
For exact version 'm', we could assert that the mapping Xm <-> Cm
is bijective, but we don't, for three reasons. The assertion would
be trivially true for files that lmi itself creates. External files
are likely to use an earlier version Xj, j<m, for which Cj no longer
exists. And such an assertion would make it less convenient for me
to experiment with changes.
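
A minimal sketch of that permissive reading, under stated assumptions:
it is not lmi's actual code. The names 'residuary_names' and
'detritus_map' are borrowed from the lmi snippet quoted in this
message, and are taken here to mean "fields the class has but the file
lacks" and "elements the file has but the class doesn't know".
Everything else ('string_map', 'read_result', 'read_permissively',
'is_bijective') is hypothetical, and every field is reduced to a
string for brevity.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string,std::string> string_map;

// Hypothetical result of reading one <cell> permissively.
struct read_result
{
    string_map               values;          // every current field (Cm)
    std::vector<std::string> residuary_names; // in Cm, absent from file
    string_map               detritus_map;    // in file, unknown to Cm
};

// 'defaults' maps each current field name to its default value.
// 'elements' maps each element name found in the file to its text.
inline read_result read_permissively
    (string_map const& defaults
    ,string_map const& elements
    )
{
    read_result r;
    r.values = defaults;
    for(auto const& e : elements)
        {
        auto const i = r.values.find(e.first);
        if(r.values.end() != i)
            {i->second = e.second;}               // known field: take file's value
        else
            {r.detritus_map[e.first] = e.second;} // unknown element: set aside
        }
    for(auto const& d : defaults)
        {
        if(0 == elements.count(d.first))
            {r.residuary_names.push_back(d.first);} // missing: default retained
        }
    // The result always holds exactly the current set of fields.
    assert(r.values.size() == defaults.size());
    return r;
}

// True only if the file's elements and the class's fields coincide
// exactly, i.e., the mapping Xm <-> Cm is bijective.
inline bool is_bijective(read_result const& r)
{
    return r.residuary_names.empty() && r.detritus_map.empty();
}
```

A strict reader would reject any file for which is_bijective() is
false; the permissive one merely records the discrepancies so that
later code can reconcile them.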

One problem that a schema can solve is that lmi's permissive code
accepts a version Xj' that's only a subset of Xj, but a later
version Ck may require the omitted elements, so that Ck doesn't
accept Xj'. That's exactly what happened not too long ago, and we
had to rush this code into production to fix the "breakage":

    // Defectively, some admin-system extracts lack various elements.
    // All deficient extracts happen to lack <AgentPhone>: they don't
    // even have an empty <AgentPhone/>; but that field is present in
    // every input file ever saved by any version of lmi.
    bool const deficient_extract = contains(residuary_names, "AgentPhone");
...
    // For "deficient" extracts, these required elements are missing.
    if(file_version < 6 && !deficient_extract)
        {
        LMI_ASSERT(contains(residuary_names, "SolveFromWhich"));
        LMI_ASSERT(contains(residuary_names, "SolveTgtAtWhich"));
        LMI_ASSERT(contains(residuary_names, "SolveToWhich"));
        LMI_ASSERT(contains(residuary_names, "UseDOB"));
        SolveFromWhich  = map_lookup(detritus_map, "DeprecatedSolveFromWhich");
        SolveTgtAtWhich = map_lookup(detritus_map, "DeprecatedSolveTgtAtWhich");
        SolveToWhich    = map_lookup(detritus_map, "DeprecatedSolveToWhich");
        UseDOB          = map_lookup(detritus_map, "DeprecatedUseDOB");
        }

Of course, that's not the only problem a schema would find. Another
example: an external system documented that it would pass US state
to us as the standard two-letter US Postal Service state code, but we
received a file with state "FC". "Forth Carolina"? Data error? No,
that's how they coded "Foreign Country".
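
The check that an enumerated type in a schema would perform can be
sketched in C++ as follows. This is only an illustration: the function
name 'is_permitted_state' is hypothetical, and the list of codes is
deliberately abbreviated to four entries, whereas a real schema would
enumerate every permitted code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Abbreviated on purpose: a real list would name every code that
// the schema permits.
inline bool is_permitted_state(std::string const& code)
{
    static std::array<char const*, 4> const permitted =
        {{"CT", "NC", "NY", "SC"}};
    return std::any_of
        (permitted.begin()
        ,permitted.end()
        ,[&code](char const* s) {return code == s;}
        );
}

// Usage sketch.
inline void demonstrate_state_check()
{
    assert( is_permitted_state("NC"));
    assert(!is_permitted_state("FC")); // "Foreign Country" is rejected
}
```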

Schema annotations would also be handy for narrative descriptions of
fields whose meaning is not intuitively obvious to everyone.

> (2) Does the order of elements under <cell> matter? (The output is
> always alphabetically sorted with current code, reading code doesn't
> care. It's marginally simpler to write the grammar if the order
> matters, but both are easily possible.)

I see no good reason to allow arbitrary order: let's specify that
elements must appear in the exact order given in the schema.

> (3) Are empty class_defaults and particular_cells allowed, or do they
> have to contain at least one cell each?

I'd rather say that each of
    case_default
    class_defaults
    particular_cells
must contain at least one <cell>.

> Also, do you want to describe previous versions of the format or just
> limit ourselves to the latest version (i.e. what is written by LMI)?
> (RELAX NG can easily use different grammars based on the value of the
> 'version' attribute, but it wouldn't be translatable into XML Schema
> and getting the historical data from SVN would probably be a bit of a
> pain, so I obviously prefer to stick to the latest.)

Let's ignore all pre-2010 versions, and support current version six
and later versions only: prior history doesn't matter.

But we will want a schema for version seven when it comes into being.
Not knowing much about schema languages, I was thinking it'd be okay
to have a separate schema for each of versions 6, 7, and so on. An
external system might still use version seven when lmi moves to version
eight--for instance, because the new version adds a field that wouldn't
mean anything to that external system (and those systems are costly to
update). Perhaps you'd design it with multiple files:
  schema-v6.rng
  schema-v7.rng [future]
  generally-useful-stuff.rng [potentially shared by -v6 and -v7]
but would it be so horrible to duplicate a file:
  generally-useful-stuff-v6.rng
  generally-useful-stuff-v7.rng
whose contents, after all, might change? I suppose that would be less
bad than keeping obsolete versions of C++ classes, because in C++ we
need the freedom to refactor parts of a large system, and every '.cpp'
file adds to compilation overhead.
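
With one grammar file per version, code that validates a file might
choose the grammar as sketched below. The function
'schema_filename' and the constant 'oldest_supported_version' are
hypothetical; only the "schema-v6.rng" naming pattern and the decision
to support version six and later come from this message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Oldest file version for which a schema is to be written.
constexpr int oldest_supported_version = 6;

// Hypothetical helper: name the grammar file for a given file
// version, following the "schema-v6.rng" pattern.
inline std::string schema_filename(int file_version)
{
    if(file_version < oldest_supported_version)
        {
        throw std::invalid_argument
            ("No schema for file version " + std::to_string(file_version));
        }
    return "schema-v" + std::to_string(file_version) + ".rng";
}

// Usage sketch.
inline void demonstrate_schema_filename()
{
    assert("schema-v6.rng" == schema_filename(6));
    assert("schema-v7.rng" == schema_filename(7));
}
```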


