Re: data sets and caching

pspp-dev

From:	Jason Stover
Subject:	Re: data sets and caching
Date:	Tue, 1 Nov 2005 22:12:37 +0000
User-agent:	Mutt/1.4.2.1i

On Mon, Oct 31, 2005 at 03:09:29PM -0800, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
> 
> > On Mon, Oct 31, 2005 at 10:25:20AM -0800, Ben Pfaff wrote:
> >> Jason Stover <address@hidden> writes:
> >> 
> >> > I need to be able to append residuals to the active file
> >> > with a 'save' subcommand. How should I go about this?
> >> 
> >> Would you like to save them for a single session only,
> >> or should it be possible to save them to disk and retrieve them
> >> in later sessions?
> >
> > Good question. I had intended to save them to the working data file,
> > as the SPSS SAVE subcommand does in its regression procedure.  Users
> > mostly like to look at residuals and run tests on them after the model
> > has been fit. But if this working data file is written to disk, the
> > residuals are written with it, and can be used later. 
> 
> Ah, so the models would be included as part of the working file
> dictionary?  That's a workable idea.  (SPSS already does
> something like this, you say?)
> 

Maybe. I'm not sure where to store the 'model object'. Below is
a description of what SPSS does. Residuals would definitely be
in the dictionary.

First, I should draw a distinction between 'residuals' and 'models'
here. Pardon me if I'm saying something everyone already knows.

The original question above was about 'residuals', not the entire
model. The 'model' as an object inside PSPP should be a collection
of estimated parameters, some other information, and, in some cases, one
or more pointers to functions that use those parameters to make
predictions. Exactly what belongs in a 'model object' depends on what
the 'model' is in a statistical sense, and how someone might use that
model.
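To make the idea concrete, here is a rough sketch of such a 'model
object'. It is written in Python only for brevity, and every name in it
(ModelObject, linear_predict, the 'info' field) is made up for
illustration; none of it is existing PSPP code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ModelObject:
    """Hypothetical container; none of these names come from PSPP."""
    kind: str                           # what the model is, statistically
    parameters: Dict[str, float]        # estimated parameters
    # Stand-in for a 'pointer to a function' that uses the parameters.
    predict: Callable[[Dict[str, float], float], float]
    info: Dict[str, object] = field(default_factory=dict)  # other information


def linear_predict(parameters: Dict[str, float], x: float) -> float:
    """Prediction for a one-variable linear regression."""
    return parameters["b0"] + parameters["b1"] * x


model = ModelObject(
    kind="linear regression",
    parameters={"b0": 1.0, "b1": 2.0},
    predict=linear_predict,
    info={"n_cases": 4},
)
prediction = model.predict(model.parameters, 3.0)
```

A different kind of model would carry different parameters and a
different prediction function, which is why the contents can't be fixed
in advance.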

The 'residuals' in linear regression are the errors incurred by using the
model to predict the dependent variable. E.g., if we have a variable Y
and fit a regression model with a single explanatory variable X:

        Y[i] \approx b0 + b1 * X[i]

where i is the case number, then the residuals are the values 

        Y[i] - (b0 + b1 * X[i])

and there are as many residuals as cases.
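As a small worked illustration, the sketch below fits b0 and b1 by
ordinary least squares and computes one residual per case. The function
names and the data are invented for the example, and it assumes the
simple one-variable case described above.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for Y[i] ~ b0 + b1 * X[i]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(x, y, b0, b1):
    """One residual per case: Y[i] - (b0 + b1 * X[i])."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Made-up data: four cases.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

b0, b1 = fit_simple_regression(x, y)
res = residuals(x, y, b0, b1)
```

With an intercept in the model, the least-squares residuals sum to zero
up to rounding error, which is a handy sanity check.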

The 'save' subcommand appends the residuals to the current working
file, as a new variable. The residuals aren't really part of a
'statistical model', but some 'model caches' in PSPP should probably
include residuals. And yes, the residuals should be included in the
dictionary, to answer the question above.
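A sketch of what 'append as a new variable' could mean, treating the
working file as a plain mapping from variable names to columns. That
representation, the save_residuals function, and the variable name
RES_1 are all assumptions made for the example, not PSPP's actual data
structures.

```python
def save_residuals(working_file, name, residuals):
    """Append residuals to the working file as a new variable.

    working_file is assumed to map variable names to equal-length
    lists of values, one value per case.
    """
    if name in working_file:
        raise ValueError(f"variable {name!r} already exists")
    for column in working_file.values():
        if len(column) != len(residuals):
            raise ValueError("need exactly one residual per case")
    working_file[name] = list(residuals)


# Made-up working file with two variables and three cases.
working_file = {"X": [1.0, 2.0, 3.0], "Y": [2.1, 3.9, 6.2]}
save_residuals(working_file, "RES_1", [0.05, -0.20, 0.15])
```

Once the new variable is in the working file, writing that file to disk
carries the residuals along with it.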

About saving model information in the dictionary: Though SPSS does not
save the entire 'model', it has a 'matrix' subcommand. That subcommand
saves some model information either on disk, or in the working file. In
the latter case, the old working file disappears. This subcommand
seems kludgy for two reasons. First, the subcommand does not
save enough relevant information about the model. Second, by failing
to create a nice, reusable data structure 'behind the scenes', it does
not allow a user to name a model and use it later.

I don't think we want a model object to overwrite the working file
(except to make PSPP a 'clone'), but we may want to include 
a pointer to a model in the dictionary of the data set used
to build that model. 

In any case, users will want to be able to refer back to old models,
but not necessarily the data sets used to fit them, and vice versa.
Therefore, the models and data sets/dictionaries should be distinct.

> Should it be possible to save them to and retrieve them from
> separate files?  (Maybe the SAVE/XSAVE command could support an
> option that saves models without associated data.)

I think this is a swell idea. Right now, SPSS' OUTFILE subcommand with
the MODEL keyword saves the model information as XML. 

Given that PSPP should be able to save a model object, then load it
later, perhaps with an entirely different, inappropriate data set, the
model object should probably store some information from the
dictionary in use at the time the model was created. In particular,
that model object should know if a user asks it to do something
impossible, such as using a string variable in a place where a numeric
variable belongs.
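Here is a sketch of that kind of check. It assumes the model keeps a
copy of the name and type of each variable it was built from; the
StoredModel class, the check_dictionary method, and the type labels are
hypothetical and not part of PSPP.

```python
NUMERIC = "numeric"
STRING = "string"


class StoredModel:
    """Hypothetical model that remembers the variables it was fit on."""

    def __init__(self, parameters, variable_types):
        self.parameters = dict(parameters)
        # Copied from the dictionary in use when the model was created.
        self.variable_types = dict(variable_types)

    def check_dictionary(self, dictionary):
        """Return a list of problems with using this model on a data set.

        dictionary maps variable names to NUMERIC or STRING.
        """
        problems = []
        for name, expected in self.variable_types.items():
            actual = dictionary.get(name)
            if actual is None:
                problems.append(f"{name}: missing from data set")
            elif actual != expected:
                problems.append(f"{name}: expected {expected}, found {actual}")
        return problems


model = StoredModel({"b0": 0.0, "b1": 1.9}, {"X": NUMERIC, "Y": NUMERIC})

ok = model.check_dictionary({"X": NUMERIC, "Y": NUMERIC, "Z": STRING})
bad = model.check_dictionary({"X": STRING})
```

The model only needs the names and types of the variables it uses, not
the whole dictionary, so it stays independent of any one data set.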

-Jason

-- 
address@hidden
SDF Public Access UNIX System - http://sdf.lonestar.org
