data sets and caching
From: Jason Stover
Subject: data sets and caching
Date: Mon, 31 Oct 2005 16:42:57 +0000
User-agent: Mutt/1.4.2.1i
(Apologies for the length of the message.)
I need to be able to append residuals to the active file
with a 'save' subcommand. How should I go about this?
This question raises others about data handling and model fitting.
Currently PSPP, SPSS, SAS and most commercial statistical software
are designed so that a user reads data, fits a model, and directly
gets output meant for humans to read, as this kind of syntax
exemplifies:
    regression /variables=v0 v1 v2 /statistics default /dependent=v2 /method=enter.
The data are read, a linear model is fit, and the user sees some
output like an anova table and parameter estimates. The current 'backend'
of PSPP, which does not cache any of its results, is limited
to this type of use.
Here is an example of syntax that shows what users would want
to be able to do (I'm using hypothetical syntax to illustrate
the idea):
    regression /data=train_data /variables=v0 v1 v2 /statistics default
        /dependent=v2 /method=enter /name=model1.
    nlr /data=train_data /variables=v0 v1 v2 /statistics default /dependent=v2
        /method=enter /name=model2.
    model_compare /data=test_data model1 model2 /criteria ssresid absdev.
This syntax illustrates two design changes that would make PSPP more flexible
for users.
1. The user can name the output from any procedure. The
procedure then creates a structure that is visible to other procedures
that might use it. This kind of design would allow for easier
comparison of models, as the hypothetical 'model_compare' statement
shows. It would also allow users to combine models, like this:
    model_create /data=train_data /test=test_data /models model1 model2
        /method=grad_boost.
This use allows users to build models from other models. The
practice has become popular in the last decade, and it is something
users will want. In R, each modeling procedure creates a model
'object', and this model object can then be examined/evaluated/used by
other procedures. The creation of caches by statistical procedures is
one reason for the popularity of R and Clementine. Clementine can
do something like the 'model_compare' statement above. SAS can do this to
a limited degree; SPSS cannot.
I have tried to show how a procedure can create such a cache with
the regression procedure, which creates a pspp_linreg_cache. If the
regression procedure gave that cache a name (assuming the user wanted
to do so), I could just remove the 'pspp_linreg_cache_free (lcache);'
statement in regression.q and the cache could be used later. A garbage
collector could free the memory when the cache is no longer needed.
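A minimal sketch, in C, of what keeping a cache alive under a name could
look like. Everything here is an assumption made for illustration: the
names named_cache, cache_register, cache_lookup and cache_free_all do not
exist in PSPP, and the cache is treated as an opaque pointer so nothing is
assumed about the layout of pspp_linreg_cache. The destroy callback stands
in for whatever function frees a given kind of cache, and cache_free_all
is a crude stand-in for the garbage collector mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical registry entry: a named, opaque cache plus the function
   that frees it.  None of these names exist in PSPP. */
struct named_cache
  {
    char *name;
    void *cache;
    void (*destroy) (void *);
    struct named_cache *next;
  };

static struct named_cache *registry = NULL;

/* Return the entry called NAME, or NULL if there is none. */
static struct named_cache *
find_entry (const char *name)
{
  struct named_cache *e;
  for (e = registry; e != NULL; e = e->next)
    if (strcmp (e->name, name) == 0)
      return e;
  return NULL;
}

/* Store CACHE under NAME.  An existing cache with the same name is
   destroyed and replaced.  Returns 0 on success, -1 if out of memory. */
int
cache_register (const char *name, void *cache, void (*destroy) (void *))
{
  struct named_cache *e;

  assert (name != NULL);
  e = find_entry (name);
  if (e != NULL)
    {
      if (e->destroy != NULL)
        e->destroy (e->cache);
    }
  else
    {
      e = malloc (sizeof *e);
      if (e == NULL)
        return -1;
      e->name = malloc (strlen (name) + 1);
      if (e->name == NULL)
        {
          free (e);
          return -1;
        }
      strcpy (e->name, name);
      e->next = registry;
      registry = e;
    }
  e->cache = cache;
  e->destroy = destroy;
  return 0;
}

/* Return the cache stored under NAME, or NULL if there is none. */
void *
cache_lookup (const char *name)
{
  struct named_cache *e = find_entry (name);
  return e != NULL ? e->cache : NULL;
}

/* Destroy every registered cache and empty the registry. */
void
cache_free_all (void)
{
  while (registry != NULL)
    {
      struct named_cache *e = registry;
      registry = e->next;
      if (e->destroy != NULL)
        e->destroy (e->cache);
      free (e->name);
      free (e);
    }
}
```

With something like this, a procedure given a /name would call
cache_register instead of freeing its cache, and a later procedure would
call cache_lookup with the same name.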
2. Users can name data sets to be used in a procedure. Then PSPP could
fit models to different data sets and evaluate them using a 'test'
data set. PSPP could also be made to manipulate multiple data sets
(such as merging them). SAS users spend a lot of time sorting,
merging, concatenating and de-duplicating data sets. SPSS does not
allow this, and that is one reason for SAS' popularity. PSPP's
inability to do this makes it less attractive to users. I know
this functionality lies beyond cloning SPSS, but it is functionality
users find important, and other free statistical software can't do it
(as far as I know). R names each data set, and it can sort, but users
cannot combine and de-duplicate data sets as easily as they can with
SAS. R cannot work with the large data sets that SAS can use, either.
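As a rough illustration of the concatenate, sort and de-duplicate steps
described above, here is a small C sketch. It is not how SAS or PSPP
handles data: struct obs, dataset_concat and dataset_sort_dedup are
hypothetical names, the data sets are plain in-memory arrays, and two
cases are assumed to be duplicates when they share the same id.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical case: one key and one value. */
struct obs
  {
    int id;
    double value;
  };

/* qsort comparison function: order cases by ascending id. */
static int
compare_obs_id (const void *a_, const void *b_)
{
  const struct obs *a = a_;
  const struct obs *b = b_;
  return (a->id > b->id) - (a->id < b->id);
}

/* Concatenate A (N_A cases) and B (N_B cases) into a newly allocated
   array and store its length in *N_OUT.  Returns NULL if the result
   would be empty or if memory runs out.  The caller frees the result. */
struct obs *
dataset_concat (const struct obs *a, size_t n_a,
                const struct obs *b, size_t n_b, size_t *n_out)
{
  struct obs *all;

  assert (n_out != NULL);
  *n_out = 0;
  if (n_a + n_b == 0)
    return NULL;
  all = malloc ((n_a + n_b) * sizeof *all);
  if (all == NULL)
    return NULL;
  if (n_a > 0)
    memcpy (all, a, n_a * sizeof *all);
  if (n_b > 0)
    memcpy (all + n_a, b, n_b * sizeof *all);
  *n_out = n_a + n_b;
  return all;
}

/* Sort CASES by id, then keep one case per id, in place.  Returns the
   number of cases kept.  qsort is not guaranteed to be stable, so which
   of several cases with the same id survives is unspecified. */
size_t
dataset_sort_dedup (struct obs *cases, size_t n)
{
  size_t i, n_unique;

  if (n == 0)
    return 0;
  qsort (cases, n, sizeof *cases, compare_obs_id);
  n_unique = 1;
  for (i = 1; i < n; i++)
    if (cases[i].id != cases[n_unique - 1].id)
      cases[n_unique++] = cases[i];
  return n_unique;
}
```

Data sets too large for memory would need an external sort instead of
qsort; the sketch only shows the shape of the operations.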
I know implementing these ideas might be a lot of work, but they would
make PSPP immensely more useful. I do not think the model-caching is beyond
the plan for PSPP since most of it (as far as I can see) involves making
procedures create objects that can be used later. I do not know as
much about data-shuffling, so I can't comment on that.
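To make "objects that can be used later" concrete, here is a rough C
sketch of what the hypothetical 'model_compare' statement might compute
once two fitted models are available. All names are assumptions for
illustration: struct linear_model is a made-up stand-in for a fitted
model with predictors v0 and v1 and dependent variable v2, and the two
criteria are read as the sum of squared residuals (ssresid) and the sum
of absolute deviations (absdev) on the test data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fitted model: v2 = intercept + coef[0]*v0 + coef[1]*v1. */
struct linear_model
  {
    double intercept;
    double coef[2];
  };

/* Hypothetical test case: predictors v0, v1 and observed v2. */
struct test_case
  {
    double x[2];
    double y;
  };

/* Predicted value of the dependent variable for predictors X. */
double
model_predict (const struct linear_model *m, const double x[2])
{
  return m->intercept + m->coef[0] * x[0] + m->coef[1] * x[1];
}

/* Sum of squared residuals of M over the N cases in DATA. */
double
model_ssresid (const struct linear_model *m,
               const struct test_case *data, size_t n)
{
  double sum = 0.0;
  size_t i;

  assert (m != NULL);
  for (i = 0; i < n; i++)
    {
      double r = data[i].y - model_predict (m, data[i].x);
      sum += r * r;
    }
  return sum;
}

/* Sum of absolute residuals of M over the N cases in DATA. */
double
model_absdev (const struct linear_model *m,
              const struct test_case *data, size_t n)
{
  double sum = 0.0;
  size_t i;

  assert (m != NULL);
  for (i = 0; i < n; i++)
    {
      double r = data[i].y - model_predict (m, data[i].x);
      sum += r < 0.0 ? -r : r;
    }
  return sum;
}
```

A comparison procedure would evaluate both functions for each named model
on the same test data and report the values side by side.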
-Jason
--
address@hidden
SDF Public Access UNIX System - http://sdf.lonestar.org