pspp-dev
Re: contribution and cvs


From: Ben Pfaff
Subject: Re: contribution and cvs
Date: Thu, 01 Sep 2005 13:07:11 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Jason Stover <address@hidden> writes:

> Recoding a categorical variable's values as vectors with binary
> entries is a basic necessity for most statistical procedures which
> use categorical data. PSPP must pass the data once to recode
> those values, so it would be nice if the struct variable held those
> binary vectors, even after the procedure that created them exits, thereby
> making the vectors available to the next procedure. There would be one
> binary vector per distinct value.
>
> But, by the comment above, v->aux can hold the binary vectors only until
> someone else needs to hold other auxiliary data.
>
> The code I wrote before did not add anything to the struct variable,
> but to make it work I had to create a struct
> recoded_categorical_array. The recoded_categorical_array is cumbersome
> and would be unnecessary if the variable values could be stored inside
> the struct variable.  So may I/we/someone add a gsl_matrix * to the
> definition of struct variable? Doing so will make a lot of numerical
> routines easier to write.

I'm not against adding a member if that's the best solution, but
I'd like to learn more.  If I understand correctly, the primary
purpose of the matrix is to identify the values that the variable
actually takes on.  Is that correct?  If so, then I have two
concerns.

First, how do we track changes to the data in the active file
between procedures?  If the user does something like
        COMPUTE x = x + 1.
or
        SELECT IF x NE 1.
then this means that we have to invalidate the cache, but
currently there isn't any mechanism for that.  We want such a
cache and invalidation mechanism for other reasons, so it's
becoming increasingly clear to me that it's something to
implement soon, but it's not there yet.
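
One possible shape for such an invalidation mechanism is a generation
counter: anything that may change the data bumps the counter, and each
cached summary remembers the counter value it was built at.  The sketch
below is only an illustration under that assumption.  Every name in it
(struct active_file, struct var_cache, active_file_modified, and so on)
is hypothetical and not existing PSPP code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the active file: a counter that is bumped
   every time the data may have changed. */
struct active_file
  {
    unsigned long generation;
  };

/* Hypothetical per-variable cache slot, stamped with the generation
   at which its contents were computed. */
struct var_cache
  {
    void *data;                 /* Cached summary, owned by the caller. */
    unsigned long generation;   /* Generation when DATA was stored. */
    bool filled;                /* False until the first store. */
  };

/* To be called after anything that may alter the data, such as
   COMPUTE or SELECT IF. */
static void
active_file_modified (struct active_file *af)
{
  assert (af != NULL);
  af->generation++;
}

/* Stores DATA in the cache and stamps it with the current generation. */
static void
var_cache_store (struct var_cache *c, const struct active_file *af,
                 void *data)
{
  assert (c != NULL && af != NULL);
  c->data = data;
  c->generation = af->generation;
  c->filled = true;
}

/* Returns the cached data if it is still valid, otherwise NULL. */
static void *
var_cache_lookup (const struct var_cache *c, const struct active_file *af)
{
  assert (c != NULL && af != NULL);
  if (c->filled && c->generation == af->generation)
    return c->data;
  return NULL;
}
```

A procedure would call var_cache_lookup first and rebuild the summary
only when it gets NULL back.  A single global counter is coarse: it
throws away every variable's cache even if only x changed, which may or
may not be acceptable.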

Second, is this the best way to represent this data (as you say)?
If I'm correct in that the matrix mainly identifies a variable's
data values, then perhaps we should really be storing a frequency
table for the variable, and transforming that into the
appropriate gsl_matrix as needed.  After all, a lot of procedures
could find a frequency table useful (right?).
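
To make the frequency-table idea concrete, here is a minimal sketch,
assuming a small fixed-size table and a linear search.  The names
(struct freq_table, freq_table_add, freq_table_indicator, MAX_VALUES)
are hypothetical, and the indicator vector is written into a plain
array of doubles rather than a gsl_matrix, to keep the example free of
dependencies.  It uses one entry per distinct value and does not drop a
reference category.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VALUES 16   /* Arbitrary cap to keep the sketch simple. */

/* One distinct value and how often it was seen. */
struct freq_entry
  {
    double value;
    size_t count;
  };

/* Hypothetical frequency table, in order of first appearance. */
struct freq_table
  {
    struct freq_entry entries[MAX_VALUES];
    size_t n;                   /* Number of distinct values so far. */
  };

/* Records one observation of VALUE.  Returns the index of its entry,
   or -1 if the table is full. */
static int
freq_table_add (struct freq_table *ft, double value)
{
  size_t i;

  assert (ft != NULL);
  for (i = 0; i < ft->n; i++)
    if (ft->entries[i].value == value)
      {
        ft->entries[i].count++;
        return (int) i;
      }
  if (ft->n >= MAX_VALUES)
    return -1;
  ft->entries[ft->n].value = value;
  ft->entries[ft->n].count = 1;
  ft->n++;
  return (int) (ft->n - 1);
}

/* Writes the binary vector for VALUE into OUT[0] ... OUT[ft->n - 1]:
   1 in the position of VALUE's entry, 0 elsewhere.  Returns 1 if
   VALUE is in the table, 0 if not (OUT is then all zeros). */
static int
freq_table_indicator (const struct freq_table *ft, double value,
                      double *out)
{
  size_t i;
  int found = 0;

  assert (ft != NULL && out != NULL);
  for (i = 0; i < ft->n; i++)
    {
      int match = ft->entries[i].value == value;
      out[i] = match ? 1.0 : 0.0;
      found |= match;
    }
  return found;
}
```

The table is built in the single pass over the data, and the binary
vectors are derived from it on demand.  A real version would want a
hash table or sorted array instead of the linear search, and would have
to handle string values as well as numbers.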
-- 
"In this world that Hugh Hefner had made,
 he alone seemed forever bunnyless."
--John D. MacDonald



