pspp-dev

Re: covariance test success


From: John Darrington
Subject: Re: covariance test success
Date: Sat, 21 Nov 2009 08:12:00 +0000
User-agent: Mutt/1.5.18 (2008-05-17)

Interactions are not so onerous as they appear.   If I have grasped the
idea correctly, we can generalise src/math/categoricals.c to calculate
even the most complicated interactions like Jason describes.  After all,
a categorical variable is a degenerate case of an interaction (one which
has only a single variable). 
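
A minimal sketch of that idea, for illustration only. The struct layout and the names `struct interaction` and `interaction_n_combinations` are assumptions, not the contents of src/math/categoricals.c. A plain categorical variable is simply the case n_vars == 1.

```c
#include <stddef.h>

/* Hypothetical layout, not the actual PSPP struct: an interaction is a
   list of categorical variables, each described here only by its number
   of categories.  A plain categorical variable has n_vars == 1. */
struct interaction
  {
    size_t n_vars;          /* Number of variables interacting. */
    const size_t *n_cats;   /* n_cats[i]: category count of variable i. */
  };

/* Number of distinct value combinations of the interaction: the product
   of the per-variable category counts. */
size_t
interaction_n_combinations (const struct interaction *iact)
{
  size_t total = 1;
  for (size_t i = 0; i < iact->n_vars; i++)
    total *= iact->n_cats[i];
  return total;
}
```

With this shape, code that sizes the covariance matrix would not need to distinguish a single categorical variable from an interaction of several.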

So if we implement all the routines using the existing src/math/categoricals.c
and src/math/covariance.c then I believe we can extend to routines requiring
interactions without changing anything other than those modules (and some
syntax parsing of course).

One problem with the existing src/math/categoricals.c is that it doesn't
properly handle missing values.  Missing values create more problems than is
immediately apparent, because they add another dimension to the task.  In
fact I think that missing values are going to be a more challenging problem
than interactions.

So, to answer the question what to do next, I propose:

1. Change REGRESSION etc. to use the new covariance routines, and test thoroughly.

2. Fix up the missing value problem and write some rigorous test routines.

3. Drop the existing src/data/category.c and covariance-matrix.[ch].

4. Merge?

5. Extend src/math/categoricals.c to work with interactions between arbitrary
   combinations of variables.  Possibly the file should be renamed to
   src/math/interactions.c or similar?

6. Write a super-duper GLM routine using the new modules.


This programme does mean that user visible changes are implemented last, but
I would rather have robust features introduced later than fragile ones earlier.
Also, as I said, my biggest doubt at the moment is the missing values vs.
categoricals issue.  To my mind, it's the single issue which is most likely to
require major redesign of categoricals.c - if it does, then I'd rather find out
early.


J'


On Fri, Nov 20, 2009 at 04:57:24PM -0500, Jason Stover wrote:
     On Fri, Nov 20, 2009 at 03:44:11PM -0500, Jason Stover wrote:
     > So when do we merge? 
     
     Not yet, I think.
     
     I was just looking at what needs to be done to make interactions
     possible for the GLM procedure. I also discussed this with Ben
     via IRC. 
     
     It seems that adding the interactions is going to be trickier than
     just fixing the code in interaction.c. An interaction for us is just a
     product of values of two or more variables. So, for example, if var1
     and var2 interact, we would need to compute all possible combinations
     of values of var1 and var2. Each of these combinations would go into
     computing the covariance matrix, just as any other values would.
     
     So an "interaction" must be like a variable, in that it has at least
     one column in a covariance matrix.
     
     Next, if var1 and var2 are numeric, a "combination" of their values is
     just their product. This is easy to compute as we pass the data. So to
     include the interaction of var1 and var2 in the covariance matrix, we
     would just make a new variable, pass that to the constructor for the
     covariance matrix, and for each case in our data-reading loop, compute
     the product of the values of var1 and var2, append that to the case,
     and send that case along to covariance_accumulate_pass[12].
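
The numeric case described above can be sketched as follows. The helper name `numeric_interaction` is hypothetical. It takes the numeric values of one case and returns their product, which is the value that would be appended to the case before it is passed on.

```c
#include <stddef.h>

/* Hypothetical helper: return the product of the N_VARS numeric values
   in VALUES, i.e. the value of their interaction for one case. */
double
numeric_interaction (const double *values, size_t n_vars)
{
  double product = 1.0;
  for (size_t i = 0; i < n_vars; i++)
    product *= values[i];
  return product;
}
```

Because it loops over all the values, the same helper also covers an interaction of more than two numeric variables.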
     
     The complication enters if var1 is categorical and var2 is
     numeric. Then, instead of having bit-vectors as computed in
     category.c, we would need the bit vector from var1 multiplied by the
     numeric value from var2. So for example, if we
     encountered var1's value 'a', encoded that as (0 0 1 0), and a 2.2 for
     var2, then we would need to use (0 0 2.2 0) in the computation of the
     covariance matrix. This raises some obvious questions about what that
     interaction should be: It can't be a variable because it has both
     categorical and numeric attributes. How should it be appended to the
     case being read? How should covariance.c deal with it?
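
The mixed case above can be sketched like this. The function name and the convention that `set_bit` gives the position of the single 1 in the category's bit vector are assumptions for illustration, not how category.c represents its encoding.

```c
#include <stddef.h>

/* Hypothetical helper: write the interaction of a categorical value and
   a numeric value X into OUT[0..N_BITS-1].  SET_BIT is the position of
   the single 1 in the category's bit vector; passing SET_BIT >= N_BITS
   stands for a category encoded as all zeros. */
void
scaled_indicator (size_t set_bit, size_t n_bits, double x, double *out)
{
  for (size_t i = 0; i < n_bits; i++)
    out[i] = (i == set_bit) ? x : 0.0;
}
```

With set_bit = 2, n_bits = 4 and x = 2.2, this fills OUT with (0 0 2.2 0), matching the example in the paragraph above.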
     
     There is a further complication if both var1 and var2 are
     categorical. Now we must encode the interaction as a bit vector for
     its use in computing the covariance. So for example, if we see 'a' for
     var1 and 'b' for var2, we should encode that as, say, (0 0 1 0 0
     0 0). Now if we have n categories for var1 and m categories for var2,
     then we would have n*m categories for var1 interacting with var2,
     which means we would need a bit vector of length n*m - 1 to handle the
     interaction between var1 and var2. Where should this be stored? Maybe
     some function to smash the two values together and append it to the
     case being read? I don't know.
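
One possible way to build the bit vector of length n*m - 1 described above, as a sketch under stated assumptions. Category indices I and J are taken to be zero-based, the combined category is numbered I*M + J, and the last combination is arbitrarily chosen as the one encoded by all zeros. None of this is taken from the PSPP sources.

```c
#include <stddef.h>

/* Hypothetical helper: encode category I (of N) of one variable together
   with category J (of M) of another as a bit vector of length N*M - 1 in
   OUT.  The combined category index is I*M + J; the last combination
   (index N*M - 1) is encoded as all zeros. */
void
encode_pair (size_t i, size_t n, size_t j, size_t m, double *out)
{
  size_t len = n * m - 1;
  size_t combined = i * m + j;
  for (size_t k = 0; k < len; k++)
    out[k] = (k == combined) ? 1.0 : 0.0;
}
```

For n = 2 and m = 4, the pair i = 0, j = 2 gives (0 0 1 0 0 0 0), a vector of length 7 like the one in the example above.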
     
     Here is a further complication: The user could specify any number of
     variables in an interaction. So instead of var1 interacting with var2,
     the user could specify var1, var2,... vark all interacting
     together. This would be a bad idea for most experimental designs, but
     it is computationally just fine.
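
For k categorical variables, the pairwise numbering extends naturally to a mixed-radix index. This is a sketch only: the function name is hypothetical, and zero-based category indices are assumed.

```c
#include <stddef.h>

/* Hypothetical helper: map the zero-based category indices IDX[0..K-1]
   of K interacting variables, where variable v has N_CATS[v] categories,
   to a single combined index in the range
   [0, N_CATS[0] * ... * N_CATS[K-1]). */
size_t
combined_index (const size_t *idx, const size_t *n_cats, size_t k)
{
  size_t combined = 0;
  for (size_t v = 0; v < k; v++)
    combined = combined * n_cats[v] + idx[v];
  return combined;
}
```

The combined index can then be encoded as a bit vector in the same way as for a single categorical variable, which is why the computation stays simple even when the design does not.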
     
     So the question of how to make interactions seems difficult because
     its answer must involve reading cases, computing new variables, and
     encoding vectors from strings and numeric values. I'm asking how to
     do this here, because the last time I tried it I made a mess. But it
     is important, and necessary for a GLM command and many other modeling
     procedures.
     
     Any suggestions? (John, you want to just code this up over lunch?)
     
     
     > 
     > And what to do next? Here is a list of tasks that
     > stem from having the new covariance.[ch]:
     > 
     > 1. Change linreg.c, coefficient.c and regression.q to use the new
     > covariance routines.
     > 
     > 2. Drop src/data/category.c and covariance-matrix.[ch].
     > 
     > 3. Rewrite interaction.c to use covariance.c.
     > 
     > I would prefer to finish a GLM before changing linreg.c too much, but
     > I'm afraid doing so will just make more work later. Also, linreg.c
     > will have to be changed to use the new covariance struct anyway, and
     > doing so without dropping its current behavior of using the entire
     > data set would make it a lot uglier in the meantime.
     > 
     > 
     > 
     > _______________________________________________
     > pspp-dev mailing list
     > address@hidden
     > http://lists.gnu.org/mailman/listinfo/pspp-dev
     
     

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

