Re: bug in covariance.c/categoricals.c
From: John Darrington
Subject: Re: bug in covariance.c/categoricals.c
Date: Sun, 12 Jun 2011 20:42:23 +0000
User-agent: Mutt/1.5.18 (2008-05-17)
You're right. It's a problem of semantics. The current implementation
encodes each categorical variable into N columns, where N is the number
of distinct values. For the covariance matrix we require N - 1 columns.
Can you try the attached patch? I think it will solve your problem.
(It causes all the ONEWAY tests to fail, but we can think about that later.)
J'
On Sun, Jun 12, 2011 at 12:40:37PM -0400, Jason Stover wrote:
I have found a bug in the computation of the covariance matrix when
categorical variables are involved. It is very subtle, and wasn't easy
to find because it's caused by an inconsistency in the way
covariance.c and categoricals.c interpret the encoded categorical
values, rather than a straightforward miscomputation.
Here is the syntax to generate the problem:
data list list / v0 v1 v2.
begin data
3.2 1 1
3.1 1 1
3.3 1 2
3.4 1 2
3.2 1 3
3.3 1 3
3.3 1 4
3.2 1 4
2.8 2 1
2.9 2 1
3.3 2 2
3.0 2 2
3.1 2 3
3.2 2 3
3.2 2 4
3.1 2 4
end data
GLM v0 by v1 v2
/INTERCEPT = include.
dump_matrix in glm.c gives this as the covariance matrix:
0.378 0.700 -0.700 -0.650 0.350
0.700 4.000 -4.000 0.000 0.000
-0.700 -4.000 4.000 0.000 0.000
-0.650 0.000 0.000 3.000 -1.000
0.350 0.000 0.000 -1.000 3.000
Examining how this matrix was computed showed this to be the
encoding covariance.c used for the data:
3.2 1 0 1 0
3.1 1 0 1 0
3.3 1 0 0 1
3.4 1 0 0 1
3.2 1 0 0 0
3.3 1 0 0 0
3.3 1 0 0 0
3.2 1 0 0 0
2.8 0 1 1 0
2.9 0 1 1 0
3.3 0 1 0 1
3.0 0 1 0 1
3.1 0 1 0 0
3.2 0 1 0 0
3.2 0 1 0 0
3.1 0 1 0 0
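As a cross-check, the matrix printed by dump_matrix can be reproduced from the encoding above. This is a minimal sketch in Python, not PSPP code, and it assumes the printed values are sums of squares and cross-products about the column means (not divided by the number of cases), which is what the numbers appear to be. The function name centered_cross_products is made up for this illustration.

```python
# v0 followed by the four binary columns, copied from the encoding above.
rows = [
    (3.2, 1, 0, 1, 0), (3.1, 1, 0, 1, 0), (3.3, 1, 0, 0, 1), (3.4, 1, 0, 0, 1),
    (3.2, 1, 0, 0, 0), (3.3, 1, 0, 0, 0), (3.3, 1, 0, 0, 0), (3.2, 1, 0, 0, 0),
    (2.8, 0, 1, 1, 0), (2.9, 0, 1, 1, 0), (3.3, 0, 1, 0, 1), (3.0, 0, 1, 0, 1),
    (3.1, 0, 1, 0, 0), (3.2, 0, 1, 0, 0), (3.2, 0, 1, 0, 0), (3.1, 0, 1, 0, 0),
]


def centered_cross_products(rows):
    """Sum over cases of (x_i - mean_i) * (x_j - mean_j) for every column pair."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) for j in range(k)]
        for i in range(k)
    ]


matrix = centered_cross_products(rows)
for line in matrix:
    # "+ 0.0" turns a rounded negative zero into plain zero for display.
    print(" ".join(f"{round(x, 3) + 0.0:7.3f}" for x in line))
```

The second and third rows of the result are negatives of each other, because the two v1 columns always sum to 1. The matrix is therefore singular, which is one way to see that this encoding cannot be right.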
This is not among the possible correct encodings. An example of a
correct encoding is the following:
3.2 0 1 0 0
3.1 0 1 0 0
3.3 0 0 1 0
3.4 0 0 1 0
3.2 0 0 0 1
3.3 0 0 0 1
3.3 0 0 0 0
3.2 0 0 0 0
2.8 1 1 0 0
2.9 1 1 0 0
3.3 1 0 1 0
3.0 1 0 1 0
3.1 1 0 0 1
3.2 1 0 0 1
3.2 1 0 0 0
3.1 1 0 0 0
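The correct encoding above can be generated mechanically: each variable gets one 0/1 column per distinct value except for one reference value, which maps to all zeros. The following is a small Python sketch of that idea, not PSPP code. The helper name dummy_code and the choice of reference values (1 for v1, 4 for v2) are assumptions picked to match the table above; other choices of reference value give other equally valid encodings.

```python
def dummy_code(value, levels, reference):
    """Return one 0/1 indicator per level, skipping the reference level."""
    return [1 if value == level else 0 for level in levels if level != reference]


# (v0, v1, v2) cases from the syntax above.
cases = [
    (3.2, 1, 1), (3.1, 1, 1), (3.3, 1, 2), (3.4, 1, 2),
    (3.2, 1, 3), (3.3, 1, 3), (3.3, 1, 4), (3.2, 1, 4),
    (2.8, 2, 1), (2.9, 2, 1), (3.3, 2, 2), (3.0, 2, 2),
    (3.1, 2, 3), (3.2, 2, 3), (3.2, 2, 4), (3.1, 2, 4),
]

V1_LEVELS = [1, 2]
V2_LEVELS = [1, 2, 3, 4]

# v1 contributes 2 - 1 = 1 column, v2 contributes 4 - 1 = 3 columns.
encoded = [
    [v0]
    + dummy_code(v1, V1_LEVELS, reference=1)
    + dummy_code(v2, V2_LEVELS, reference=4)
    for v0, v1, v2 in cases
]

for row in encoded:
    print(" ".join(str(x) for x in row))
```

With this scheme every value of v2 gets a distinct pattern, so no two values collapse onto the same row of zeros.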
The problem happens in a call to categoricals_get_binary_by_subscript (),
called by get_val (), called by covariance_accumulate_pass2 (). To see
the problem, it may be easiest to consider the first case:
3.2 1 1
For this case, get_val is first called with i equal to 0, which is
fine. Then, get_val is called with i equal to 1, which causes it to call
categoricals_get_binary_by_subscript (cov->categoricals, 0, c). Inside
categoricals_get_binary_by_subscript, var is v1 (OK), val is 1 (OK),
which matches the value shown by categoricals_get_value_by_subscript,
so the function returns 1 (which could be OK, depending on the
encoding we want).
The next call to categoricals_get_binary_by_subscript shows the
problem. subscript is now 1, which causes var to be v1. val is then 2,
which does not match the second value in the case, so the function
returns 0. This by itself could be fine, but the variable we want on
this second call is now v2. So you can see that sometimes, this
function will return correctly, but sometimes it may not, and we can
see this inconsistency later: The values of 3 and 4 for v2 both end up
being mapped to (0 0) during computation of the covariance, because
covariance_accumulate_pass2 stops asking for binary values at cov->dim
- 1, which is 4. So: get_val gets *two* values for v1 (it should get
only one), and *two* values for v2 (it should get three).
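The mismatch can be imitated in a few lines. The sketch below is Python, not the PSPP source, and every name in it is hypothetical. It assumes only what is described above: the categoricals side lays out one subscript per (variable, value) pair, and the covariance side asks for just the first four subscripts.

```python
# Distinct values of each categorical variable in the example data.
LEVELS = {"v1": [1, 2], "v2": [1, 2, 3, 4]}

# Assumed flat layout on the categoricals side: one subscript per
# (variable, value) pair, i.e. N columns per variable, 6 in total.
SUBSCRIPTS = [(var, val) for var in ("v1", "v2") for val in LEVELS[var]]


def binary_by_subscript(case, subscript):
    """Return 1 if the case holds the value that this subscript stands for."""
    var, val = SUBSCRIPTS[subscript]
    return 1 if case[var] == val else 0


def encode_as_covariance_did(case):
    """Ask only for as many columns as an N - 1 encoding would have."""
    n_requested = sum(len(levels) - 1 for levels in LEVELS.values())  # 1 + 3 = 4
    return [binary_by_subscript(case, s) for s in range(n_requested)]


print(encode_as_covariance_did({"v1": 1, "v2": 1}))
print(encode_as_covariance_did({"v1": 1, "v2": 3}))
print(encode_as_covariance_did({"v1": 1, "v2": 4}))
```

Subscripts 0 and 1 both land on v1, and only subscripts 2 and 3 reach v2, so the values 3 and 4 of v2 are never asked about and produce identical rows.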
Hence, for the first case, the columns for v1 and v2 are mapped to (1
0 1 0). That would be fine, if we had a mapping as follows:
variable  value  ------>  binary encoded
v1        1      ------>  1
v1        2      ------>  0
v2        1      ------>  0 1 0
v2        2      ------>  (something else)
v2        3      ------>  (something else)
v2        4      ------>  (something else)
where exactly one of those "something else" entries is (0 0 0), one is
(1 0 0) and one is (0 0 1). But this can't happen, because, under the
current system, 3 and 4 are ignored, and consequently both map to (0 0).
To fix the problem, covariance.c must properly interpret the encoding
used by categoricals.c. This seems to be mostly a problem in get_val,
but it may be elsewhere around covariance.c.
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
Attachment: cat.patch (Text Data)
Attachment: signature.asc (Digital signature)