ENH: added functionality nancov to numpy #5698
Conversation
numpy/lib/nanfunctions.py (outdated)

X = np.concatenate((X, y), axis)

# Remove observations with nans from the set
nan_observations = np.isnan(X.sum(axis=axis))
Is this the correct thing to do? I would think that if, e.g., we have a (3, 5) array with a single NaN in position [0, 2], we would not use the third observation to compute covariances involving the first variable, but we would still use it when computing, e.g., the covariance of the second and third variables. Here you are simply getting rid of the whole column, which doesn't seem quite right.
Now that you point that out, no, it's not correct. Looking at the problem, though, I don't see a great way to accomplish this except for iterating over every combination of the rows, since each combination could, in theory, have a unique set of NaNs. However, that doesn't seem terribly elegant.
I'd tend to use stacked outer products and apply nansum on the first index (nanmean would be ideal, but it doesn't have ddof). That is, if the rows hold the observations, form a_{ijk} = v_{ij} * v_{ik} and nansum over i. You can also demean the observations (nanmean), set the nans to zero, and do the normal matrix multiplication. The counts for each entry can be gotten by matrix-multiplying the mask matrix ~isnan(obs) in the same way as the zero-filled observations. So let obs (observations in columns) be the demeaned and zeroed observations, and msk the corresponding mask; then I think dot(obs, obs.T) / (dot(msk, msk.T) - ddof) will do it. Need to be careful about flagging negative or zero values in the denominator though; maybe set those entries to nan and issue a warning.
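For concreteness, a minimal sketch of that mask/matmul recipe (the name pairwise_nancov and the choice to demean each variable by its own nanmean are illustrative assumptions, not code from this PR):

```python
import numpy as np

def pairwise_nancov(X, ddof=1):
    """Pairwise covariance of the rows of X (variables in rows,
    observations in columns), ignoring NaNs pair by pair."""
    X = np.asarray(X, dtype=float)
    msk = ~np.isnan(X)                     # valid-observation mask
    # Demean each variable by its own nanmean, then zero out the NaNs.
    Xd = np.where(msk, X - np.nanmean(X, axis=1, keepdims=True), 0.0)
    # Pairwise counts of jointly valid observations: dot(msk, msk.T).
    counts = msk.astype(float) @ msk.T.astype(float)
    denom = counts - ddof
    # Flag non-positive denominators as suggested above; a real
    # implementation might also issue a warning here.
    denom = np.where(denom > 0, denom, np.nan)
    return (Xd @ Xd.T) / denom
```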
There isn't a single right answer here; there are multiple reasonable definitions of the covariance matrix in the presence of missing values. E.g., including all pairs that contain no nan gives you the best estimate of the individual entries, but the matrix may then fail to be positive definite. R provides multiple methods, see the use= argument: https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
Given what @njsmith pointed out, the current implementation is equivalent to "na.or.complete", but what @jaimefrio was pointing towards was "pairwise.complete.obs". The remaining R options would not make sense for a NaN-dedicated function. Is it best to go for both options, or pick one? It would be pretty straightforward to warn the user if the matrix was not positive semi-definite in the "pairwise.complete.obs" case.
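As an aside, the two behaviours can already be contrasted with existing NumPy tools: np.ma.cov handles masked values pairwise, and on the example matrix used in the doctest further down it matches the pairwise=True output shown there. A small demonstration (the variable names are illustrative):

```python
import numpy as np

X = np.array([[ 1.,  2., 10., np.nan,   5.],
              [ 4., 10., 13., np.nan,   8.],
              [-2., -4., np.nan, np.nan, -10.]])

# "na.or.complete": drop every observation (column) containing any NaN.
complete = X[:, ~np.isnan(X).any(axis=0)]
print(np.cov(complete))

# "pairwise.complete.obs": np.ma.cov excludes masked values pair by pair.
print(np.ma.cov(np.ma.masked_invalid(X)))
```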
Rejigger tests by closing and opening.
Again.
Note that the weighted covariance work #4960 is going to complicate this.
☔ The latest upstream changes (presumably #7421) made this pull request unmergeable. Please resolve the merge conflicts.
b887488 to f0305e5
3f846e6 to 762581d
I updated this pull request so that it implements the aweights and fweights that were added to cov. Based on the suggestions from earlier, it implements two methods: one that eliminates an observation from all variables if any variable contains a nan at that observation, and a second that compares all of the variables pairwise, eliminating only the observations containing nan values in the pair. It also drops the bias option in favor of solely ddof. Please let me know if you have any comments on the current status.
☔ The latest upstream changes (presumably #7099) made this pull request unmergeable. Please resolve the merge conflicts.
58b94a9 to 10136ac
@charris, I know this isn't a huge priority, but what are the chances of getting this incorporated into 1.12?
☔ The latest upstream changes (presumably #8240) made this pull request unmergeable. Please resolve the merge conflicts.
@dfreese I just branched 1.12.x; sorry I overlooked your inquiry. However, there should be more time to look into PRs once 1.12.0 is out.
10136ac to 24eb164
@charris Not a problem. Thanks for your work. I've updated the PR for 1.13.
☔ The latest upstream changes (presumably #7742) made this pull request unmergeable. Please resolve the merge conflicts.
24eb164 to 90df5d9
90df5d9 to 3a13062
☔ The latest upstream changes (presumably #8446) made this pull request unmergeable. Please resolve the merge conflicts.
3a13062 to 9d20b3c
dd70ab4 to 6fc9ff4
6fc9ff4 to 6bec97d
@eric-wieser it looks like you had recently taken a look at nanfunctions.py. Would you be willing to take a look at this and see if you have any comments on the current state?
Closing as there hasn't been any movement on this.
I think this slipped under my radar. I might be able to take a look.
"incompatible numbers of samples and aweights") | ||
if any(aweights < 0): | ||
raise ValueError( | ||
"aweights cannot be negative") |
I'd be a lot happier if all this could be extracted into a helper function shared by cov and nancov.
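Such a helper might look something like the sketch below. This is hypothetical: the name _validate_cov_weights, its signature, and the use of RuntimeError are my assumptions; only the message strings mirror the checks quoted above.

```python
import numpy as np

def _validate_cov_weights(w, n_samples, name):
    """Validate an fweights/aweights argument against the number of
    observations. Hypothetical helper, not actual NumPy code."""
    w = np.asarray(w, dtype=float)
    if w.ndim > 1:
        raise RuntimeError(f"cannot handle multidimensional {name}")
    if w.shape[0] != n_samples:
        raise RuntimeError(f"incompatible numbers of samples and {name}")
    if np.any(w < 0):
        raise ValueError(f"{name} cannot be negative")
    return w
```

Both cov and nancov could then call it once per weights argument instead of duplicating the checks.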
>>> X = np.array([[ 1., 2., 10., np.nan, 5.],
... [ 4., 10., 13., np.nan, 8.],
... [-2., -4., np.nan, np.nan, -10.]])
super-nit: you're missing a space to make this align
numpy/lib/nanfunctions.py (outdated)

>>> np.nancov(X, pairwise=True)
array([[ 16.33333333,  12.16666667,  -8.66666667],
       [ 12.16666667,  14.25      ,  -5.33333333],
       [ -8.66666667,  -5.33333333,  17.33333333]])
Can you pick an example that avoids a factor of 1/3 here?
Adds a function nancov to nanfunctions in numpy that mimics cov while ignoring nan values. It implements two ways of ignoring nans: the default eliminates a nan observation across all variables, while the pairwise option eliminates nan observations only within each pair of variables. The bias option was dropped, leaving ddof as the sole method of handling bias in the estimate. fweights and aweights are supported.
6bec97d to 7620967
Hello all; I see this thread hasn't had any movement. Is there anything outstanding that's blocking a merge?
I think the ship has sailed on this; we don't really want to add any more nan* functions to NumPy. Closing.
For those who stumble upon this, I was able to get most of the benefit of this by using existing nan functions:
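(The original snippet was not preserved in this copy of the thread; below is a plausible reconstruction of that kind of approach, not necessarily what was posted. The helper name nancov_complete is made up.)

```python
import numpy as np

def nancov_complete(X, ddof=1):
    """Covariance that first drops every observation (column)
    containing a NaN, i.e. R's "na.or.complete" behaviour."""
    X = np.asarray(X, dtype=float)
    return np.cov(X[:, ~np.isnan(X).any(axis=0)], ddof=ddof)
```

For the pairwise behaviour, np.ma.cov on a masked array (as shown earlier in the thread) covers most of the remaining cases.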
Sorry, I was a bit brief, and should have started with: thank you @dfreese for your contribution. You opened this PR when NumPy development/maintenance was in fairly poor shape back in 2015; sorry it took so long to get feedback. In addition to what I wrote above, https://github.com/pydata/bottleneck has other nan-aware functions.
No worries. I had enough review to go on; it just kept getting bumped down the priority list (I didn't actually need it). The decision to limit the API surface of numpy makes a lot of sense.
Implemented nancov and associated tests: nancov ignores all observations passed to it that contain nans and then calculates the covariance.