ENH: added functionality nancov to numpy by dfreese · Pull Request #5698 · numpy/numpy · GitHub

Closed · wants to merge 1 commit

Conversation

@dfreese commented Mar 20, 2015

Implemented nancov and associated tests. nancov discards any observation that contains NaNs, then computes the covariance.

X = np.concatenate((X, y), axis)

# Remove observations with nans from the set
nan_observations = np.isnan(X.sum(axis=axis))
Member:

Is this the correct thing to do?

I would think that if, e.g., we have a (3, 5) array with a single NaN in position [0, 2], we would not use the third observation to compute covariances involving the first variable, but we would for, e.g., the covariance of the second and third variables. Here you are simply getting rid of the whole column, which doesn't seem quite right.
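To make the distinction concrete, here is a small illustration (not from the PR; the array values are made up) of how dropping whole columns differs from a pairwise estimate:

```python
import numpy as np

# Hypothetical (3, 5) data with a single NaN, at position [0, 2].
X = np.array([[1., 2., np.nan, 4., 5.],
              [2., 1., 3., 4., 6.],
              [5., 3., 2., 1., 0.]])

# Dropping every observation (column) that contains any NaN
# discards the third column for *all* variables:
complete = X[:, ~np.isnan(X).any(axis=0)]   # shape (3, 4)
listwise = np.cov(complete)

# But variables 1 and 2 have no NaNs at all, so their covariance
# could be estimated from all five observations:
pairwise_12 = np.cov(X[1], X[2])[0, 1]

print(listwise[1, 2], pairwise_12)  # the two estimates differ
```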

Author (@dfreese):

Now that you point that out, no, it's not correct. Looking at the problem, though, I don't see a great way to accomplish this except for iterating over every combination of the rows, since each combination could, in theory, have a unique set of NaNs. However, that doesn't seem terribly elegant.

Member:

I'd tend to use stacked outer products and apply nansum on the first index (nanmean would be ideal, but it doesn't have ddof). That is, if the rows hold the observations, form a_{ijk} = v_{ij} * v_{ik} and nansum over i. Alternatively, you can demean the observations (nanmean), set the NaNs to zero, and do the normal matrix multiplication. The counts for each entry can be gotten by matrix-multiplying the mask matrix ~isnan(obs) in the same way as the zero-filled observations. So let obs (observations in columns) be the demeaned and zero-filled observations, and msk the corresponding mask; then I think dot(obs, obs.T) / (dot(msk, msk.T) - ddof) will do it. Need to be careful about flagging negative or zero values in the denominator, though; maybe set those entries to NaN and issue a warning.
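A rough sketch of the masked-matrix-multiplication idea described above (the name pairwise_nancov is hypothetical, not part of the PR):

```python
import numpy as np

def pairwise_nancov(X, ddof=1):
    """Sketch of the suggestion above: variables in rows,
    observations in columns, as in np.cov."""
    X = np.asarray(X, dtype=float)
    msk = ~np.isnan(X)
    # Demean each variable with nanmean, then zero-fill the NaNs
    # so they drop out of the dot products.
    obs = X - np.nanmean(X, axis=1, keepdims=True)
    obs[~msk] = 0.0
    # Pairwise counts of jointly valid observations.
    counts = msk.astype(float) @ msk.T.astype(float)
    denom = counts - ddof
    with np.errstate(divide='ignore', invalid='ignore'):
        cov = (obs @ obs.T) / denom
    # Flag non-positive denominators, as suggested above.
    cov[denom <= 0] = np.nan
    return cov
```

With no NaNs present this reduces to np.cov(X, ddof=1); with NaNs, the diagonal matches np.nanvar(..., ddof=1). Note it demeans each variable over all of its own valid observations, not over the pairwise-complete subsets, which is one of several reasonable conventions.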

Member:

There are multiple reasonable definitions of the covariance matrix in the presence of missing values. E.g., including all pairs that contain no NaN gives the best estimate of the individual entries, but the matrix may then fail to be positive definite. R provides multiple methods; see the use= argument:
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html

Author (@dfreese):

Given what @njsmith pointed out, the current implementation is equivalent to "na.or.complete", whereas what @jaimefrio was pointing towards is "pairwise.complete.obs". The remaining R options would not make sense for a NaN-dedicated function. Is it best to go for both options, or pick one? It would be pretty straightforward to warn the user if the matrix was not positive semi-definite in the "pairwise.complete.obs" case.
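As a toy illustration of why such a warning matters (constructed for this discussion, not from the PR): with pairwise-complete estimation each pair can be estimated on a different subset of observations, and the resulting matrix need not be positive semi-definite.

```python
import numpy as np

def pairwise_corr(X):
    # Correlation matrix where each pair uses only the rows in which
    # both variables are finite ("pairwise.complete.obs" style).
    n = X.shape[1]
    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            r = np.corrcoef(X[ok, i], X[ok, j])[0, 1]
            C[i, j] = C[j, i] = r
    return C

# Three variables; each pair overlaps on exactly two observations,
# engineered to give pairwise correlations +1, +1, and -1.
X = np.array([[1., 1., np.nan],
              [2., 2., np.nan],
              [np.nan, 1., 1.],
              [np.nan, 2., 2.],
              [1., np.nan, 2.],
              [2., np.nan, 1.]])

C = pairwise_corr(X)
# C is [[1, 1, -1], [1, 1, 1], [-1, 1, 1]]; its determinant is
# negative, so it has a negative eigenvalue and is not PSD.
```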

@charris (Member) commented Mar 22, 2015

Rejigger tests by closing and opening.

@charris closed this Mar 22, 2015
@charris reopened this Mar 22, 2015
@charris (Member) commented Mar 22, 2015

Again.

@charris closed this Mar 22, 2015
@charris reopened this Mar 22, 2015
@charris (Member) commented May 4, 2015

Note that the weighted covariance work #4960 is going to complicate this.

@homu (Contributor) commented Mar 26, 2016

☔ The latest upstream changes (presumably #7421) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 3 times, most recently from b887488 to f0305e5 on July 12, 2016
@dfreese closed this Jul 12, 2016
@dfreese reopened this Jul 12, 2016
@dfreese force-pushed the feature/nancov branch 2 times, most recently from 3f846e6 to 762581d on July 12, 2016
@dfreese (Author) commented Jul 12, 2016

I updated this pull request so that it implements the aweights and fweights that were added to cov. Based on the earlier suggestions, it implements two methods: the default drops an observation from all variables if any variable contains a NaN, while the pairwise option compares variables pairwise, eliminating only the observations containing NaN values in that pair. It also drops the bias option in favor of ddof alone.

Please let me know if you have any comments on the current status.

@homu (Contributor) commented Sep 2, 2016

☔ The latest upstream changes (presumably #7099) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 2 times, most recently from 58b94a9 to 10136ac on September 6, 2016
@dfreese (Author) commented Oct 2, 2016

@charris, I know this isn't a huge priority, but what are the chances of getting this incorporated into 1.12?

@homu (Contributor) commented Nov 5, 2016

☔ The latest upstream changes (presumably #8240) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Nov 5, 2016

@dfreese I just branched 1.12.x; sorry I overlooked your inquiry. However, there should be more time to look into PRs once 1.12.0 is out.

@dfreese (Author) commented Nov 7, 2016

@charris Not a problem, thanks for your work. I've updated the PR for 1.13.

@homu (Contributor) commented Nov 14, 2016

☔ The latest upstream changes (presumably #7742) made this pull request unmergeable. Please resolve the merge conflicts.

@homu (Contributor) commented Feb 22, 2017

☔ The latest upstream changes (presumably #8446) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 4 times, most recently from dd70ab4 to 6fc9ff4 on August 28, 2017
@dfreese (Author) commented Sep 1, 2017

@eric-wieser it looks like you had recently taken a look at nanfunctions.py. Would you be willing to take a look at this and see if you have any comments on the current state?

@dfreese (Author) commented Feb 17, 2018

Closing as there hasn't been any movement on this.

@dfreese closed this Feb 17, 2018
@eric-wieser (Member):

I think this slipped under my radar. I might be able to take a look.

"incompatible numbers of samples and aweights")
if any(aweights < 0):
raise ValueError(
"aweights cannot be negative")
Member:

I'd be a lot happier if all this could be extracted into a helper function shared by cov and nancov.


>>> X = np.array([[ 1., 2., 10., np.nan, 5.],
... [ 4., 10., 13., np.nan, 8.],
... [-2., -4., np.nan, np.nan, -10.]])
Member:

super-nit: you're missing a space to make this align

>>> np.nancov(X, pairwise=True)
array([[ 16.33333333, 12.16666667, -8.66666667],
[ 12.16666667, 14.25 , -5.33333333],
[ -8.66666667, -5.33333333, 17.33333333]])
Member:

Can you pick an example that avoids a factor of 1/3 here?

@eric-wieser reopened this Feb 17, 2018
Adds a function nancov to nanfunctions in numpy that mimics cov, while
ignoring nan values.  Implements two different ways for ignoring nans.
The default eliminates nan observations across all variables, while the
pairwise option only eliminates nan observations in respective pairs of
variables.  The bias option was dropped leaving ddof as the primary
method of handling bias in the estimate.  fweights and aweights are
supported.
@chrism2671:

Hello all; I see this thread hasn't had any movement. Is there anything outstanding that's blocking a merge?

@rgommers (Member):

I think the ship has sailed on this, we don't really want any more nan* functions, see #13198 (comment).

Closing.

@rgommers closed this Jun 17, 2020
@chrism2671:

For those who stumble upon this, I was able to get most of the benefit of this by using existing nan functions:

import numpy as np

def myCorr(x, y):
    # NaN-aware Pearson correlation built from existing nan* functions
    sigma_x_y = np.nanstd(x) * np.nanstd(y)
    covariance = np.nanmean((x - np.nanmean(x)) * (y - np.nanmean(y)))
    return covariance / sigma_x_y
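A quick sanity check of the helper above (repeated here so the snippet is self-contained): on NaN-free inputs it agrees with np.corrcoef. Note that when x and y have different NaN patterns, the marginal nanstd terms use a different subset of the data than the cross term, so it is only an approximation to a pairwise-complete correlation.

```python
import numpy as np

def myCorr(x, y):
    # As in the snippet above.
    sigma_x_y = np.nanstd(x) * np.nanstd(y)
    covariance = np.nanmean((x - np.nanmean(x)) * (y - np.nanmean(y)))
    return covariance / sigma_x_y

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# With no NaNs it reduces to the ordinary Pearson correlation.
assert np.isclose(myCorr(x, y), np.corrcoef(x, y)[0, 1])
```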

@rgommers (Member):

Sorry, I was a bit brief. And should have started with: thank you @dfreese for your contribution. You opened this PR when NumPy development/maintenance was in fairly poor shape back in 2015. Sorry it took so long to get feedback.

In addition to what I wrote above, https://github.com/pydata/bottleneck has other nan* functions that are not in NumPy, and would be a more appropriate place to add more.

@dfreese (Author) commented Jun 17, 2020

No worries. I had enough review to go on; it just kept getting bumped down the priority list (I didn't actually need it). The decision to limit the API surface of numpy makes a lot of sense.

@dfreese deleted the feature/nancov branch June 17, 2020