ENH: added functionality nancov to numpy by dfreese · Pull Request #5698 · numpy/numpy · GitHub

Closed · wants to merge 1 commit

Conversation

@dfreese commented Mar 20, 2015

Implemented nancov and associated tests. nancov discards any observation that contains NaNs, then computes the covariance.

X = np.concatenate((X, y), axis)

# Remove observations with nans from the set
nan_observations = np.isnan(X.sum(axis=axis))
Member:

Is this the correct thing to do?

I would think that if, e.g., we have a (3, 5) array with a single NaN in position [0, 2], we would not use the third observation to compute covariances involving the first variable, but we would for, e.g., the covariance of the second and third variables. Here you are simply getting rid of the whole column, which doesn't seem quite right.
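To make the distinction concrete, here is a small illustration (not from the PR; the array values are made up) of how dropping whole columns differs from a pairwise estimate:

```python
import numpy as np

# Hypothetical (3, 5) data with a single NaN, at position [0, 2].
X = np.array([[1., 2., np.nan, 4., 5.],
              [2., 1., 3., 4., 6.],
              [5., 3., 2., 1., 0.]])

# Dropping every observation (column) that contains any NaN
# discards the third column for *all* variables:
complete = X[:, ~np.isnan(X).any(axis=0)]   # shape (3, 4)
listwise = np.cov(complete)

# But variables 1 and 2 have no NaNs at all, so their covariance
# could be estimated from all five observations:
pairwise_12 = np.cov(X[1], X[2])[0, 1]

print(listwise[1, 2], pairwise_12)  # the two estimates differ
```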

Author (@dfreese):

Now that you point that out, no, it's not correct. Looking at the problem, though, I don't see a great way to accomplish this except for iterating over every combination of the rows, since each combination could, in theory, have a unique set of NaNs. However, that doesn't seem terribly elegant.

Member:

I'd tend to use stacked outer products and apply nansum on the first index (nanmean would be ideal, but it doesn't have ddof). That is, if the rows hold the observations, form a_{ijk} = v_{ij} * v_{ik} and nansum over i. Alternatively, you can demean the observations (nanmean), set the NaNs to zero, and do the normal matrix multiplication. The counts for each entry can be gotten by matrix-multiplying the mask matrix ~isnan(obs) in the same way as the zero-filled observations. So let obs (observations in columns) be the demeaned and zero-filled observations, and msk the corresponding mask; then I think dot(obs, obs.T) / (dot(msk, msk.T) - ddof) will do it. Need to be careful about flagging negative or zero values in the denominator, though; maybe set those entries to NaN and issue a warning.
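A rough sketch of the masked-matrix-multiplication idea described above (the name pairwise_nancov is hypothetical, not part of the PR):

```python
import numpy as np

def pairwise_nancov(X, ddof=1):
    """Sketch of the suggestion above: variables in rows,
    observations in columns, as in np.cov."""
    X = np.asarray(X, dtype=float)
    msk = ~np.isnan(X)
    # Demean each variable with nanmean, then zero-fill the NaNs
    # so they drop out of the dot products.
    obs = X - np.nanmean(X, axis=1, keepdims=True)
    obs[~msk] = 0.0
    # Pairwise counts of jointly valid observations.
    counts = msk.astype(float) @ msk.T.astype(float)
    denom = counts - ddof
    with np.errstate(divide='ignore', invalid='ignore'):
        cov = (obs @ obs.T) / denom
    # Flag non-positive denominators, as suggested above.
    cov[denom <= 0] = np.nan
    return cov
```

With no NaNs present this reduces to np.cov(X, ddof=1); with NaNs, the diagonal matches np.nanvar(..., ddof=1). Note it demeans each variable over all of its own valid observations, not over the pairwise-complete subsets, which is one of several reasonable conventions.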

Member:

There are multiple reasonable definitions of the covariance matrix in the presence of missing values. E.g., including all pairs that contain no NaN gives the best estimate of the individual entries, but the matrix may then fail to be positive definite. R provides multiple methods; see the use= argument:
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html

Author (@dfreese):

Given what @njsmith pointed out, the current implementation is equivalent to "na.or.complete", whereas what @jaimefrio was pointing towards is "pairwise.complete.obs". The remaining R options would not make sense for a NaN-dedicated function. Is it best to go for both options, or pick one? It would be pretty straightforward to warn the user if the matrix was not positive semi-definite in the "pairwise.complete.obs" case.
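As a toy illustration of why such a warning matters (constructed for this discussion, not from the PR): with pairwise-complete estimation each pair can be estimated on a different subset of observations, and the resulting matrix need not be positive semi-definite.

```python
import numpy as np

def pairwise_corr(X):
    # Correlation matrix where each pair uses only the rows in which
    # both variables are finite ("pairwise.complete.obs" style).
    n = X.shape[1]
    C = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            ok = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            r = np.corrcoef(X[ok, i], X[ok, j])[0, 1]
            C[i, j] = C[j, i] = r
    return C

# Three variables; each pair overlaps on exactly two observations,
# engineered to give pairwise correlations +1, +1, and -1.
X = np.array([[1., 1., np.nan],
              [2., 2., np.nan],
              [np.nan, 1., 1.],
              [np.nan, 2., 2.],
              [1., np.nan, 2.],
              [2., np.nan, 1.]])

C = pairwise_corr(X)
# C is [[1, 1, -1], [1, 1, 1], [-1, 1, 1]]; its determinant is
# negative, so it has a negative eigenvalue and is not PSD.
```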

@charris (Member) commented Mar 22, 2015

Rejigger tests by closing and opening.

@charris closed this Mar 22, 2015
@charris reopened this Mar 22, 2015
@charris (Member) commented Mar 22, 2015

Again.

@charris closed this Mar 22, 2015
@charris reopened this Mar 22, 2015
@charris (Member) commented May 4, 2015

Note that the weighted covariance work #4960 is going to complicate this.

@homu (Contributor) commented Mar 26, 2016

☔ The latest upstream changes (presumably #7421) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 3 times, most recently from b887488 to f0305e5 on July 12, 2016
@dfreese closed this Jul 12, 2016
@dfreese reopened this Jul 12, 2016
@dfreese force-pushed the feature/nancov branch 2 times, most recently from 3f846e6 to 762581d on July 12, 2016
@dfreese (Author) commented Jul 12, 2016

I updated this pull request so that it implements the aweights and fweights that were added to cov. Based on the earlier suggestions, it implements two methods: the default drops an observation from all variables if any variable contains a NaN, while the pairwise option compares variables pairwise, eliminating only the observations containing NaN values in that pair. It also drops the bias option in favor of ddof alone.

Please let me know if you have any comments on the current status.

@homu (Contributor) commented Sep 2, 2016

☔ The latest upstream changes (presumably #7099) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 2 times, most recently from 58b94a9 to 10136ac on September 6, 2016
@dfreese (Author) commented Oct 2, 2016

@charris, I know this isn't a huge priority, but what are the chances of getting this incorporated into 1.12?

@homu (Contributor) commented Nov 5, 2016

☔ The latest upstream changes (presumably #8240) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Nov 5, 2016

@dfreese I just branched 1.12.x; sorry I overlooked your inquiry. However, there should be more time to look into PRs once 1.12.0 is out.

@dfreese (Author) commented Nov 7, 2016

@charris Not a problem, thanks for your work. I've updated the PR for 1.13.

@homu (Contributor) commented Nov 14, 2016

☔ The latest upstream changes (presumably #7742) made this pull request unmergeable. Please resolve the merge conflicts.

@homu (Contributor) commented Feb 22, 2017

☔ The latest upstream changes (presumably #8446) made this pull request unmergeable. Please resolve the merge conflicts.

@dfreese force-pushed the feature/nancov branch 4 times, most recently from dd70ab4 to 6fc9ff4 on August 28, 2017
@dfreese (Author) commented Sep 1, 2017

@eric-wieser it looks like you had recently taken a look at nanfunctions.py. Would you be willing to take a look at this and see if you have any comments on the current state?

@dfreese (Author) commented Feb 17, 2018

Closing as there hasn't been any movement on this.

@dfreese closed this Feb 17, 2018
@eric-wieser (Member):

I think this slipped under my radar. I might be able to take a look.

"incompatible numbers of samples and aweights")
if any(aweights < 0):
raise ValueError(
"aweights cannot be negative")
Member:

I'd be a lot happier if all this could be extracted into a helper function shared by cov and nancov.


>>> X = np.array([[ 1., 2., 10., np.nan, 5.],
... [ 4., 10., 13., np.nan, 8.],
... [-2., -4., np.nan, np.nan, -10.]])
Member:

super-nit: you're missing a space to make this align

>>> np.nancov(X, pairwise=True)
array([[ 16.33333333, 12.16666667, -8.66666667],
[ 12.16666667, 14.25 , -5.33333333],
[ -8.66666667, -5.33333333, 17.33333333]])
Member:

Can you pick an example that avoids a factor of 1/3 here?

@eric-wieser reopened this Feb 17, 2018
Adds a function nancov to nanfunctions in numpy that mimics cov, while
ignoring nan values.  Implements two different ways for ignoring nans.
The default eliminates nan observations across all variables, while the
pairwise option only eliminates nan observations in respective pairs of
variables.  The bias option was dropped leaving ddof as the primary
method of handling bias in the estimate.  fweights and aweights are
supported.
@chrism2671:

Hello all; I see this thread hasn't had any movement. Is there anything outstanding that's blocking a merge?

@rgommers (Member):

I think the ship has sailed on this, we don't really want any more nan* functions, see #13198 (comment).

Closing.

@rgommers closed this Jun 17, 2020
@chrism2671:

For those who stumble upon this, I was able to get most of the benefit of this by using existing nan functions:

import numpy as np

def myCorr(x, y):
    # NaN-aware Pearson correlation built from existing nan* functions
    sigma_x_y = np.nanstd(x) * np.nanstd(y)
    covariance = np.nanmean((x - np.nanmean(x)) * (y - np.nanmean(y)))
    return covariance / sigma_x_y
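A quick sanity check of the helper above (repeated here so the snippet is self-contained): on NaN-free inputs it agrees with np.corrcoef. Note that when x and y have different NaN patterns, the marginal nanstd terms use a different subset of the data than the cross term, so it is only an approximation to a pairwise-complete correlation.

```python
import numpy as np

def myCorr(x, y):
    # As in the snippet above.
    sigma_x_y = np.nanstd(x) * np.nanstd(y)
    covariance = np.nanmean((x - np.nanmean(x)) * (y - np.nanmean(y)))
    return covariance / sigma_x_y

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# With no NaNs it reduces to the ordinary Pearson correlation.
assert np.isclose(myCorr(x, y), np.corrcoef(x, y)[0, 1])
```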

@rgommers (Member):

Sorry, I was a bit brief. And should have started with: thank you @dfreese for your contribution. You opened this PR when NumPy development/maintenance was in fairly poor shape back in 2015. Sorry it took so long to get feedback.

In addition to what I wrote above, https://github.com/pydata/bottleneck has other nan* functions that are not in NumPy, and would be a more appropriate place to add more.

@dfreese (Author) commented Jun 17, 2020

No worries. I had enough review to go on; it just kept getting bumped down the priority list (I didn't actually need it). The decision to limit the API surface of numpy makes a lot of sense.

@dfreese deleted the feature/nancov branch June 17, 2020