[MRG+2] Incorrect implementation of explained_variance_ in PCA #9105

qinhanmin2014 · 2017-06-11T02:22:19Z

Reference Issue

Fixes #7699
Mentioned in #8541 (the 5th comment by IainStrachan on 11 May)
Mentioned in #8544 (the 6th comment by IainStrachan on 11 May)

What does this implement/fix? Explain your changes.

PCA.explained_variance_ is implemented incorrectly. The test is also wrong, so the mistake goes undetected.

The result of explained_variance_ is wrong.

result from python:

result from R:

result according to the definition:

Reason for the incorrect implementation.

(1) In the code, explained_variance_ = (S ** 2) / n_samples should be explained_variance_ = (S ** 2) / (n_samples - 1). When we use SVD to calculate PCA, S ** 2 gives the eigenvalues of A'A (A' means the transpose of A, and A is the centered data), but what we need are the eigenvalues of A'A / (n_samples - 1), which is the covariance matrix of A.
(2) In the test, we simply use np.var to calculate the variance. That's incorrect, because np.var defaults to ddof=0. We should use np.cov and pick the elements on the diagonal, or set ddof=1.
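
To make point (1) concrete, here is a minimal NumPy-only sketch. It does not use the scikit-learn code itself; the random data, the seed, and the variable names `old_variance` and `new_variance` are assumptions chosen for illustration. It compares both scalings of the squared singular values against the eigenvalues of the sample covariance matrix.

```python
import numpy as np

# Illustrative data only: 50 samples, 4 features with different scales.
rng = np.random.RandomState(0)
n_samples, n_features = 50, 4
X = rng.randn(n_samples, n_features) * np.array([3.0, 2.0, 1.0, 0.5])

# Center the data; PCA via SVD works on the centered matrix A.
A = X - X.mean(axis=0)
S = np.linalg.svd(A, compute_uv=False)  # singular values, descending

old_variance = S ** 2 / n_samples        # scaling before this PR
new_variance = S ** 2 / (n_samples - 1)  # scaling proposed in this PR

# Eigenvalues of the sample covariance matrix A'A / (n_samples - 1).
# np.cov uses ddof=1 by default; rowvar=False treats columns as features.
# eigvalsh returns ascending order, so reverse it to match S.
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.allclose(new_variance, cov_eigenvalues))  # True
print(np.allclose(old_variance, cov_eigenvalues))  # False
```

The two scalings differ only by the factor (n_samples - 1) / n_samples, so the gap is small for large datasets and easy to miss in a test with loose tolerances.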

Any other comments?

I provide two versions of the new test: one based on the current test, the other based on the definition.
Please take some time to consider it. Thanks.

agramfort · 2017-06-12T07:31:07Z

sklearn/decomposition/tests/test_pca.py

+
+    # X_rpca = rpca.transform(X)
+    # assert_array_almost_equal(rpca.explained_variance_,
+    #                           expected_result, decimal=1)


uncomment or remove

@agramfort Thanks. From my perspective, the test is necessary because it is based on the original definition, so I uncommented it, but I would like to hear your advice.

agramfort · 2017-06-12T09:36:14Z

+1 for MRG when CIs are happy

amueller · 2017-06-19T00:05:51Z

This is issue #7699 right? I thought we were ok with the current implementation? I guess we changed our opinion? I don't have a strong opinion.

amueller · 2017-06-19T00:08:51Z

@qinhanmin2014 I think when looking at #7699 I checked the definition of PCA, and I don't think any of the books I looked at mentioned Bessel's correction. In particular Elements of Statistical Learning, which is my standard reference. Can you provide a reference with your definition, other than prcomp?

qinhanmin2014 · 2017-06-19T13:06:41Z

@amueller Thanks for your comments; this issue is indeed related to #7699.

(1) I think the question is not whether to use Bessel's correction. The main point is that the explained variance is supposed to be equal to the eigenvalues of the covariance matrix. I propose an additional test to verify my implementation.
(2) It seems that Elements of Statistical Learning simply uses SVD to calculate PCA and says nothing about the covariance matrix or the calculation of the variance.
(3) R and Matlab (princomp) implement it in the same way as this pull request.
(4) Many papers and books also implement it in the same way as this pull request (e.g. https://arxiv.org/abs/1404.1100 (P11-P12), Machine Learning in Action (https://github.com/pbharrin/machinelearninginaction/blob/master/Ch13/pca.py)), but I currently can't find one that implements it in the same way as sklearn.

What's more, from my perspective, even if we don't find any strong evidence, it may be better to ensure that sklearn behaves the same as the others.
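
A check along the lines of point (1) could be sketched as follows. This is not the test added in the pull request: `check_explained_variance` is a hypothetical helper written with NumPy only, and the data is random. It compares a candidate explained variance against two definitions: the sample variance (ddof=1) of the data projected onto each principal axis, and the largest eigenvalues of the covariance matrix.

```python
import numpy as np


def check_explained_variance(X, explained_variance):
    """Hypothetical helper: assert that explained_variance matches
    both definitions discussed in this thread."""
    A = X - X.mean(axis=0)
    # Principal axes are the right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    k = len(explained_variance)
    projected = A @ Vt[:k].T

    # Definition 1: sample variance (ddof=1) along each principal axis.
    np.testing.assert_allclose(
        explained_variance, np.var(projected, axis=0, ddof=1))

    # Definition 2: the k largest eigenvalues of the covariance matrix.
    eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1][:k]
    np.testing.assert_allclose(explained_variance, eigenvalues)


# Illustrative data: 100 samples, 5 correlated features.
rng = np.random.RandomState(42)
X = rng.randn(100, 5) @ rng.randn(5, 5)
S = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)

# Passes with the (n_samples - 1) scaling for the first 3 components.
check_explained_variance(X, S[:3] ** 2 / (X.shape[0] - 1))
```

Passing `S[:3] ** 2 / X.shape[0]` instead raises an `AssertionError`, which is the kind of failure the original test could not produce because it computed the reference with `np.var` and the default `ddof=0`.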

amueller · 2017-06-20T16:39:41Z

Oh wow, I misread the covariance test. Never mind, that test is clearly correct.
LGTM. Could you please add an entry to whats_new.rst in the bugfix section?

agramfort · 2017-06-20T19:49:28Z

@qinhanmin2014 please update what's new and let's merge

qinhanmin2014 · 2017-06-21T00:07:30Z

@amueller @agramfort Finished. I also reverted #7843, since those statements in the documentation are no longer needed. Thanks.

amueller · 2017-06-21T03:21:05Z

thanks :)

qinhanmin2014 added 4 commits June 11, 2017 09:57

fix pca explained_variance_

d005729

fix fit_transform

bda1de1

fix test_whitening

0a46cb3

fix IncrementalPCA

d635301

qinhanmin2014 changed the title ~~[WIP] Incorrect implementation of explained_variance_ in PCA~~ [MRG] Incorrect implementation of explained_variance_ in PCA Jun 11, 2017

agramfort reviewed Jun 12, 2017

qinhanmin2014 added 2 commits June 12, 2017 16:47

uncomment the test

5d6f266

improve test

71573fc

agramfort changed the title ~~[MRG] Incorrect implementation of explained_variance_ in PCA~~ [MRG+1] Incorrect implementation of explained_variance_ in PCA Jun 12, 2017

jnothman added this to the 0.19 milestone Jun 18, 2017

make CI green

8d13fcb

amueller changed the title ~~[MRG+1] Incorrect implementation of explained_variance_ in PCA~~ [MRG+2] Incorrect implementation of explained_variance_ in PCA Jun 20, 2017

amueller approved these changes Jun 20, 2017

qinhanmin2014 added 2 commits June 21, 2017 07:12

revert #7843 and add what's new

6496f5f

fix what's new

de70e9b

amueller merged commit 2a36ff1 into scikit-learn:master Jun 21, 2017

qinhanmin2014 deleted the my-feature-1 branch June 21, 2017 03:39

qinhanmin2014 mentioned this pull request Jun 23, 2017

[MRG+1] Incorrent implementation of noise_variance_ in PCA._fit_truncated #9108

Merged

amueller mentioned this pull request Nov 14, 2017

PCA implementation does not match Tipping and Bishop #10137

Closed

aldanor mentioned this pull request Apr 20, 2018

PCA(whiten=True): unit variances != 1 (regression in 0.19) #11001

Closed

lorentzenchr mentioned this pull request Feb 6, 2021

Mle pca implementation #19378

Merged
