[MRG+3] Collapsing PCA and RandomizedPCA by giorgiop · Pull Request #5299 · scikit-learn/scikit-learn · GitHub

[MRG+3] Collapsing PCA and RandomizedPCA #5299


Merged · 1 commit merged into scikit-learn:master on Mar 10, 2016

Conversation

giorgiop
Contributor

Fixes #5243 and #3930.

to do

  • collapse the two classes
  • old tests passing
  • integrate svd_solver=='arpack'
  • benchmark the 3 solvers and establish the best auto policy
  • fix docstrings
  • unify with IncrementalPCA by inheritance from _BasePCA; see [MRG+2] Incremental PCA #3285
  • add flip_sign param, True by default (superseded: we flip by default, without introducing a param for controlling the behaviour)
  • arpack backport updated

@giorgiop giorgiop force-pushed the pca+randomizedpca branch 3 times, most recently from b66eebf to c6b93f0 Compare September 23, 2015 12:16
@ogrisel
Member
ogrisel commented Sep 23, 2015

I have the feeling that IncrementalPCA has very little common code to share with the regular PCA / _BasePCA classes. IMO we should not try to use inheritance when there is no significant amount of code to reuse. The indirections introduced by inheritance can make the code harder to read, so it has to be justified by code reuse.

@kastnerkyle
Member

_BasePCA was originally introduced at the same time as IncrementalPCA with an eye toward refactoring PCA to share this common code; most of those methods should hold for PCA, since they were taken from there originally. If it is no longer the case that we want to try and unify the two (IncrementalPCA and PCA), it may be worth just putting _BasePCA back onto IncrementalPCA and keeping the code separate.


@ogrisel
Member
ogrisel commented Sep 23, 2015

Ah, OK, sorry for the confusion. I thought this was about the old ProbabilisticPCA, which had actually already been collapsed into PCA.

@giorgiop
Contributor Author

Inheritance from _BasePCA was smooth and is now in the branch.
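
For illustration, the shape of that pattern is roughly the following. This is a toy sketch, not the actual _BasePCA code; the class and method bodies are simplified stand-ins:

import numpy as np

class _BasePCASketch(object):
    # Everything that only needs the fitted components_ and mean_ can live
    # in the base class and be shared by PCA and IncrementalPCA.
    def transform(self, X):
        return np.dot(X - self.mean_, self.components_.T)

    def inverse_transform(self, X):
        return np.dot(X, self.components_) + self.mean_

class TinyPCA(_BasePCASketch):
    # A subclass only has to provide its own fitting strategy.
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self

With this layout, transform and inverse_transform are written once and inherited by every fitting strategy.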

@giorgiop
Contributor Author

We have to decide what to do with this behaviour:

Due to implementation subtleties of the Singular Value Decomposition (SVD),
which is used in this implementation, running it twice on the same matrix
can lead to principal components with signs flipped (change in direction).
For this reason, it is important to always use the same estimator object to
transform data in a consistent fashion.

This paragraph was in PCA's docstring. RandomizedPCA handled it by calling svd_flip in randomized_svd. We can either call it for all the svd_solver methods or for none of them.

To note: I have been working on the performance of randomized_svd in #5141 and svd_flip can be expensive because of this operation.
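
For reference, the helper that resolves the ambiguity is sklearn.utils.extmath.svd_flip. A minimal sketch of both the sign issue and the fix, on illustrative data:

import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import svd_flip

rng = np.random.RandomState(0)
X = rng.randn(50, 20)

U, S, Vt = svd(X, full_matrices=False)
# Negating a left/right singular vector pair yields an equally valid SVD,
# which is why two runs can disagree on the sign of a component.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0] *= -1
assert np.allclose((U2 * S).dot(Vt2), X)

# svd_flip enforces one deterministic sign convention on both variants.
U_f, Vt_f = svd_flip(U, Vt)
U2_f, Vt2_f = svd_flip(U2, Vt2)
assert np.allclose(U_f, U2_f) and np.allclose(Vt_f, Vt2_f)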

@giorgiop giorgiop force-pushed the pca+randomizedpca branch 2 times, most recently from 08f736f to 669f136 Compare September 30, 2015 08:52
return X_original
@deprecated("it will be removed in 0.19. Use PCA(svd_solver='randomized') "
"instead. The new implementation DOES NOT store"
"whithen components_. Apply transform to get them.")
Member

whiten

Contributor Author

Thanks.

@giorgiop
Contributor Author

I agreed with @ogrisel to call svd_flip in every method. We chose not to expose a flip_sign param, so as to keep a lighter signature for the PCA constructor.
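
A quick way to eyeball the cross-solver consistency this buys, assuming the merged svd_solver API (a hypothetical check, not from the PR):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 30)

# With svd_flip applied on every code path, all solvers follow the same
# sign convention, so the leading components should typically match closely.
for solver in ['full', 'arpack', 'randomized']:
    pca = PCA(n_components=3, svd_solver=solver, random_state=0).fit(X)
    print("%-10s %s" % (solver, pca.components_[0, :4]))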

@giorgiop
Contributor Author

Simple runtime benchmark script

import gc
from time import time

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(3000, 3000))

solver_list = ['full', 'arpack', 'randomized']

def bench_pca(X, pca):
    gc.collect()  # reduce garbage-collection noise in the timing
    t0 = time()
    pca.fit(X)
    return time() - t0

for n_comp in [10, 50, 100, 500, 1000, 2400, 2999]:
    print("\n n_components=%d" % n_comp)
    for solver in solver_list:
        print("svd_solver='%s'" % solver)
        duration = bench_pca(X, PCA(n_components=n_comp, svd_solver=solver))
        print("done in %0.3fs" % duration)

Output:

 n_components=10
svd_solver='full'
done in 17.718s
svd_solver='arpack'
done in 3.084s
svd_solver='randomized'
done in 0.588s

 n_components=50
svd_solver='full'
done in 19.997s
svd_solver='arpack'
done in 3.725s
svd_solver='randomized'
done in 1.232s

 n_components=100
svd_solver='full'
done in 18.813s
svd_solver='arpack'
done in 5.494s
svd_solver='randomized'
done in 0.611s

 n_components=500
svd_solver='full'
done in 18.712s
svd_solver='arpack'
done in 18.547s
svd_solver='randomized'
done in 2.030s

 n_components=1000
svd_solver='full'
done in 19.876s
svd_solver='arpack'
done in 63.463s
svd_solver='randomized'
done in 5.219s

 n_components=2400
svd_solver='full'
done in 19.190s
svd_solver='arpack'
done in 137.910s
svd_solver='randomized'
done in 19.167s

 n_components=2999
svd_solver='full'
done in 18.724s
svd_solver='arpack'
done in 124.373s
svd_solver='randomized'
done in 28.858s

This supports the chosen auto policy, at least with regard to runtime. We could run tests on accuracy if needed.
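
For the accuracy side, one possible sanity check is to compare each solver's reconstruction error on the same data; a sketch, not run for this PR:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(1000, 300))

for solver in ['full', 'arpack', 'randomized']:
    pca = PCA(n_components=50, svd_solver=solver, random_state=0)
    X_hat = pca.inverse_transform(pca.fit_transform(X))
    rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
    print("%-10s relative reconstruction error: %.6f" % (solver, rel_err))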

@giorgiop giorgiop changed the title [WIP] Collapsing PCA and RandomizedPCA [MRG] Collapsing PCA and RandomizedPCA Sep 30, 2015
@giorgiop
Contributor Author

exact may be a better name for the full PCA. Also let me know if you have better ideas instead of arpack.

@@ -166,9 +193,9 @@ class PCA(BaseEstimator, TransformerMixin):
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
compute the estimated data covariance and score samples.

Notes
References
Member

----- -> ----------

@ogrisel
Member
ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

if type(n_components) == str:
Member

Better test:

if isinstance(n_components, six.string_types):

@ogrisel
Member
ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

In retrospect, "full" is a good name but it would be worth mentioning that this is using the standard SVD solver from LAPACK in the docstring.

# Handle svd_solver
svd_solver = self.svd_solver
if svd_solver == 'auto':
if n_components < .8 * min(X.shape):
Member

It should be:

if n_components >=1 and n_components < .8 * min(X.shape):

instead. n_components in [0, 1] means using the explained variance ratio to select the actual number of components once a full SVD has been performed.

Contributor Author

Right
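
Concretely, the fractional case looks like this (an illustrative snippet): the whole spectrum has to be computed before the cut-off is known, hence the full SVD path.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)

# Keep as many components as needed to explain 80% of the variance.
pca = PCA(n_components=0.8, svd_solver='full').fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())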

@MechCoder
Member

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

@MechCoder
Member

@giorgiop Can you rebase over master?

@giorgiop
Contributor Author

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

Ahah! Reverting ...

@giorgiop giorgiop force-pushed the pca+randomizedpca branch from 1195df9 to 328ebfa Compare March 10, 2016 22:11
MechCoder added a commit that referenced this pull request Mar 10, 2016
[MRG+3] Collapsing PCA and RandomizedPCA
@MechCoder MechCoder merged commit 25c931a into scikit-learn:master Mar 10, 2016
@MechCoder
Member

Merged !!!!


@giorgiop
Contributor Author

Fantastic. Great team effort!

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 11, 2016 via email

@amueller
Member

Hm, by making "auto" the default, this was a backward-incompatible change, right? It's a great change, but looking back, usually we are more careful, right?

@giorgiop
Contributor Author

You are right. We hadn't thought of that at the time!

@cbrnr
Contributor
cbrnr commented Aug 18, 2016

Also, could it be that the resulting components can now have a different sign because svd_flip is now always called? This also leads to different results as compared to 0.17.1...

@giorgiop
Contributor Author

@cbrnr #6135 ?

@cbrnr
Contributor
cbrnr commented Aug 19, 2016

No, I meant that with the introduction of svd_flip, the resulting PCA components will be different (in general) as compared to the previous version in 0.17.1. For example (run that with 0.17.1 and then with the current master):

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


np.random.seed(1)
x = np.random.rand(10, 5000)
x[::2] += 0.5 * x[::2] + 2 * np.random.rand(5, 5000)  # correlate the even rows
x[-1, :] *= 10  # scale up the last channel

pca = PCA()
pca.fit(x.T)
plt.imshow(pca.components_, interpolation="nearest")
plt.title("pca.components_ sklearn master")
plt.show()

The results are as follows (pca.components_ plotted with 0.17.1 and with master):
[image: sklearn 0.17.1]
[image: sklearn master]

Apparently, rows 4, 5, and 9 have been flipped in the new PCA version. I think it is really bad when results suddenly change without prior notice, especially for such a run-of-the-mill method as PCA.

@amueller
Member

It is a bit unfortunate, but I think it will be hard for the results to stay exactly the same. @cbrnr can you point to where svd_flip is now called where it wasn't before? @giorgiop was that for consistency between the different solvers? If a different solver is chosen, we cannot really expect the exact same result, though a close result would be good. Still, calling the sign flip more often allows at least some kind of consistent behavior.

@cbrnr I would argue that relying on the signs of eigenvectors is a really bad idea, and if you expect them to stay the same, you run into trouble either way. Without the sign flip, the outcome is random anyhow (determined by minor choices in the SVD solver). Why did this come up?

[I totally got bitten by this when writing my book btw]

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 26, 2016 via email

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

@amueller I'm not saying that the current solution using svd_flip is not a good one. In fact, I very much appreciate that there is a deterministic rule for how the sign is determined. I'm just saying that it is not good practice for the results of a standard algorithm to suddenly change without prior notice, that's all (I spent some time figuring out why I suddenly got different results even though I hadn't changed anything in my code). Anyway, maybe a warning or a note in the next release would be helpful (preferably one that is displayed when using PCA).

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 28, 2016 via email

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

Yes, you're right, in this case the results were indeed undetermined. I'll work on the PR explaining the problem in the docs.

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 28, 2016 via email

@amueller
Member

maybe a note in the API changes section (maybe that section needs a better title - I feel it should be "things you need to be careful about when updating" - only shorter ;)

@amueller
Member

@cbrnr and I think we all agree that unexpected changes are always bad. We try our best to avoid them.

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

@amueller Maybe "Important changes" instead of API changes? Anyway, I'll add something to this section. Also, I didn't want to complain. Of course such (and much worse) things happen and I'm happy to contribute!

@@ -351,6 +355,12 @@ def randomized_svd(M, n_components, n_oversamples=10, n_iter=2,
# this implementation is a bit faster with smaller shape[1]
M = M.T

# Adjust n_iter. 7 was found a good compromise for PCA. See #5299
if n_components < .1 * min(M.shape) and n_iter < 7:
Member

This behavior is really strange. I'm trying to fix it in #7311.

@abhigenie92
abhigenie92 commented Dec 14, 2018

I don't understand why running SVD twice on the same matrix can lead to principal components with signs flipped.
