[MRG+3] Collapsing PCA and RandomizedPCA by giorgiop · Pull Request #5299 · scikit-learn/scikit-learn · GitHub

[MRG+3] Collapsing PCA and RandomizedPCA #5299


Merged · 1 commit merged into scikit-learn:master on Mar 10, 2016

Conversation

giorgiop
Contributor

Fixes #5243 and #3930.

to do

  • collapse the two classes
  • old tests passing
  • integrate svd_solver=='arpack'
  • benchmark the 3 solvers and establish the best auto policy
  • fix docstrings
  • unify with IncrementalPCA by inheritance from _BasePCA; see [MRG+2] Incremental PCA #3285
  • add flip_sign param, True by default (superseded: we flip by default, without introducing a param for controlling the behaviour)
  • arpack backport updated

@giorgiop giorgiop force-pushed the pca+randomizedpca branch 3 times, most recently from b66eebf to c6b93f0 Compare September 23, 2015 12:16
@ogrisel
Member
ogrisel commented Sep 23, 2015

I have the feeling that IncrementalPCA has very little common code to share with the regular PCA / _BasePCA classes. IMO we should not try to use inheritance when there is no significant amount of code to reuse. The indirections introduced by inheritance can make the code harder to read, so it has to be justified by code reuse.

@kastnerkyle
Member

_BasePCA was originally introduced at the same time as IncrementalPCA with an eye toward refactoring PCA to share this common code; most of those methods should hold for PCA, since they were taken from there originally. If it is no longer the case that we want to try and unify the two (IncrementalPCA and PCA), it may be worth just putting _BasePCA back onto IncrementalPCA and keeping the code separate.


@ogrisel
Member
ogrisel commented Sep 23, 2015

Ah, OK, sorry for the confusion. I thought this was about the old ProbabilisticPCA, which had actually already been collapsed into PCA.

@giorgiop
Contributor Author

Inheritance from _BasePCA was smooth and is now in the branch.
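
For illustration, the shape of that pattern is roughly the following. This is a toy sketch, not the actual _BasePCA code; the class and method bodies are simplified stand-ins:

import numpy as np

class _BasePCASketch(object):
    # Everything that only needs the fitted components_ and mean_ can live
    # in the base class and be shared by PCA and IncrementalPCA.
    def transform(self, X):
        return np.dot(X - self.mean_, self.components_.T)

    def inverse_transform(self, X):
        return np.dot(X, self.components_) + self.mean_

class TinyPCA(_BasePCASketch):
    # A subclass only has to provide its own fitting strategy.
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self

With this layout, transform and inverse_transform are written once and inherited by every fitting strategy.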

@giorgiop
Contributor Author

We have to decide what to do with this behaviour:

Due to implementation subtleties of the Singular Value Decomposition (SVD),
which is used in this implementation, running it twice on the same matrix
can lead to principal components with signs flipped (change in direction).
For this reason, it is important to always use the same estimator object to
transform data in a consistent fashion.

This paragraph was in PCA's docstring. RandomizedPCA handled it by calling svd_flip in randomized_svd. We can either call it for all the svd_solver methods or for none of them.

To note: I have been working on the performance of randomized_svd in #5141 and svd_flip can be expensive because of this operation.
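
For reference, the helper that resolves the ambiguity is sklearn.utils.extmath.svd_flip. A minimal sketch of both the sign issue and the fix, on illustrative data:

import numpy as np
from scipy.linalg import svd
from sklearn.utils.extmath import svd_flip

rng = np.random.RandomState(0)
X = rng.randn(50, 20)

U, S, Vt = svd(X, full_matrices=False)
# Negating a left/right singular vector pair yields an equally valid SVD,
# which is why two runs can disagree on the sign of a component.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0] *= -1
assert np.allclose((U2 * S).dot(Vt2), X)

# svd_flip enforces one deterministic sign convention on both variants.
U_f, Vt_f = svd_flip(U, Vt)
U2_f, Vt2_f = svd_flip(U2, Vt2)
assert np.allclose(U_f, U2_f) and np.allclose(Vt_f, Vt2_f)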

@giorgiop giorgiop force-pushed the pca+randomizedpca branch 2 times, most recently from 08f736f to 669f136 Compare September 30, 2015 08:52
return X_original
@deprecated("it will be removed in 0.19. Use PCA(svd_solver='randomized') "
"instead. The new implementation DOES NOT store"
"whithen components_. Apply transform to get them.")
Member

whiten

Contributor Author

Thanks.

@giorgiop
Contributor Author

I agreed with @ogrisel to call svd_flip in every method. We chose not to expose a flip_sign param, so as to keep a lighter signature for the PCA constructor.
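
A quick way to eyeball the cross-solver consistency this buys, assuming the merged svd_solver API (a hypothetical check, not from the PR):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 30)

# With svd_flip applied on every code path, all solvers follow the same
# sign convention, so the leading components should typically match closely.
for solver in ['full', 'arpack', 'randomized']:
    pca = PCA(n_components=3, svd_solver=solver, random_state=0).fit(X)
    print("%-10s %s" % (solver, pca.components_[0, :4]))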

@giorgiop
Contributor Author

Simple runtime benchmark script

import gc
from time import time

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(3000, 3000))

solver_list = ['full', 'arpack', 'randomized']

def bench_pca(X, pca):
    gc.collect()  # reduce garbage-collection noise in the timing
    t0 = time()
    pca.fit(X)
    return time() - t0

for n_comp in [10, 50, 100, 500, 1000, 2400, 2999]:
    print("\n n_components=%d" % n_comp)
    for solver in solver_list:
        print("svd_solver='%s'" % solver)
        duration = bench_pca(X, PCA(n_components=n_comp, svd_solver=solver))
        print("done in %0.3fs" % duration)

Output:

 n_components=10
svd_solver='full'
done in 17.718s
svd_solver='arpack'
done in 3.084s
svd_solver='randomized'
done in 0.588s

 n_components=50
svd_solver='full'
done in 19.997s
svd_solver='arpack'
done in 3.725s
svd_solver='randomized'
done in 1.232s

 n_components=100
svd_solver='full'
done in 18.813s
svd_solver='arpack'
done in 5.494s
svd_solver='randomized'
done in 0.611s

 n_components=500
svd_solver='full'
done in 18.712s
svd_solver='arpack'
done in 18.547s
svd_solver='randomized'
done in 2.030s

 n_components=1000
svd_solver='full'
done in 19.876s
svd_solver='arpack'
done in 63.463s
svd_solver='randomized'
done in 5.219s

 n_components=2400
svd_solver='full'
done in 19.190s
svd_solver='arpack'
done in 137.910s
svd_solver='randomized'
done in 19.167s

 n_components=2999
svd_solver='full'
done in 18.724s
svd_solver='arpack'
done in 124.373s
svd_solver='randomized'
done in 28.858s

This supports the chosen auto policy, at least with regard to runtime. We could run tests on accuracy if needed.
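
For the accuracy side, one possible sanity check is to compare each solver's reconstruction error on the same data; a sketch, not run for this PR:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(1000, 300))

for solver in ['full', 'arpack', 'randomized']:
    pca = PCA(n_components=50, svd_solver=solver, random_state=0)
    X_hat = pca.inverse_transform(pca.fit_transform(X))
    rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
    print("%-10s relative reconstruction error: %.6f" % (solver, rel_err))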

@giorgiop giorgiop changed the title [WIP] Collapsing PCA and RandomizedPCA [MRG] Collapsing PCA and RandomizedPCA Sep 30, 2015
@giorgiop
Contributor Author

exact may be a better name for the full PCA. Also let me know if you have better ideas instead of arpack.

@@ -166,9 +193,9 @@ class PCA(BaseEstimator, TransformerMixin):
http://www.miketipping.com/papers/met-mppca.pdf. It is required to
compute the estimated data covariance and score samples.

Notes
References
Member

----- -> ----------

@ogrisel
Member
ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

if type(n_components) == str:
Member

Better test:

if isinstance(n_components, six.string_types):

@ogrisel
Member
ogrisel commented Oct 2, 2015

instead of 'full' I would use 'lapack'.

In retrospect, "full" is a good name but it would be worth mentioning that this is using the standard SVD solver from LAPACK in the docstring.

# Handle svd_solver
svd_solver = self.svd_solver
if svd_solver == 'auto':
if n_components < .8 * min(X.shape):
Member

It should be:

if n_components >=1 and n_components < .8 * min(X.shape):

instead. n_components in [0, 1] means using the explained variance ratio to select the actual number of components once a full SVD has been performed.

Contributor Author

Right
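
Concretely, the fractional case looks like this (an illustrative snippet): the whole spectrum has to be computed before the cut-off is known, hence the full SVD path.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)

# Keep as many components as needed to explain 80% of the variance.
pca = PCA(n_components=0.8, svd_solver='full').fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())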

@MechCoder
Member

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

@MechCoder
Member

@giorgiop Can you rebase over master?

@giorgiop
Contributor Author

@giorgiop Better to use np.allclose or assert_array_almost_equal then :P

Ahah! Reverting ...

@giorgiop giorgiop force-pushed the pca+randomizedpca branch from 1195df9 to 328ebfa Compare March 10, 2016 22:11
MechCoder added a commit that referenced this pull request Mar 10, 2016
[MRG+3] Collapsing PCA and RandomizedPCA
@MechCoder MechCoder merged commit 25c931a into scikit-learn:master Mar 10, 2016
@MechCoder
Member

Merged !!!!


@giorgiop
Contributor Author

Fantastic. Great team effort!

@GaelVaroquaux
Member
GaelVaroquaux commented Mar 11, 2016 via email

@amueller
Member

Hm, by making "auto" the default, this was a backward-incompatible change, right? It's a great change, but looking back, usually we are more careful, right?

@giorgiop
Contributor Author

You are right. We hadn't thought of that at the time!

@cbrnr
Contributor
cbrnr commented Aug 18, 2016

Also, could it be that the resulting components can now have a different sign because svd_flip is now always called? This also leads to different results as compared to 0.17.1...

@giorgiop
Contributor Author

@cbrnr #6135 ?

@cbrnr
Contributor
cbrnr commented Aug 19, 2016

No, I meant that with the introduction of svd_flip, the resulting PCA components will be different (in general) as compared to the previous version in 0.17.1. For example (run that with 0.17.1 and then with the current master):

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


np.random.seed(1)
x = np.random.rand(10, 5000)
x[::2] += 0.5 * x[::2] + 2 * np.random.rand(5, 5000)  # correlate the even rows
x[-1, :] *= 10  # scale up the last channel

pca = PCA()
pca.fit(x.T)
plt.imshow(pca.components_, interpolation="nearest")
plt.title("pca.components_ sklearn master")
plt.show()

The results are as follows (pca.components_ plotted with 0.17.1 and with master):
[image: sklearn 0.17.1]
[image: sklearn master]

Apparently, rows 4, 5, and 9 have been flipped in the new PCA version. I think it is really bad when results suddenly change without prior notice, especially for such a run-of-the-mill method as PCA.

@amueller
Member

It is a bit unfortunate, but I think it will be hard for the results to stay exactly the same. @cbrnr can you point to where svd_flip is now called where it wasn't before? @giorgiop was that for consistency between the different solvers? If a different solver is chosen, we cannot really expect the exact same result, though a close result would be good. Still, calling the sign flip more often allows at least some kind of consistent behavior.

@cbrnr I would argue that relying on the signs of eigenvectors is a really bad idea, and if you expect them to stay the same, you run into trouble either way. Without the sign flip, the outcome is random anyhow (determined by minor choices in the SVD solver). Why did this come up?

[I totally got bitten by this when writing my book btw]

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 26, 2016 via email

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

@amueller I'm not saying that the current solution using svd_flip is not a good one. In fact, I very much appreciate that there is a deterministic rule for how the sign is determined. I'm just saying that it is not good practice for the results of a standard algorithm to suddenly change without prior notice, that's all (I spent some time figuring out why I suddenly got different results even though I hadn't changed anything in my code). Anyway, maybe a warning or a note in the next release would be helpful (preferably one that is displayed when using PCA).

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 28, 2016 via email

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

Yes, you're right, in this case the results were indeed undetermined. I'll work on the PR explaining the problem in the docs.

@GaelVaroquaux
Member
GaelVaroquaux commented Aug 28, 2016 via email

@amueller
Member

maybe a note in the API changes section (maybe that section needs a better title - I feel it should be "things you need to be careful about when updating" - only shorter ;)

@amueller
Member

@cbrnr and I think we all agree that unexpected changes are always bad. We try our best to avoid them.

@cbrnr
Contributor
cbrnr commented Aug 28, 2016

@amueller Maybe "Important changes" instead of API changes? Anyway, I'll add something to this section. Also, I didn't want to complain. Of course such (and much worse) things happen and I'm happy to contribute!

@@ -351,6 +355,12 @@ def randomized_svd(M, n_components, n_oversamples=10, n_iter=2,
# this implementation is a bit faster with smaller shape[1]
M = M.T

# Adjust n_iter. 7 was found a good compromise for PCA. See #5299
if n_components < .1 * min(M.shape) and n_iter < 7:
Member

This behavior is really strange. I'm trying to fix it in #7311.

@abhigenie92
abhigenie92 commented Dec 14, 2018

I don't understand why running SVD twice on the same matrix can lead to principal components with signs flipped.
