[MRG+3] Collapsing PCA and RandomizedPCA #5299
Conversation
Force-pushed: b66eebf → c6b93f0
I have the feeling that …

_BasePCA was originally introduced at the same time as IncrementalPCA with …

On Wed, Sep 23, 2015 at 9:10 AM, Olivier Grisel notifications@github.com wrote: …

Ah ok sorry for the confusion. I thought it was about the old …

Inheritance from …
Force-pushed: 141f858 → bf6416a
We have to decide what to do with this behaviour.

This paragraph was in PCA's docstring. RandomizedPCA handled it calling …

To note: I have been working on the performance of …
Force-pushed: 08f736f → 669f136
```python
        return X_original


@deprecated("it will be removed in 0.19. Use PCA(svd_solver='randomized') "
            "instead. The new implementation DOES NOT store"
            "whithen components_. Apply transform to get them.")
```
whiten (there is a typo, "whithen", in the deprecation message above)
Thanks.
Force-pushed: a62d139 → 9bb6750
I agreed with @ogrisel to call …
Force-pushed: 9bb6750 → b8a42a9
Simple runtime benchmark script
Output:
This supports the chosen …
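The benchmark script and its output were not preserved in this scrape. As a rough illustration only, a runtime comparison between the solvers might look like the following sketch (the dataset shape, component count, and timing style are assumptions, not the original script):

```python
# Hypothetical sketch of a runtime benchmark comparing PCA solvers;
# the original script from this comment was not captured.
import time
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(2000, 500)  # assumed dataset size

for solver in ("full", "randomized"):
    tic = time.time()
    PCA(n_components=20, svd_solver=solver, random_state=0).fit(X)
    print("%-10s %.3fs" % (solver, time.time() - tic))
```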
```diff
@@ -166,9 +193,9 @@ class PCA(BaseEstimator, TransformerMixin):
     http://www.miketipping.com/papers/met-mppca.pdf. It is required to
     computed the estimated data covariance and score samples.

-    Notes
+    References
```
`-----` -> `----------` (the underline must match the length of the new "References" heading)
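For context, numpydoc requires the dashed underline to be exactly as long as the section title, which is what this review comment points out. A minimal illustrative docstring (not taken from the PR):

```python
def example():
    """Short summary.

    References
    ----------
    Tipping, M. E., and Bishop, C. M. "Mixtures of Probabilistic
    Principal Component Analysers."
    """
```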
Force-pushed: b8a42a9 → e7c9b36
instead of 'full' I would use 'lapack'.
```python
    def inverse_transform(self, X):
        """Transform data back to its original space, i.e.,
        return an input X_original whose transform would be X
```

```python
        if type(n_components) == str:
```
Better test:

```python
if isinstance(n_components, six.string_types):
```
In retrospect, "full" is a good name, but it would be worth mentioning in the docstring that this is using the standard SVD solver from LAPACK.
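To illustrate the point, here is a small sketch (assuming the post-PR PCA API) showing that `svd_solver='full'` agrees, up to sign, with SciPy's LAPACK-backed SVD on centered data:

```python
# Sketch: 'full' is the standard LAPACK SVD applied to centered data.
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(30, 5)
Xc = X - X.mean(axis=0)                          # PCA centers the data first
U, S, Vt = linalg.svd(Xc, full_matrices=False)   # LAPACK-backed SVD

pca = PCA(n_components=5, svd_solver='full').fit(X)
# Same components; signs may differ because PCA applies a sign-flip convention.
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))  # True
```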
```python
        # Handle svd_solver
        svd_solver = self.svd_solver
        if svd_solver == 'auto':
            if n_components < .8 * min(X.shape):
```
It should be:

```python
if n_components >= 1 and n_components < .8 * min(X.shape):
```

instead. `n_components` in [0, 1] means using the explained variance ratio to select the actual number of components once a full SVD has been performed.
Right
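As a minimal sketch of the fractional `n_components` behaviour described above (assuming the current PCA API; the data and the 0.8 threshold are made up):

```python
# A float in (0, 1) asks for as many components as needed to reach
# that fraction of explained variance; this requires the full SVD path.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 20)
pca = PCA(n_components=0.8, svd_solver='full').fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```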
@giorgiop Better to use np.allclose or assert_array_almost_equal then :P
@giorgiop Can you rebase over master?

Ahah! Reverting ...
Force-pushed: 1195df9 → 328ebfa

Merged: [MRG+3] Collapsing PCA and RandomizedPCA
Merged !!!!

Fantastic. Great team effort!

> Fantastic. Great team effort!

Whoooot!!!!!! Thank you, in the name of global warming, thank you for your contribution to making everybody's code run faster.
Hm, by making …

You are right. We hadn't thought of that at the time!

Also, could it be that the resulting components can now have a different sign because …
No, I meant that with the introduction of …

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

np.random.seed(1)
# 10 signals with 5000 samples; boost some rows so they dominate
x = np.random.rand(10, 5000)
x[::2] += 0.5 * x[::2] + 2 * np.random.rand(5, 5000)
x[-1, :] *= 10

pca = PCA()
pca.fit(x.T)

plt.imshow(pca.components_, interpolation="nearest")
plt.title("pca.components_ sklearn master")
```

Apparently, rows 4, 5, and 9 have been flipped in the new PCA version. I think it is really bad when results suddenly change without prior notice, especially for such a run-of-the-mill method like PCA.
It is a bit unfortunate, but I think it will be hard for the results to stay exactly the same.

@cbrnr can you point to where the svd_flip is now called where it wasn't before?

@giorgiop why was that for consistency between the different solvers? If a different solver is chosen, we cannot really expect the exact same result, though a close result would be good.

@cbrnr I would argue that relying on the signs of eigenvectors is a really bad idea, and if you expect them to stay the same, you run into trouble either way. Without the sign flip, the outcome is random anyhow (created by minor choices in the SVD solver). Why did this come up? [I totally got bitten by this when writing my book btw]
> @cbrnr I would argue that relying on the signs of eigenvectors is a really bad idea, and if you expect them to stay the same, you run into trouble either way. Without the sign flip, the outcome is random anyhow (created by minor choices in the SVD solver).

+1. It's in the range of numerical instabilities.
@amueller I'm not saying that the current solution using …

> I'm just saying that it is not good practice that results of a standard algorithm suddenly change without prior notice, that's all (I spent some time figuring out why I suddenly got different results even though I hadn't changed anything in my code).

While I very strongly agree that results of algorithms should not be changed without warning from one release to another, in this specific case the sign of the PCA is undetermined. It might flip from one computer to another, or when changing the underlying linear algebra libraries.

> Anyway, maybe a warning or a note in the next release would be helpful (preferably one that is displayed when using PCA).

Once again, I find that a warning here would not be a good thing, because anyhow, the result was undefined previously. So people would be getting warnings about undefined behavior.

That said, a note in the release notes explaining the problem would be good. Do you want to do a pull request? It's in doc/whats_new.rst
Yes, you're right, in this case the results were indeed undetermined. I'll work on the PR explaining the problem in the docs.

> I'll work on the PR explaining the problem in the docs.

Thanks, that's extremely useful!
maybe a note in the API changes section (maybe that section needs a better title - i feel it should be "things you need to be careful about when updating" - only shorter ;)

@cbrnr and I think we all agree that unexpected changes are always bad. We try our best to avoid them.

@amueller Maybe "Important changes" instead of "API changes"? Anyway, I'll add something to this section. Also, I didn't want to complain. Of course such (and much worse) things happen, and I'm happy to contribute!
```diff
@@ -351,6 +355,12 @@ def randomized_svd(M, n_components, n_oversamples=10, n_iter=2,
         # this implementation is a bit faster with smaller shape[1]
         M = M.T

+    # Adjust n_iter. 7 was found a good compromise for PCA. See #5299
+    if n_components < .1 * min(M.shape) and n_iter < 7:
```
This behavior is really strange. I'm trying to fix it in #7311.
I don't understand why running SVD twice on the same matrix can lead to principal components with signs flipped.
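For readers puzzled by the sign question: the sign of each singular vector pair is mathematically arbitrary, and `sklearn.utils.extmath.svd_flip` (the function discussed in this thread) pins it down deterministically. A small sketch:

```python
# Sketch: svd_flip makes the otherwise arbitrary SVD signs deterministic
# by forcing the largest-magnitude entry of each column of U to be positive.
import numpy as np
from scipy import linalg
from sklearn.utils.extmath import svd_flip

X = np.random.RandomState(0).rand(20, 5)
U, S, Vt = linalg.svd(X, full_matrices=False)

# Both (U, Vt) and (-U, -Vt) are valid SVDs of X; svd_flip maps them
# to the same sign convention.
U1, Vt1 = svd_flip(U.copy(), Vt.copy())
U2, Vt2 = svd_flip(-U, -Vt)
print(np.allclose(U1, U2) and np.allclose(Vt1, Vt2))  # True
```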
Fixes #5243 and #3930.

To do:

- `svd_solver == 'arpack'`
- `auto` policy
- `IncrementalPCA` by inheritance from `_BasePCA`; see [MRG+2] Incremental PCA #3285
- ~~add `flip_sign` param, True by default~~ we flip by default, without introducing a param for controlling the behaviour
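Putting the outcome together, the migration path implied by the deprecation message above is roughly the following (a sketch, assuming the post-merge API):

```python
# RandomizedPCA is deprecated by this PR; the randomized solver is now
# a parameter of plain PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 50)

# Formerly: RandomizedPCA(n_components=10).fit(X)
pca = PCA(n_components=10, svd_solver='randomized', random_state=0).fit(X)
print(pca.transform(X).shape)  # (100, 10)
```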