[MRG] Implement randomized PCA #12841
Conversation
Force-pushed from b5964c1 to aa0b093.
Thanks. I made a few comments, but it looks good overall.
Force-pushed from 201ff7d to fec47b6.
@jeremiedbb Thanks for the helpful comments! I think I've addressed them all now.
Force-pushed from fec47b6 to fe3d3a8.
Force-pushed from f39254e to f453f5b.
Force-pushed from f453f5b to 99c1b1d.
Hi, @pavlin-policar.
@Koncopd Thanks for bringing attention to this - I had no idea this was an issue. I am very surprised that scipy would overlook it. I ran a test of my own, with a sparser data set (0.01 non-zero values), and the differences weren't quite as drastic. Not only is it less efficient in memory, but it's also slower! I'll be sure to address this asap.
Force-pushed from 91e9596 to 33574d1.
First round, mostly nitpicks.
I'll try to provide more in-depth feedback soon!
@NicolasHug Thanks for your feedback; the only thing I have to go on is the existing code. If there are any other stylistic changes I need to address, please let me know. I've written a note about how sparse matrices are handled in the PCA docstring, but I feel as though it could be put better. If I come up with better wording, I'll push that up as well.
Force-pushed from e0d4180 to 473b940.
Some more comments.
I hope we can find a way to refactor this a bit ^^
```python
elif power_iteration_normalizer == 'QR':
    Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode='economic')
    Q, _ = linalg.qr(safe_sparse_dot(A.T, Q), mode='economic')
Q = safe_sparse_dot(A, Q)
```
This is a nice refactoring to have.
It's not from you, but could you please fix the docstring of this function: it says the returned Q has shape (size, size), but it's actually (A.shape[0], size).
```python
                       power_iteration_normalizer)


def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
```
Is there any reason to make it public?
IMO this should be private (and randomized_svd should also be private), as it'd be easier to change the interface if we need to.
Let's see what core devs say
I wouldn't really mind either way, but pylint will complain when we try to access extmath._randomized_pca from sklearn.decomposition.PCA.
```python
def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
                   power_iteration_normalizer="auto", flip_sign=True,
                   random_state=0):
    """Computes a truncated randomized PCA decomposition.
```
PCA is not really a decomposition, so I wouldn't call it a "PCA decomposition".
Also, I think it'd be good to explicitly say that this method is only used to compute the PCA of sparse matrices.
May I suggest:
- renaming randomized_pca into randomized_pca_sparse
- changing the docstring into:
```python
"""Compute a truncated randomized SVD decomposition

This is similar to randomized_svd, but it is used for computing the
PCA of sparse matrices. We avoid explicitly centering the sparse matrix A,
which would make it dense.
"""
```
Of course that would only make sense if you agree with my other comment about only using this function for sparse matrices.
> PCA is not really a decomposition so I wouldn't call it a PCA decomposition.

Isn't it? I always thought of PCA as a decomposition of variance. The matrix is "decomposed" into loadings and the directions of highest variance. Would "PCA projection" be better?

> renaming randomized_pca into randomized_pca_sparse

Some renaming might be fine, but it might be worth pointing out that the only real difference between randomized_svd and randomized_pca is that randomized_pca computes the SVD of the centered x matrix, without explicitly needing to center it. So basically, if you pre-centered x, the results would be exactly the same from randomized_svd as from randomized_pca. And since centering dense matrices is cheap, you'd only ever do this on sparse matrices.
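To make this concrete, here is a minimal numpy sketch of the implicit-centering identity under discussion (dense arrays are used purely for illustration; the point is that the explicit route would densify a sparse A, while the implicit one would not):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(20, 10)     # stand-in for the data matrix (sparse in practice)
Q = rng.rand(10, 3)      # a random test matrix, as in the range finder
means = A.mean(axis=0, keepdims=True)

# Explicit centering: forms the dense centered matrix A - 1 @ means.
explicit = (A - means) @ Q

# Implicit centering: (A - 1 @ means) @ Q == A @ Q - 1 @ (means @ Q),
# so the centered product is obtained without materializing A - means.
implicit = A @ Q - np.ones((A.shape[0], 1)) @ (means @ Q)

assert np.allclose(explicit, implicit)
```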
> PCA is not really a decomposition

What I meant is that it's not a decomposition (or factorization) like SVD is: it does not decompose your matrix into a product of other matrices. It just computes a projection matrix. But you're right that it decomposes the variance.
I 100% agree with you about the difference between randomized_svd and randomized_pca, hence my suggestion about the naming, and the docstring explaining all this.
```python
    means, _ = mean_variance_axis(A, axis=0)
else:
    means = np.mean(A, axis=0)
c = np.atleast_2d(means)
```
Call this means instead of c.
```diff
@@ -519,7 +528,7 @@ def _fit_truncated(self, X, n_components, svd_solver):
         elif svd_solver == 'randomized':
             # sign flipping is done inside
-            U, S, V = randomized_svd(X, n_components=n_components,
+            U, S, V = randomized_pca(X, n_components=n_components,
```
I think we should play it safe here and only call randomized_pca when X is sparse. Who knows, there might be bugs in randomized_pca, and randomized_svd is probably faster for dense matrices.
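Something like the following dispatch would implement that suggestion (a sketch only; _fit_randomized is a hypothetical helper name, and randomized_pca is the function added in this PR, not an existing sklearn API):

```python
from scipy import sparse
from sklearn.utils.extmath import randomized_svd


def _fit_randomized(X, n_components, random_state):
    # Hypothetical helper: keep the battle-tested randomized_svd for
    # dense input, and only route sparse input through randomized_pca.
    if sparse.issparse(X):
        # randomized_pca (added in this PR) centers X implicitly
        return randomized_pca(X, n_components=n_components,
                              random_state=random_state)
    # Dense input: centering is cheap, so do it explicitly and
    # keep using the existing code path.
    return randomized_svd(X - X.mean(axis=0), n_components=n_components,
                          random_state=random_state)
```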
```python
    means = np.mean(A, axis=0)
c = np.atleast_2d(means)

if n_samples >= n_features:
```
This section

```python
Q = random_state.normal(
    size=(n_features, n_components + n_oversamples)
)
if A.dtype.kind == "f":
    Q = Q.astype(A.dtype, copy=False)
```

should be above the if / else blocks.
You probably put more thought into this than me, but are you sure there's no way to factorize these 2 cases (n_samples >= n_features and n_samples < n_features) into a single one, like in randomized_svd? My hope is that if we can factorize these, then maybe we can extend randomized_svd instead of having a whole new function that does almost the same thing.
The reason I haven't found a good way to refactor the 2 use cases (not for lack of trying) is that the math changes. In one case, we compute the mean of x on axis 0, and in the other case on axis 1. Also, the order of the power iteration operations changes as well. Maybe there's some algebra you could do to rearrange things, but I'm not smart enough for that. I've thought about refactoring it in some way, but anything I come up with requires a whole lot of ifs, which will just make things unreadable.
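For what it's worth, a small numpy sketch of why the two branches differ: in the transposed (n_samples < n_features) case, the mean-correction term attaches to the other side of the product (dense arrays used only for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(5, 12)                    # n_samples < n_features
means = A.mean(axis=0, keepdims=True)  # column means of the original A
Q = rng.rand(5, 3)

# Transposing the centered matrix: (A - 1 @ means).T = A.T - means.T @ 1.T,
# so the correction now involves the column sums of Q rather than the
# product means @ Q used in the n_samples >= n_features branch.
explicit = (A - np.ones((5, 1)) @ means).T @ Q
implicit = A.T @ Q - means.T @ (np.ones((1, 5)) @ Q)

assert np.allclose(explicit, implicit)
```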
```python
Q, _ = linalg.qr(Q, mode="economic")

QA = safe_sparse_dot(A.T, Q) - \
```
Also, please try not to deviate from the randomized_svd names, because this makes comparison between the 2 methods more complicated than it could be. QA.T should be B, R should be Uhat (or Vhat), and the matrix A should be M.
Reviewed tests
```python
pca = PCA(n_components=2, svd_solver='full', random_state=rng).fit(X)
apca = PCA(n_components=2, svd_solver='arpack', random_state=rng).fit(X)
# Increase the number of power iterations to get greater accuracy in tests
rpca = PCA(n_components=2, svd_solver='randomized', iterated_power=40,
```
This whole test_singular_values() test should not be changed IMO.
If you have to set iterated_power=40 for the test to pass now, that's a regression, and that confirms my suggestion: randomized_pca should only be called for sparse matrices. Dense matrices should still use randomized_svd.
Also, in general, please avoid style fixes that are not related to the issue (like spaces between **, etc.). I know you mean well, but it makes it harder to determine which changes are important during the review.
```python
n_samples = 100
n_features = 80

X = rng.binomial(1, 0.1, (n_samples, n_features))
```
nitpick: use size=
```python
X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)
assert(sp.sparse.issparse(X))

pca = PCA(n_components=3, svd_solver=svd_solver)
```
Suggested change:
```diff
-pca = PCA(n_components=3, svd_solver=svd_solver)
+pca = PCA(svd_solver=svd_solver)
```
```python
X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)

pca = PCA(n_components=3, svd_solver='auto')
```
Suggested change:
```diff
-pca = PCA(n_components=3, svd_solver='auto')
+pca = PCA(svd_solver='auto')
```
```python
@pytest.mark.parametrize('dtype',
                         (np.int32, np.int64, np.float32, np.float64))
def test_randomized_pca_low_rank(dtype):
```
Please consider integrating these 3 tests into the original ones from randomized_svd. I think that parametrizing the tests with either randomized_svd or randomized_pca, and centering X accordingly, could do it pretty easily.
I'm concerned about so much duplicated code.
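Something along these lines could work (a sketch; randomized_pca is the function from this PR and is assumed here to be importable from sklearn.utils.extmath alongside randomized_svd):

```python
import numpy as np
import pytest
from sklearn.utils.extmath import randomized_svd
from sklearn.utils.extmath import randomized_pca  # added by this PR


@pytest.mark.parametrize('use_pca', [False, True])
def test_randomized_low_rank(use_pca):
    rng = np.random.RandomState(42)
    X = rng.rand(50, 20)

    if use_pca:
        # randomized_pca centers X implicitly
        U, s, V = randomized_pca(X, n_components=3, random_state=0)
    else:
        # randomized_svd needs pre-centered input to match PCA semantics
        U, s, V = randomized_svd(X - X.mean(axis=0), n_components=3,
                                 random_state=0)

    # both parametrizations go through the same assertions
    assert s.shape == (3,)
```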
Reviewed tests
I feel like this PR could benefit from using […]
I am wondering what the status of this PR is? It would be great to see it merged. It's almost an embarrassment that scikit does not have randomized PCA for sparse matrices!
@NicolasHug's comments imply that it is awaiting some work from @pavlin-policar. @pavlin-policar, would you rather someone else take over and complete this work?
Sorry about that, I've been quite busy for the past couple of months and, to be quite honest, this wasn't high on my priorities. @NicolasHug suggested several minor things, which would be quick to fix, but the iterated_power regression noted above is worrying and might suggest a bug, or it might not. Maybe this algorithm simply requires more iterations to converge (which seems plausible as well). Either way, I need to look into this a bit more. If somebody would be willing to take over this, that would be great as well. I likely won't have a lot of time to dedicate to this for another month. I think the only major thing left remaining is the iterations issue I outlined above.
From what I can tell, this line will actually form the dense covariance matrix. One way to compute PCA is by computing the eigenvectors of the covariance matrix, but that is generally very inefficient because it involves forming the covariance matrix. Instead, by doing a bit of algebra, we can compute the eigenvectors of the covariance matrix from the SVD of the (centered) data matrix directly, without ever forming it. So I don't think this is applicable here.
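For reference, the algebra alluded to here is standard: the covariance spectrum can be read off the singular values of the centered data. A quick numpy check of this identity:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 8)
Xc = X - X.mean(axis=0)
n = X.shape[0]

# The expensive route: form the n_features x n_features covariance
# matrix and eigendecompose it.
cov = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending order

# The cheap route: SVD of the centered data. Since Xc = U @ S @ V.T,
# cov = V @ (S**2 / (n - 1)) @ V.T, so the spectra coincide.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

assert np.allclose(eigvals, s ** 2 / (n - 1))
```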
@dkobak, there are better ways to indicate your interest in a feature. Taking over this PR would be the best one. Thanks.
@NicolasHug Apologies if my comment came across as snarky; that was not my intention. But I was genuinely surprised that scikit does not have this feature, given that it has randomized SVD implemented and the math to add "implicit" centering is quite straightforward (as Pavlin rightly noticed). Anyway, what I really meant is that it'd be really great to see this implemented. And I think Pavlin is doing a great job here.
#12319 would be a (faster) alternative to 'randomized' for sparse matrices that also allows running matrix-free SVD, i.e. on a matrix given by a function - in particular, performing implicit data centering without changing the data. For that matter, the ARPACK SVD solver is also matrix-free by design. Thus, one should really make the implicit data centering solver-agnostic, allowing any matrix-free SVD solver to be used, rather than concentrating just on 'randomized'.
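To illustrate the solver-agnostic idea: the centered matrix can be wrapped in a scipy LinearOperator that applies the mean correction on the fly, and handed to any matrix-free solver such as ARPACK's svds. A sketch, not this PR's implementation:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(1000, 300, density=0.01, format='csr', random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()

# Represent X - mu as a LinearOperator: matvec/rmatvec apply the mean
# correction on the fly, so the dense centered matrix is never formed.
Xc = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X @ v - (mu @ v) * np.ones(X.shape[0]),
    rmatvec=lambda v: X.T @ v - mu * v.sum(),
    dtype=np.float64,
)

# Any matrix-free SVD solver can now compute the PCA factors, e.g. ARPACK:
U, s, Vt = svds(Xc, k=10)
```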
Nowadays, we have the […]
Reference Issues/PRs
Following the discussion in #12794, I've changed the randomized SVD implementation so it now supports computing PCA as well, without ever having to explicitly center X. This is especially useful for large sparse matrices, which can't be centered.
What does this implement/fix? Explain your changes.
1. Add degrees of freedom (ddof) parameters to sparse functions that compute variance. Error checking has been added for negative ddof, when ddof >= n_samples, or when the data contain NaNs and n_samples - ddof - num_nans <= 0.
2. Implement the randomized PCA algorithm. I was unable to reuse randomized_svd (although the algorithms are conceptually very similar, their code is slightly different). In the case where there are more features than samples, it is more efficient to transpose the matrix and work with that. However, simply transposing isn't ok, because we also need the means from the original columns. This would make randomized_svd quite complicated. Furthermore, randomized_svd is strongly tied to randomized_range_finder, and I would need to add similar changes to both functions, making them quite confusing. For these reasons, I strongly feel adding the similar randomized_pca function is justified.

Any other comments?
The previous behaviour raised a TypeError if a sparse matrix was passed to PCA.fit. The behaviour is now somewhat different. Sparse matrices are supported with svd_solver='randomized'; otherwise, a ValueError is raised indicating the wrong solver. If the solver is set to 'auto', then the 'randomized' solver is always selected for sparse matrices.
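Assuming the behaviour lands as described above, usage would look roughly like this (illustrative sketch only):

```python
import scipy.sparse as sp
from sklearn.decomposition import PCA

X = sp.random(1000, 300, density=0.01, format='csr', random_state=0)

# Sparse input is now accepted by the randomized solver ...
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
pca.fit(X)

# ... and svd_solver='auto' also selects 'randomized' for sparse X.
# Any other explicit solver would raise a ValueError on sparse input.
PCA(n_components=10, svd_solver='auto', random_state=0).fit(X)
```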