[MRG] Implement randomized PCA by pavlin-policar · Pull Request #12841 · scikit-learn/scikit-learn
[MRG] Implement randomized PCA #12841


Closed
pavlin-policar wants to merge 12 commits

Conversation

pavlin-policar
@pavlin-policar pavlin-policar commented Dec 20, 2018

Reference Issues/PRs

Following the discussion in #12794, I've changed the randomized SVD implementation so it now supports computing PCA as well, without ever having to explicitly center X. This is especially useful for large sparse matrices, which can't be centered without densifying them.

What does this implement/fix? Explain your changes.

  1. Add degrees of freedom (ddof) parameters to the sparse functions that compute variance. Error checking has been added for negative ddof, for ddof >= n_samples, and for the case where the data contain NaNs and n_samples - ddof - num_nans <= 0.

  2. Implement the randomized PCA algorithm. I was unable to reuse randomized_svd (although the algorithms are conceptually very similar, their code differs slightly).

    In the case where there are more features than samples, it is more efficient to transpose the matrix and work with that. However, simply transposing isn't enough, because we also need the means of the original columns. This would make randomized_svd quite complicated. Furthermore, randomized_svd is strongly tied to randomized_range_finder, and I would need to make similar changes to both functions, making them quite confusing. For these reasons, I strongly feel that adding a separate randomized_pca function is justified.
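
To make the centering trick concrete, here is a minimal numpy/scipy sketch of the identity the implementation relies on (my own illustration, not the PR's code): a product with the centered matrix equals the product with the sparse matrix plus a rank-1 correction built from the column means, so the dense centered matrix never has to be formed.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.RandomState(0)
    X = sp.random(100, 50, density=0.05, format="csr", random_state=rng)
    Q = rng.normal(size=(50, 10))

    mu = np.asarray(X.mean(axis=0)).ravel()  # column means as a 1-D array

    # (X - ones @ mu) @ Q without densifying X: the centering term is
    # rank-1, so it collapses to the row vector mu @ Q, broadcast over rows.
    implicit = np.asarray(X @ Q) - mu @ Q
    explicit = (X.toarray() - mu) @ Q

    np.testing.assert_allclose(implicit, explicit)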

Any other comments?

The previous behaviour raised a TypeError if a sparse matrix was passed to PCA.fit. The behaviour is now somewhat different: sparse matrices are supported with svd_solver='randomized'; otherwise, a ValueError is raised indicating the wrong solver. If the solver is set to 'auto', the randomized solver is always selected for sparse matrices.
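
To illustrate the proposed behaviour (hypothetical usage; this depends on the changes in this PR, not on released scikit-learn):

    import scipy.sparse as sp
    from sklearn.decomposition import PCA

    X = sp.random(10000, 500, density=0.01, format="csr", random_state=0)

    PCA(n_components=10, svd_solver="randomized").fit(X)  # proposed: works
    PCA(n_components=10, svd_solver="auto").fit(X)        # proposed: selects 'randomized'
    PCA(n_components=10, svd_solver="full").fit(X)        # proposed: raises ValueError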

@pavlin-policar pavlin-policar force-pushed the pca-sparse branch 2 times, most recently from b5964c1 to aa0b093 on December 20, 2018
@pavlin-policar pavlin-policar changed the title Implement randomized PCA [MRG] Implement randomized PCA Dec 20, 2018
@jeremiedbb
Member

Thanks. I made a few comments, but it looks good overall.

@pavlin-policar pavlin-policar force-pushed the pca-sparse branch 6 times, most recently from 201ff7d to fec47b6 on December 21, 2018
@pavlin-policar
Author

@jeremiedbb Thanks for the helpful comments! I think I've addressed them all now.

@pavlin-policar pavlin-policar force-pushed the pca-sparse branch 2 times, most recently from f39254e to f453f5b on February 1, 2019
@Koncopd
Koncopd commented Feb 4, 2019

Hi @pavlin-policar,
I see that you use A.mean(0) here. I also used this approach to compute PCA for sparse matrices without densifying, and I found that A.mean(0) is not very efficient in terms of memory consumption.
[screenshot: memory profiling of A.mean(0)]
It's better to use csr_mean_variance_axis0 from sklearn.utils.sparsefuncs_fast.
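
For example, via the public wrapper mean_variance_axis, which dispatches to the fast CSR routine (a sketch; the memory claim is from my profiling above):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils.sparsefuncs import mean_variance_axis

    A = sp.random(100000, 1000, density=0.01, format="csr", random_state=0)

    # A.mean(0) goes through scipy and reportedly allocates large
    # intermediates; mean_variance_axis works directly on the CSR buffers.
    means_scipy = np.asarray(A.mean(axis=0)).ravel()
    means_sklearn, _ = mean_variance_axis(A, axis=0)

    np.testing.assert_allclose(means_scipy, means_sklearn)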

@pavlin-policar
Author
pavlin-policar commented Feb 4, 2019

@Koncopd Thanks for bringing attention to this - I had no idea it was a problem. I'm very surprised that scipy would overlook it.

I ran a test of my own, with a sparser data set (density 0.01), and the differences weren't quite as drastic:

[screenshot: benchmark at density 0.01]

Not only is it less memory-efficient, it's also slower!

I'll be sure to address this asap.

Member
@NicolasHug NicolasHug left a comment

First round, mostly nitpicks.

I'll try to provide more in-depth feedback soon!

@pavlin-policar
Author

@NicolasHug Thanks for your feedback; the only thing I have to go on is the existing code. If there are any other stylistic changes I need to address, please let me know.

I've written a note about how sparse matrices are handled in the PCA docstring, but I feel as though it could be put better. If I come up with better wording, I'll push that up as well.

Member
@NicolasHug NicolasHug left a comment

Some more comments.

I hope we can find a way to refactor this a bit ^^

elif power_iteration_normalizer == 'QR':
    Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode='economic')
    Q, _ = linalg.qr(safe_sparse_dot(A.T, Q), mode='economic')
Q = safe_sparse_dot(A, Q)
Member

This is a nice refactoring to have.

It's not from you, but could you please fix the docstring of this function? It says the returned Q has shape (size, size), but it's actually (A.shape[0], size).

power_iteration_normalizer)


def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
Member

Is there any reason to make it public?

IMO this should be private (and randomized_svd should also be private), as it'd be easier to change the interface if we need to.

Let's see what core devs say

Author

I wouldn't really mind either way, but pylint will complain when we try to access extmath._randomized_pca from sklearn.decomposition.PCA.

def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
                   power_iteration_normalizer="auto", flip_sign=True,
                   random_state=0):
    """Computes a truncated randomized PCA decomposition.
Member
@NicolasHug NicolasHug Feb 15, 2019

PCA is not really a decomposition, so I wouldn't call it a PCA decomposition.

Also I think it'd be good to explicitly say that this method is only used to compute the PCA of sparse matrices.

May I suggest:

  • renaming randomized_pca into randomized_pca_sparse
  • changing the docstring into:
"""Compute a truncated randomized SVD decomposition

   This is similar to randomized_svd, but it is used for computing the
   PCA of sparse matrices. We avoid explicitly centering the sparse matrix A,
   which would make it dense.
"""

Of course that would only make sense if you agree with my other comment about only using this function for sparse matrices.

Author

PCA is not really a decomposition, so I wouldn't call it a PCA decomposition.

Isn't it? I always thought of PCA as a decomposition of variance. The matrix is "decomposed" into loadings and the directions of highest variance.

Would PCA projection be better?

renaming randomized_pca into randomized_pca_sparse

Some renaming might be fine, but it's worth pointing out that the only real difference between randomized_svd and randomized_pca is that randomized_pca computes the SVD of the centered x matrix without explicitly needing to center it. So if you pre-centered x, randomized_svd and randomized_pca would give exactly the same results. And since centering dense matrices is cheap, you'd only ever do this on sparse matrices.
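
The analogous statement for the exact solvers is easy to check against released scikit-learn (a minimal sketch; randomized_pca itself only exists in this PR):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.RandomState(0)
    X = rng.normal(size=(60, 30))

    # SVD of the pre-centered matrix ...
    _, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

    # ... matches the PCA of the uncentered one.
    pca = PCA(n_components=5, svd_solver="full").fit(X)

    np.testing.assert_allclose(pca.singular_values_, S[:5])
    np.testing.assert_allclose(np.abs(pca.components_), np.abs(Vt[:5]))  # up to sign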

Member
@NicolasHug NicolasHug May 30, 2019

PCA is not really a decomposition

What I meant is that it's not a decomposition (or factorization) like SVD is: it does not decompose your matrix into a product of other matrices. It just computes a projection matrix. But you're right that it decomposes the variance.

I 100% agree with you about the difference between randomized_svd and randomized_pca, hence my suggestion about the naming, and the docstring explaining all this.

    means, _ = mean_variance_axis(A, axis=0)
else:
    means = np.mean(A, axis=0)
c = np.atleast_2d(means)
Member

call this means instead of c

@@ -519,7 +528,7 @@ def _fit_truncated(self, X, n_components, svd_solver):

        elif svd_solver == 'randomized':
            # sign flipping is done inside
-           U, S, V = randomized_svd(X, n_components=n_components,
+           U, S, V = randomized_pca(X, n_components=n_components,
Member

I think we should play it safe here and only call randomized_pca when X is sparse. Who knows, there might be bugs in randomized_pca, and randomized_svd is probably faster for dense matrices.

    means = np.mean(A, axis=0)
c = np.atleast_2d(means)

if n_samples >= n_features:
Member

This section

    Q = random_state.normal(
        size=(n_features, n_components + n_oversamples)
    )
    if A.dtype.kind == "f":
        Q = Q.astype(A.dtype, copy=False)

should be above the if / else blocks.


You probably put more thought into this than me, but are you sure there's no way to factorize these 2 cases (n_samples >= n_features and n_samples < n_features) into a single one, like in randomized_svd? My hope is that if we can factorize them, then maybe we can extend randomized_svd instead of having a whole new function that does almost the same thing.

Author

I haven't found a good way to refactor the two cases (not for lack of trying); the problem is that the math changes. In one case, we compute the mean of x on axis 0, and in the other case on axis 1. The order of the power iteration operations changes as well. Maybe there's some algebra you could do to rearrange things, but I'm not smart enough for that. I've thought about refactoring it in some way, but anything I come up with requires a whole lot of ifs, which just makes things unreadable.
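
For what it's worth, here is a small sketch (my own, not the PR code) of how the rank-1 correction changes sides when you work with the transpose, which is the part I couldn't unify:

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.normal(size=(20, 50))   # n_samples < n_features
    mu = X.mean(axis=0)             # column means of X
    ones = np.ones((X.shape[0], 1))
    Q = rng.normal(size=(20, 5))

    # Centered matrix:   Xc   = X   - ones @ mu[None, :]
    # Its transpose:     Xc.T = X.T - mu[:, None] @ ones.T
    # so products with Xc.T need the correction on the other side.
    implicit = X.T @ Q - mu[:, None] * (ones.T @ Q)
    explicit = (X - mu).T @ Q

    np.testing.assert_allclose(implicit, explicit)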


Q, _ = linalg.qr(Q, mode="economic")

QA = safe_sparse_dot(A.T, Q) - \
Member

Also, please try not to deviate from the randomized_svd names, because this makes comparison between the 2 methods more complicated than it needs to be.

QA.T should be B, R should be Uhat (or Vhat), and the matrix A should be M.

Member
@NicolasHug NicolasHug left a comment

Reviewed tests

pca = PCA(n_components=2, svd_solver='full', random_state=rng).fit(X)
apca = PCA(n_components=2, svd_solver='arpack', random_state=rng).fit(X)
# Increase the number of power iterations to get greater accuracy in tests
rpca = PCA(n_components=2, svd_solver='randomized', iterated_power=40,
Member

This whole test_singular_values() test should not be changed IMO.

If you have to set iterated_power=40 for the test to pass now, that's a regression, and that confirms my suggestion: randomized_pca should only be called for sparse matrices. Dense matrices should still use randomized_svd.

Also, in general, please avoid style fixes that are not related to the issue (like spaces around **, etc.). I know you mean well, but it makes it harder to determine which changes are important during the review.

n_samples = 100
n_features = 80

X = rng.binomial(1, 0.1, (n_samples, n_features))
Member

nitpick: use size=

X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)
assert(sp.sparse.issparse(X))

pca = PCA(n_components=3, svd_solver=svd_solver)
Member

Suggested change
pca = PCA(n_components=3, svd_solver=svd_solver)
pca = PCA(svd_solver=svd_solver)

X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)

pca = PCA(n_components=3, svd_solver='auto')
Member

Suggested change
pca = PCA(n_components=3, svd_solver='auto')
pca = PCA(svd_solver='auto')


@pytest.mark.parametrize('dtype',
                         (np.int32, np.int64, np.float32, np.float64))
def test_randomized_pca_low_rank(dtype):
Member

Please consider integrating these 3 tests into the original ones for randomized_svd. I think that parametrizing the tests over randomized_svd and randomized_pca, and centering X accordingly, could do it pretty easily.

I'm concerned about so much duplicated code.


@NicolasHug
Member

I feel like this PR could benefit from using _eigen_decompose_covariance that was introduced in #13350

@dkobak
Contributor
dkobak commented May 29, 2019

I am wondering what the status of this PR is. It would be great to see it merged. It's almost an embarrassment that scikit does not have randomized PCA for sparse matrices!

@jnothman
Member

@NicolasHug's comments imply that it is awaiting some work from @pavlin-policar. @pavlin-policar, would you rather someone else take over and complete this work?

@pavlin-policar
Author

Sorry about that; I've been quite busy for the past couple of months and, to be honest, this wasn't high on my priorities. @NicolasHug suggested several minor things, which would be quick to fix, but this

This whole test_singular_values() test should not be changed IMO.

If you have to set iterated_power=40 for the test to pass now, that's a regression, and that confirms my suggestion: randomized_pca should only be called for sparse matrices. Dense matrices should still use randomized_svd.

is worrying and might suggest a bug, or it might not. Maybe this algorithm simply requires more iterations to converge (which seems plausible as well). Either way, I need to look into this a bit more.

If somebody would be willing to take this over, that would be great as well. I likely won't have much time to dedicate to this for another month. I think the only major thing remaining is the iteration issue I outlined above.

@pavlin-policar
Author

I feel like this PR could benefit from using _eigen_decompose_covariance

From what I can tell, this line will actually form the dense covariance matrix.

One way to compute PCA is by computing the eigenvectors of the covariance matrix, but that is generally very inefficient because it involves forming the covariance matrix.

Instead, by doing a bit of algebra we can compute the eigenvectors of the covariance matrix from x alone (never needing to actually form cov(x, x)!), which is what scikit-learn has done via randomized_svd.
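
For reference, a quick numeric check of that equivalence (an illustrative sketch, not scikit-learn code):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 20))
    Xc = X - X.mean(axis=0)
    n = X.shape[0]

    # Route 1: eigenvalues of the explicitly formed covariance matrix.
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]

    # Route 2: singular values of the centered data; cov(x, x) is never formed.
    S = np.linalg.svd(Xc, compute_uv=False)

    np.testing.assert_allclose(eigvals, S ** 2 / (n - 1))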

So I don't think this is applicable here.

@NicolasHug
Member

It's almost an embarrassment that scikit does not have randomized PCA for sparse matrices!

@dkobak , there are better ways to indicate your interest in a feature. Taking over this PR would be the best one. Thanks.

@dkobak
Contributor
dkobak commented May 30, 2019

@NicolasHug Apologies if my comment came across as snarky; that was not my intention. But I was genuinely surprised that scikit does not have this feature, given that it has randomized SVD implemented and the math to add "implicit" centering is quite straightforward (as Pavlin rightly noticed). Anyway, what I really meant is that it would be really great to see this implemented. And I think Pavlin is doing a great job here.

@lobpcg
Contributor
lobpcg commented Jul 2, 2019

#12319 would be a (faster) alternative to 'randomized' for sparse matrices that also allows running matrix-free SVD, i.e. on a matrix given as a function. In particular, it allows performing implicit data centering without changing the data. For that matter, the ARPACK SVD solver is also matrix-free by design.

Thus, one should really make the implicit data centering solver-agnostic, allowing any matrix-free SVD solver to be used, rather than concentrating just on 'randomized'.
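
To sketch what solver-agnostic implicit centering could look like (my own illustration; the scipy.sparse.linalg API is real, everything else is an assumption):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, svds

    X = sp.random(1000, 200, density=0.01, format="csr", random_state=0)
    mu = np.asarray(X.mean(axis=0)).ravel()  # column means

    # Matrix-free operator for the centered matrix X - ones @ mu: only its
    # products with vectors are ever needed, so X is never densified.
    def matvec(v):
        v = np.ravel(v)
        return X @ v - (mu @ v) * np.ones(X.shape[0])

    def rmatvec(v):
        v = np.ravel(v)
        return X.T @ v - mu * v.sum()

    Xc = LinearOperator(shape=X.shape, matvec=matvec, rmatvec=rmatvec,
                        dtype=X.dtype)

    U, S, Vt = svds(Xc, k=10)  # ARPACK SVD of the implicitly centered matrix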

@glemaitre
Member

Nowadays we have the arpack solver that handles the sparse case in PCA, so let's close this PR.


@glemaitre glemaitre closed this May 17, 2024