[MRG] Implement randomized PCA #12841
Conversation
Force-pushed from b5964c1 to aa0b093.
Thanks. I made a few comments, but it looks good overall.
Force-pushed from 201ff7d to fec47b6.
@jeremiedbb Thanks for the helpful comments! I think I've addressed them all now.
Force-pushed from fec47b6 to fe3d3a8.
Force-pushed from f39254e to f453f5b.
Force-pushed from f453f5b to 99c1b1d.
Hi, @pavlin-policar.
@Koncopd Thanks for bringing attention to this - I had no idea this was an issue. I am very surprised that scipy would overlook it. I ran a test of my own, with a sparser data set (0.01 non-zero values), and the differences weren't quite as drastic. Not only is it less efficient in memory, but it's also slower! I'll be sure to address this asap.
Force-pushed from 91e9596 to 33574d1.
First round, mostly nitpicks.
I'll try to provide more in-depth feedback soon!
@NicolasHug Thanks for your feedback; the only thing I have to go on is the existing code. If there are any other stylistic changes I need to address, please let me know. I've written a note about how sparse matrices are handled in the PCA docstring, but I feel as though it could be put better. If I come up with better wording, I'll push that up as well.
Force-pushed from e0d4180 to 473b940.
Some more comments.
I hope we can find a way to refactor this a bit ^^
```python
elif power_iteration_normalizer == 'QR':
    Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode='economic')
    Q, _ = linalg.qr(safe_sparse_dot(A.T, Q), mode='economic')
Q = safe_sparse_dot(A, Q)
```
This is a nice refactoring to have.
It's not from you, but could you please fix the docstring of this function: it says the returned Q has shape (size, size), but it's actually (A.shape[0], size).
```python
                       power_iteration_normalizer)


def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
```
Is there any reason to make it public?
IMO this should be private (and randomized_svd should also be private), as it'd be easier to change the interface if we need to.
Let's see what core devs say
I wouldn't really mind either way, but pylint will complain when we try to access extmath._randomized_pca from sklearn.decomposition.PCA.
```python
def randomized_pca(A, n_components, n_oversamples=10, n_iter="auto",
                   power_iteration_normalizer="auto", flip_sign=True,
                   random_state=0):
    """Computes a truncated randomized PCA decomposition.
```
PCA is not really a decomposition, so I wouldn't call it a "PCA decomposition".
Also, I think it'd be good to explicitly say that this method is only used to compute the PCA of sparse matrices.
May I suggest:
- renaming randomized_pca into randomized_pca_sparse
- changing the docstring into:
```python
"""Compute a truncated randomized SVD decomposition

This is similar to randomized_svd, but it is used for computing the
PCA of sparse matrices. We avoid explicitly centering the sparse matrix A,
which would make it dense.
"""
```
Of course that would only make sense if you agree with my other comment about only using this function for sparse matrices.
> PCA is not really a decomposition so I wouldn't call it a PCA decomposition.

Isn't it? I always thought of PCA as a decomposition of variance. The matrix is "decomposed" into loadings and the directions of highest variance. Would "PCA projection" be better?

> renaming randomized_pca into randomized_pca_sparse

Some renaming might be fine, but it might be worth pointing out that the only real difference between randomized_svd and randomized_pca is that randomized_pca computes the SVD of the centered x matrix, without explicitly needing to center it. So basically, if you pre-centered x, the results would be exactly the same from randomized_svd as from randomized_pca. And since centering dense matrices is cheap, you'd only ever do this on sparse matrices.
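To make this concrete, here is a minimal numpy sketch of the implicit-centering identity under discussion (dense arrays are used purely for illustration; the point is that the explicit route would densify a sparse A, while the implicit one would not):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(20, 10)     # stand-in for the data matrix (sparse in practice)
Q = rng.rand(10, 3)      # a random test matrix, as in the range finder
means = A.mean(axis=0, keepdims=True)

# Explicit centering: forms the dense centered matrix A - 1 @ means.
explicit = (A - means) @ Q

# Implicit centering: (A - 1 @ means) @ Q == A @ Q - 1 @ (means @ Q),
# so the centered product is obtained without materializing A - means.
implicit = A @ Q - np.ones((A.shape[0], 1)) @ (means @ Q)

assert np.allclose(explicit, implicit)
```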
> PCA is not really a decomposition

What I meant is that it's not a decomposition (or factorization) like SVD is: it does not decompose your matrix into a product of other matrices. It just computes a projection matrix. But you're right that it decomposes the variance.
I 100% agree with you about the difference between randomized_svd and randomized_pca, hence my suggestion about the naming, and the docstring explaining all this.
```python
    means, _ = mean_variance_axis(A, axis=0)
else:
    means = np.mean(A, axis=0)
c = np.atleast_2d(means)
```
Call this means instead of c.
```diff
@@ -519,7 +528,7 @@ def _fit_truncated(self, X, n_components, svd_solver):
         elif svd_solver == 'randomized':
             # sign flipping is done inside
-            U, S, V = randomized_svd(X, n_components=n_components,
+            U, S, V = randomized_pca(X, n_components=n_components,
```
I think we should play it safe here and only call randomized_pca when X is sparse. Who knows, there might be bugs in randomized_pca, and randomized_svd is probably faster for dense matrices.
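Something like the following dispatch would implement that suggestion (a sketch only; _fit_randomized is a hypothetical helper name, and randomized_pca is the function added in this PR, not an existing sklearn API):

```python
from scipy import sparse
from sklearn.utils.extmath import randomized_svd


def _fit_randomized(X, n_components, random_state):
    # Hypothetical helper: keep the battle-tested randomized_svd for
    # dense input, and only route sparse input through randomized_pca.
    if sparse.issparse(X):
        # randomized_pca (added in this PR) centers X implicitly
        return randomized_pca(X, n_components=n_components,
                              random_state=random_state)
    # Dense input: centering is cheap, so do it explicitly and
    # keep using the existing code path.
    return randomized_svd(X - X.mean(axis=0), n_components=n_components,
                          random_state=random_state)
```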
```python
    means = np.mean(A, axis=0)
c = np.atleast_2d(means)

if n_samples >= n_features:
```
This section

```python
Q = random_state.normal(
    size=(n_features, n_components + n_oversamples)
)
if A.dtype.kind == "f":
    Q = Q.astype(A.dtype, copy=False)
```

should be above the if / else blocks.
You probably put more thought into this than me, but are you sure there's no way to factorize these 2 cases (n_samples >= n_features and n_samples < n_features) into a single one, like in randomized_svd? My hope is that if we can factorize these, then maybe we can extend randomized_svd instead of having a whole new function that does almost the same thing.
The reason I haven't found a good way to refactor the 2 use cases (not for lack of trying) is that the math changes. In one case, we compute the mean of x on axis 0, and in the other case on axis 1. Also, the order of the power iteration operations changes as well. Maybe there's some algebra you could do to rearrange things, but I'm not smart enough for that. I've thought about refactoring it in some way, but anything I come up with requires a whole lot of ifs, which will just make things unreadable.
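For what it's worth, a small numpy sketch of why the two branches differ: in the transposed (n_samples < n_features) case, the mean-correction term attaches to the other side of the product (dense arrays used only for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(5, 12)                    # n_samples < n_features
means = A.mean(axis=0, keepdims=True)  # column means of the original A
Q = rng.rand(5, 3)

# Transposing the centered matrix: (A - 1 @ means).T = A.T - means.T @ 1.T,
# so the correction now involves the column sums of Q rather than the
# product means @ Q used in the n_samples >= n_features branch.
explicit = (A - np.ones((5, 1)) @ means).T @ Q
implicit = A.T @ Q - means.T @ (np.ones((1, 5)) @ Q)

assert np.allclose(explicit, implicit)
```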
```python
Q, _ = linalg.qr(Q, mode="economic")

QA = safe_sparse_dot(A.T, Q) - \
```
Also, please try not to deviate from the randomized_svd names, because this makes comparison between the 2 methods more complicated than it could be. QA.T should be B, R should be Uhat (or Vhat), and the matrix A should be M.
Reviewed tests
```python
pca = PCA(n_components=2, svd_solver='full', random_state=rng).fit(X)
apca = PCA(n_components=2, svd_solver='arpack', random_state=rng).fit(X)
# Increase the number of power iterations to get greater accuracy in tests
rpca = PCA(n_components=2, svd_solver='randomized', iterated_power=40,
```
This whole test_singular_values() test should not be changed IMO.
If you have to set iterated_power=40 for the test to pass now, that's a regression, and that confirms my suggestion: randomized_pca should only be called for sparse matrices. Dense matrices should still use randomized_svd.
Also, in general, please avoid style fixes that are not related to the issue (like spaces between **, etc.). I know you mean well, but it makes it harder to determine which changes are important during the review.
```python
n_samples = 100
n_features = 80

X = rng.binomial(1, 0.1, (n_samples, n_features))
```
nitpick: use size=
```python
X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)
assert(sp.sparse.issparse(X))

pca = PCA(n_components=3, svd_solver=svd_solver)
```
Suggested change:
```diff
-pca = PCA(n_components=3, svd_solver=svd_solver)
+pca = PCA(svd_solver=svd_solver)
```
```python
X = np.random.RandomState(0).rand(5, 4)
X = sp.sparse.csr_matrix(X)

pca = PCA(n_components=3, svd_solver='auto')
```
Suggested change:
```diff
-pca = PCA(n_components=3, svd_solver='auto')
+pca = PCA(svd_solver='auto')
```
```python
@pytest.mark.parametrize('dtype',
                         (np.int32, np.int64, np.float32, np.float64))
def test_randomized_pca_low_rank(dtype):
```
Please consider integrating these 3 tests into the original ones from randomized_svd. I think that parametrizing the tests with either randomized_svd or randomized_pca, and centering X accordingly, could do it pretty easily.
I'm concerned about so much duplicated code.
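Something along these lines could work (a sketch; randomized_pca is the function from this PR and is assumed here to be importable from sklearn.utils.extmath alongside randomized_svd):

```python
import numpy as np
import pytest
from sklearn.utils.extmath import randomized_svd
from sklearn.utils.extmath import randomized_pca  # added by this PR


@pytest.mark.parametrize('use_pca', [False, True])
def test_randomized_low_rank(use_pca):
    rng = np.random.RandomState(42)
    X = rng.rand(50, 20)

    if use_pca:
        # randomized_pca centers X implicitly
        U, s, V = randomized_pca(X, n_components=3, random_state=0)
    else:
        # randomized_svd needs pre-centered input to match PCA semantics
        U, s, V = randomized_svd(X - X.mean(axis=0), n_components=3,
                                 random_state=0)

    # both parametrizations go through the same assertions
    assert s.shape == (3,)
```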
Reviewed tests
I feel like this PR could benefit from using […]
I am wondering what the status of this PR is? It would be great to see it merged. It's almost an embarrassment that scikit does not have randomized PCA for sparse matrices!
@NicolasHug's comments imply that it is awaiting some work from @pavlin-policar. @pavlin-policar, would you rather someone else take over and complete this work?
Sorry about that, I've been quite busy for the past couple of months and, to be quite honest, this wasn't high on my priorities. @NicolasHug suggested several minor things, which would be quick to fix, but the iterated_power regression noted above is worrying and might suggest a bug, or it might not. Maybe this algorithm simply requires more iterations to converge (which seems plausible as well). Either way, I need to look into this a bit more. If somebody would be willing to take over this, that would be great as well. I likely won't have a lot of time to dedicate to this for another month. I think the only major thing left remaining is the iterations issue I outlined above.
From what I can tell, this line will actually form the dense covariance matrix. One way to compute PCA is by computing the eigenvectors of the covariance matrix, but that is generally very inefficient because it involves forming the covariance matrix. Instead, by doing a bit of algebra, we can compute the eigenvectors of the covariance matrix from the SVD of the (centered) data matrix directly, without ever forming it. So I don't think this is applicable here.
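For reference, the algebra alluded to here is standard: the covariance spectrum can be read off the singular values of the centered data. A quick numpy check of this identity:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 8)
Xc = X - X.mean(axis=0)
n = X.shape[0]

# The expensive route: form the n_features x n_features covariance
# matrix and eigendecompose it.
cov = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending order

# The cheap route: SVD of the centered data. Since Xc = U @ S @ V.T,
# cov = V @ (S**2 / (n - 1)) @ V.T, so the spectra coincide.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

assert np.allclose(eigvals, s ** 2 / (n - 1))
```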
@dkobak, there are better ways to indicate your interest in a feature. Taking over this PR would be the best one. Thanks.
@NicolasHug Apologies if my comment came across as snarky; that was not my intention. But I was genuinely surprised that scikit does not have this feature, given that it has randomized SVD implemented and the math to add "implicit" centering is quite straightforward (as Pavlin rightly noticed). Anyway, what I really meant is that it'd be really great to see this implemented. And I think Pavlin is doing a great job here.
#12319 would be a (faster) alternative to 'randomized' for sparse matrices that also allows running matrix-free SVD, i.e. on a matrix given by a function - in particular, performing implicit data centering without changing the data. For that matter, the ARPACK SVD solver is also matrix-free by design. Thus, one should really make the implicit data centering solver-agnostic, allowing any matrix-free SVD solver to be used, rather than concentrating just on 'randomized'.
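To illustrate the solver-agnostic idea: the centered matrix can be wrapped in a scipy LinearOperator that applies the mean correction on the fly, and handed to any matrix-free solver such as ARPACK's svds. A sketch, not this PR's implementation:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(1000, 300, density=0.01, format='csr', random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()

# Represent X - mu as a LinearOperator: matvec/rmatvec apply the mean
# correction on the fly, so the dense centered matrix is never formed.
Xc = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X @ v - (mu @ v) * np.ones(X.shape[0]),
    rmatvec=lambda v: X.T @ v - mu * v.sum(),
    dtype=np.float64,
)

# Any matrix-free SVD solver can now compute the PCA factors, e.g. ARPACK:
U, s, Vt = svds(Xc, k=10)
```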
Nowadays, we have the […]
Reference Issues/PRs
Following the discussion in #12794, I've changed the randomized SVD implementation so it now supports computing PCA as well, without ever having to explicitly center X. This is especially useful for large sparse matrices, which can't be centered.
What does this implement/fix? Explain your changes.
1. Add degrees of freedom (ddof) parameters to sparse functions that compute variance. Error checking has been added for negative ddof, when ddof >= n_samples, or when the data contain NaNs and n_samples - ddof - num_nans <= 0.
2. Implement the randomized PCA algorithm. I was unable to reuse randomized_svd (although the algorithms are conceptually very similar, their code is slightly different). In the case where there are more features than samples, it is more efficient to transpose the matrix and work with that. However, simply transposing isn't ok, because we also need the means from the original columns. This would make randomized_svd quite complicated. Furthermore, randomized_svd is strongly tied to randomized_range_finder, and I would need to add similar changes to both functions, making them quite confusing. For these reasons, I strongly feel adding the similar randomized_pca function is justified.

Any other comments?
The previous behaviour raised a TypeError if a sparse matrix was passed to PCA.fit. The behaviour is now somewhat different. Sparse matrices are supported with svd_solver='randomized'; otherwise, a ValueError is raised indicating the wrong solver. If the solver is set to 'auto', then the 'randomized' solver is always selected for sparse matrices.
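Assuming the behaviour lands as described above, usage would look roughly like this (illustrative sketch only):

```python
import scipy.sparse as sp
from sklearn.decomposition import PCA

X = sp.random(1000, 300, density=0.01, format='csr', random_state=0)

# Sparse input is now accepted by the randomized solver ...
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
pca.fit(X)

# ... and svd_solver='auto' also selects 'randomized' for sparse X.
# Any other explicit solver would raise a ValueError on sparse input.
PCA(n_components=10, svd_solver='auto', random_state=0).fit(X)
```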