ENH Allow fitting PCA on sparse X with arpack solvers by ivirshup · Pull Request #18689 · scikit-learn/scikit-learn

ENH Allow fitting PCA on sparse X with arpack solvers #18689


Merged: 54 commits merged into scikit-learn:main on Nov 7, 2023

Conversation

@ivirshup (Contributor) commented Oct 27, 2020

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The current PCA transformer cannot handle sparse input, as mean centering the data would densify it. This PR uses implicit mean centering to allow fitting and transforming sparse data without densifying the whole data matrix.
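For illustration, here is a minimal sketch of the implicit-centering idea (not the exact helper added in this PR): wrap the sparse matrix in a scipy LinearOperator whose matvec/rmatvec subtract the column means on the fly, so the centered matrix is never materialized.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

def implicitly_centered_operator(X):
    # Column means as a dense 1-D array; this is the only dense quantity kept.
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])
    return LinearOperator(
        X.shape,
        matvec=lambda v: X @ v.ravel() - (mu @ v.ravel()) * ones,    # (X - 1 mu^T) v
        rmatvec=lambda v: X.T @ v.ravel() - (ones @ v.ravel()) * mu,  # (X - 1 mu^T)^T v
        dtype=X.dtype,
    )

X = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)
U, s, Vt = svds(implicitly_centered_operator(X), k=10, solver="arpack")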

This can be a huge performance improvement. As an example, I'll compute a PCA on the 20 newsgroups dataset:

Example code (prof_pca.py):

# prof_pca.py
# Note: the bare @profile decorator below is supplied by memory_profiler when
# the script is run via `mprof run`; it does not need to be imported in that case.
import numpy as np
from scipy import sparse

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA


@profile
def implicit_mean_pca(X: sparse.spmatrix):
    pca = PCA(n_components=100, svd_solver="arpack")
    coords = pca.fit_transform(X)
    return pca, coords


@profile
def explicit_mean_pca(X: sparse.spmatrix):
    X = X.toarray()
    pca = PCA(n_components=100, svd_solver="arpack")
    coords = pca.fit_transform(X)
    return pca, coords


if __name__ == "__main__":
    X, _ = fetch_20newsgroups_vectorized(return_X_y=True)

    spca, scoords = implicit_mean_pca(X)

    dpca, dcoords = explicit_mean_pca(X)

    assert np.allclose(spca.components_, dpca.components_)
    assert np.allclose(spca.explained_variance_, dpca.explained_variance_)
    assert np.allclose(spca.singular_values_, dpca.singular_values_)
    assert np.allclose(spca.transform(X), dpca.transform(X))
    assert np.allclose(spca.transform(X.toarray()), dpca.transform(X.toarray()))

Run via mprof run prof_pca.py

[Figure: sparse_pca_mem-usage, memory profile of the implicit (sparse) run vs. the explicit (densified) run]

This takes a fraction of the time and memory.

Any other comments?

This is still a work in progress. A few questions I had:

  • Should implicit mean centering go into sparsefuncs?
  • This makes the flow control in the PCA code more complicated. At what point should this get refactored? Is this out of scope for this PR?
  • What kinds of tests would you like to see?
    • Mind if I just crib these from [MRG] Implement randomized PCA #12841?
    • Ideally most of the tests for PCA could be parameterized by input matrix type, but I think this is out of scope for this PR.

TODO

  • what's new entry
  • test implicit centering linear operator and move to sparsefuncs

Currently, `.fit` and `.fit_transform` work. Need to fix `.transform`.
Base automatically changed from master to main January 22, 2021 10:53
@andportnoy (Contributor)

@ivirshup I'd be interested in taking this over and also adding support for ARPACK/PROPACK/LOBPCG and the randomized SVD. I think all of the above accept a LinearOperator.

@andportnoy (Contributor)

Proof of concept showing that at least the three solvers supported by scipy.sparse.linalg.svds are compatible with LinearOperator (see table at the bottom): https://gist.github.com/andportnoy/03c70436a8b830f90e99ab22640057fb
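For reference, a rough sketch of that kind of compatibility check (assuming a SciPy version whose svds exposes the solver argument, and a SciPy build with PROPACK available for the "propack" case):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import aslinearoperator, svds

X = sparse.random(500, 100, density=0.05, format="csr", random_state=0)
A = aslinearoperator(X)  # stand-in for any LinearOperator, e.g. an implicitly centered one

for solver in ("arpack", "lobpcg", "propack"):
    U, s, Vt = svds(A, k=5, solver=solver)
    print(solver, np.sort(s)[::-1])  # largest singular values first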

@ivirshup (Contributor, Author)

@andportnoy thanks!

I believe the other solvers should work as well. I don't think we looked into their reproducibility though.

I'm happy to hand this over; I was mainly paused on it since I hadn't heard feedback from the sklearn team. There are also a few other PRs open on this repo that implement something similar.

@andportnoy (Contributor)

@ivirshup Cool, I'll start pushing on this. I like the LinearOperator approach because it doesn't involve creating extra special case logic for different solvers.

I don't think we looked into their reproducibility though.

What do you mean by reproducibility?

There are also a few other PRs open on this repo which should implement something similar.

There's #12841, do you know of any others? #12841 has some test cases that could be reused.

@ivirshup (Contributor, Author)

What do you mean by reproducibility?

Basically all the extra little checks we did during the PR to scanpy. This was mostly "is the random state working right", "is memory behaving as expected", etc.

@andportnoy (Contributor)

@ivirshup Do you remember why you only enabled ARPACK here and not the RandomizedSVD? Let me know if the below sounds familiar.

I'm running into an issue with randomized_svd (inside safe_sparse_dot) where a @ b (b is a LinearOperator) fails with the error:

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)

However b.T @ a.T does work.

Seems like a LinearOperator <-> NumPy compatibility issue.

Relevant StackOverflow: https://stackoverflow.com/questions/67434966/why-scipy-sparse-linalg-linearoperator-has-different-behaviors-with-np-dot-an.

Source of NumPy error message: https://github.com/numpy/numpy/blob/b65f0b7b8ba7e80b65773e06aae22a8369678868/numpy/core/src/umath/ufunc_object.c#L1692-L1702.
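A hypothetical minimal reproduction of the mismatch and the workaround mentioned above (exact behavior depends on the NumPy/SciPy versions involved):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import aslinearoperator

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 50))
b = aslinearoperator(sparse.random(50, 20, density=0.1, random_state=0))

# a @ b dispatches through np.matmul, which may not treat the LinearOperator
# as a 2-D operand and can fail with the gufunc error quoted above.
# The transposed product only ever asks the operator to multiply an array
# from the left, which LinearOperator supports:
result = (b.T @ a.T).T  # equivalent to a @ b, shape (10, 20)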

@ogrisel (Member) commented Oct 22, 2022

Seems like a LinearOperator <-> NumPy compatibility issue.

Did you find a related issue or is this a documented limitation in scipy? If not, let's open an issue on the scipy issue tracker to discuss possible improvements in either numpy or scipy. If it cannot be supported, the error message could at least be improved to be more user-friendly.

But computing (b.T @ a.T).T seems like a good workaround.

@ivirshup (Contributor, Author)

@andportnoy sorry for missing this!

I believe it's more that the scipy solvers explicitly support linear operators, while I'm not sure that the randomized solver does.

@andportnoy (Contributor)

@ogrisel I went through all SciPy and NumPy issues that mention LinearOperator but have not found anything directly relevant, so I opened scipy/scipy#17281.

@jjerphan jjerphan self-requested a review May 23, 2023 22:14
@andportnoy (Contributor)

Just in case: I'm working on a superset of this change in #24415. The feature itself is done; it is going through numerical debugging at the moment.

Progress is tracked in the original issue: #12794.

@ivirshup (Contributor, Author)

Cool that a more general thing is moving forward!

If it's a superset, it could be nice to merge the arpack (and possibly lobpcg) parts first to make reviewing easier. I've also got @jjerphan's ear in a sprint at the moment, so I can poke him for a review 😉

@andportnoy (Contributor)

Got it. I'm happy to prioritize arpack/lobpcg. The main issue so far has been numerical correctness in comparison testing. I would really appreciate it if you and/or @jjerphan could take a look at the recent updates in the original issue.

The code I wrote for the LinearOperator wrapper ended up pretty much identical to what you have in this PR. I think it would make sense to incorporate your commits in my branch (by rebasing probably) so that your original push for this feature is not lost.

@ivirshup (Contributor, Author)

@andportnoy, I would want to reuse your tests here to (a) not have to do something different and (b) not conflict with your PR, so I would add you as a co-author on this PR.

@andportnoy (Contributor)

LOBPCG has been the best-behaved solver of the bunch; I would argue #24415 is already mergeable if we restricted it to LOBPCG.

See these plots in particular: #12794 (comment).

@jjerphan (Member)

@andportnoy: I would encourage you to synchronise with @ivirshup regarding the intersection of your two PRs so that you can be co-authors of this one.

For what remains of your contribution, I would also encourage extracting orthogonal changes into dedicated PRs if possible. Thank you!

@andportnoy (Contributor)

@ivirshup How about I prepare a single commit with the tests that you could cherry pick into this PR? I'll pause my work on my own PR until you get your ARPACK changes in, then will continue with the other solvers.

@andportnoy (Contributor)

@ivirshup Try this on your branch:

git remote add -f -t pca-sparse-test andrey git@github.com:andportnoy/scikit-learn
git cherry-pick andrey/pca-sparse-test

Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
@ivirshup (Contributor, Author)

Great! Thanks! I've added this.

I'm going to read up on the accuracy discussion from the github issue, then get back to you.

At the moment, I'm curious if there is a certain matrix size under which it should just be densified to work around the precision problems.

@andportnoy (Contributor)

At the moment, I'm curious if there is a certain matrix size under which it should just be densified to work around the precision problems.

This might be a good idea. At least with one solver I've seen that the more the matrix is "determined" (i.e. as the ratio m/n increases), the higher the accuracy: #12794 (comment).
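Purely as a hypothetical illustration of such a size-based cutoff (not something implemented in this PR; the threshold value is arbitrary):

from scipy import sparse

DENSIFY_MAX_BYTES = 100 * 2**20  # arbitrary 100 MiB cutoff, for illustration only

def maybe_densify(X):
    # Densify small sparse inputs so the dense code path (and its accuracy)
    # is used; keep large inputs sparse and rely on implicit centering.
    if sparse.issparse(X):
        dense_nbytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
        if dense_nbytes <= DENSIFY_MAX_BYTES:
            return X.toarray()
    return X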

@lorentzenchr (Member)

Making CI green should be the highest priority here.

@ivirshup (Contributor, Author) commented Oct 8, 2023

@ogrisel

Should solver="auto" default to "arpack" when the matrix is sparse now?

I think so, although we might also want to add implicit centering support for the svd_solver="randomized" case, no?

Yeah, I think the conservative option here would be to leave "auto" as is for now, see if other alternatives for PCA on sparse data get merged, compare across those solutions, then pick what auto should do for sparse data.


I believe the main things left to address are some remaining tolerance issues and the test time, which I have a question about (#18689 (comment)).

Am I missing other things?
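For context, a usage sketch of the behavior being discussed (assuming the conservative choice above, where "auto" is left unchanged and users opt into the sparse path by passing svd_solver="arpack" explicitly):

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA

X, _ = fetch_20newsgroups_vectorized(return_X_y=True)  # scipy sparse matrix
pca = PCA(n_components=50, svd_solver="arpack")
X_reduced = pca.fit_transform(X)  # fits on sparse X without densifying it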

@jjerphan (Member) commented Nov 1, 2023

Hi @ivirshup,

We discussed type annotations on Monday and agreed that it is probably not worth maintaining them for scikit-learn (I think the consensus is that they would add maintenance cost to keep the input parameters' validation, the documentation, and the type annotations consistent).

Could you remove the ones you have introduced?

Once this is done, I think we can merge this PR.

@ivirshup (Contributor, Author) commented Nov 2, 2023

Could you remove the ones you have introduced?

Ah, I missed the one I had left! Unless there were other annotations I'm not seeing?

@jjerphan (Member) left a comment

LGTM, modulo the resolution of the last open threads.
I cannot see any annotations left.

Thank you for this high-quality contribution, @ivirshup.

@ogrisel (Member) left a comment

It seems that all review comments have been addressed. I reran:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py -n auto -k test_pca_sparse

locally and everything is green.

@ogrisel ogrisel enabled auto-merge (squash) November 7, 2023 09:40
@ogrisel ogrisel merged commit 2d9fa48 into scikit-learn:main Nov 7, 2023
@ogrisel (Member) commented Nov 7, 2023

Merged, thanks for the PR @ivirshup! Looking forward to the follow-up for the other solvers!

@ivirshup (Contributor, Author) commented Nov 7, 2023

Awesome! Super happy to see this in!

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…8689)

Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>