ENH Allow fitting PCA on sparse X with arpack solvers by ivirshup · Pull Request #18689 · scikit-learn/scikit-learn

ENH Allow fitting PCA on sparse X with arpack solvers #18689


Merged: 54 commits merged into scikit-learn:main on Nov 7, 2023

Conversation

@ivirshup (Contributor) commented Oct 27, 2020

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The current PCA transformer cannot handle sparse input, as mean centering the data would densify it. This PR uses implicit mean centering to allow fitting and transforming sparse data without densifying the whole data matrix.
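For illustration, here is a minimal sketch of the implicit-centering idea (not the exact helper added in this PR): wrap the sparse matrix in a scipy LinearOperator whose matvec/rmatvec subtract the column means on the fly, so the centered matrix is never materialized.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

def implicitly_centered_operator(X):
    # Column means as a dense 1-D array; this is the only dense quantity kept.
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])
    return LinearOperator(
        X.shape,
        matvec=lambda v: X @ v.ravel() - (mu @ v.ravel()) * ones,    # (X - 1 mu^T) v
        rmatvec=lambda v: X.T @ v.ravel() - (ones @ v.ravel()) * mu,  # (X - 1 mu^T)^T v
        dtype=X.dtype,
    )

X = sparse.random(1000, 200, density=0.01, format="csr", random_state=0)
U, s, Vt = svds(implicitly_centered_operator(X), k=10, solver="arpack")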

This can be a huge performance improvement. As an example, I'll compute a PCA on the 20 newsgroups dataset:

Example code (prof_pca.py):

# prof_pca.py
# Note: the bare @profile decorator below is supplied by memory_profiler when
# the script is run via `mprof run`; it does not need to be imported in that case.
import numpy as np
from scipy import sparse

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA


@profile
def implicit_mean_pca(X: sparse.spmatrix):
    pca = PCA(n_components=100, svd_solver="arpack")
    coords = pca.fit_transform(X)
    return pca, coords


@profile
def explicit_mean_pca(X: sparse.spmatrix):
    X = X.toarray()
    pca = PCA(n_components=100, svd_solver="arpack")
    coords = pca.fit_transform(X)
    return pca, coords


if __name__ == "__main__":
    X, _ = fetch_20newsgroups_vectorized(return_X_y=True)

    spca, scoords = implicit_mean_pca(X)

    dpca, dcoords = explicit_mean_pca(X)

    assert np.allclose(spca.components_, dpca.components_)
    assert np.allclose(spca.explained_variance_, dpca.explained_variance_)
    assert np.allclose(spca.singular_values_, dpca.singular_values_)
    assert np.allclose(spca.transform(X), dpca.transform(X))
    assert np.allclose(spca.transform(X.toarray()), dpca.transform(X.toarray()))

Run via mprof run prof_pca.py

[Figure: sparse_pca_mem-usage, memory profile of the implicit (sparse) run vs. the explicit (densified) run]

This takes a fraction of the time and memory.

Any other comments?

This is still a work in progress. A few questions I had:

  • Should implicit mean centering go into sparsefuncs?
  • This makes the flow control in the PCA code more complicated. At what point should this get refactored? Is this out of scope for this PR?
  • What kinds of tests would you like to see?
    • Mind if I just crib these from [MRG] Implement randomized PCA #12841?
    • Ideally most of the tests for PCA could be parameterized by input matrix type, but I think this is out of scope for this PR.

TODO

  • what's new entry
  • test implicit centering linear operator and move to sparsefuncs

Currently, `.fit` and `.fit_transform` work. Need to fix `.transform`.
Base automatically changed from master to main January 22, 2021 10:53
@andportnoy (Contributor)

@ivirshup I'd be interested in taking this over and also adding support for ARPACK/PROPACK/LOBPCG and the randomized SVD. I think all of the above accept a LinearOperator.

@andportnoy (Contributor)

Proof of concept showing that at least the three solvers supported by scipy.sparse.linalg.svds are compatible with LinearOperator (see table at the bottom): https://gist.github.com/andportnoy/03c70436a8b830f90e99ab22640057fb
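For reference, a rough sketch of that kind of compatibility check (assuming a SciPy version whose svds exposes the solver argument, and a SciPy build with PROPACK available for the "propack" case):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import aslinearoperator, svds

X = sparse.random(500, 100, density=0.05, format="csr", random_state=0)
A = aslinearoperator(X)  # stand-in for any LinearOperator, e.g. an implicitly centered one

for solver in ("arpack", "lobpcg", "propack"):
    U, s, Vt = svds(A, k=5, solver=solver)
    print(solver, np.sort(s)[::-1])  # largest singular values first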

@ivirshup (Contributor, Author)

@andportnoy thanks!

I believe the other solvers should work as well. I don't think we looked into their reproducibility though.

I'm happy to hand this over; I was mainly paused on it since I hadn't heard feedback from the sklearn team. There are also a few other PRs open on this repo that implement something similar.

@andportnoy (Contributor)

@ivirshup Cool, I'll start pushing on this. I like the LinearOperator approach because it doesn't involve creating extra special case logic for different solvers.

I don't think we looked into their reproducibility though.

What do you mean by reproducibility?

There are also a few other PRs open on this repo which should implement something similar.

There's #12841, do you know of any others? #12841 has some test cases that could be reused.

@ivirshup (Contributor, Author)

What do you mean by reproducibility?

Basically all the extra little checks we did during the PR to scanpy. This was mostly "is the random state working right", "is memory behaving as expected", etc.

@andportnoy (Contributor)

@ivirshup Do you remember why you only enabled ARPACK here and not the RandomizedSVD? Let me know if the below sounds familiar.

I'm running into an issue with randomized_svd (inside safe_sparse_dot) where a @ b (b is a LinearOperator) fails with the error:

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)

However b.T @ a.T does work.

Seems like a LinearOperator <-> NumPy compatibility issue.

Relevant StackOverflow: https://stackoverflow.com/questions/67434966/why-scipy-sparse-linalg-linearoperator-has-different-behaviors-with-np-dot-an.

Source of NumPy error message: https://github.com/numpy/numpy/blob/b65f0b7b8ba7e80b65773e06aae22a8369678868/numpy/core/src/umath/ufunc_object.c#L1692-L1702.
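A hypothetical minimal reproduction of the mismatch and the workaround mentioned above (exact behavior depends on the NumPy/SciPy versions involved):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import aslinearoperator

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 50))
b = aslinearoperator(sparse.random(50, 20, density=0.1, random_state=0))

# a @ b dispatches through np.matmul, which may not treat the LinearOperator
# as a 2-D operand and can fail with the gufunc error quoted above.
# The transposed product only ever asks the operator to multiply an array
# from the left, which LinearOperator supports:
result = (b.T @ a.T).T  # equivalent to a @ b, shape (10, 20)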

@ogrisel (Member) commented Oct 22, 2022

Seems like a LinearOperator <-> NumPy compatibility issue.

Did you find a related issue or is this a documented limitation in scipy? If not, let's open an issue on the scipy issue tracker to discuss possible improvements in either numpy or scipy. If it cannot be supported, the error message could at least be improved to be more user-friendly.

But computing (b.T @ a.T).T seems like a good workaround.

@ivirshup (Contributor, Author)

@andportnoy sorry for missing this!

I believe it's more that the scipy solvers explicitly support linear operators, while I'm not sure that the randomized solver does.

@andportnoy (Contributor)

@ogrisel I went through all SciPy and NumPy issues that mention LinearOperator but have not found anything directly relevant, so I opened scipy/scipy#17281.

@jjerphan jjerphan self-requested a review May 23, 2023 22:14
@andportnoy (Contributor)

Just in case: I'm working on a superset of this change in #24415. The feature itself is done; it is going through numerical debugging at the moment.

Progress is tracked in the original issue: #12794.

@ivirshup (Contributor, Author)

Cool that a more general thing is moving forward!

If it's a superset, it could be nice to merge the arpack (and possibly lobpcg) parts first to make reviewing easier. I've also got @jjerphan's ear in a sprint at the moment, so I can poke him for a review 😉

@andportnoy (Contributor)

Got it. I'm happy to prioritize arpack/lobpcg. The main issue so far has been numerical correctness in comparison testing. I would really appreciate it if you and/or @jjerphan could take a look at the recent updates in the original issue.

The code I wrote for the LinearOperator wrapper ended up pretty much identical to what you have in this PR. I think it would make sense to incorporate your commits in my branch (by rebasing probably) so that your original push for this feature is not lost.

@ivirshup (Contributor, Author)

@andportnoy, I would want to reuse your tests here to (a) not have to do something different and (b) not conflict with your PR, so I would add you as a co-author on this PR.

@andportnoy (Contributor)

LOBPCG has been the best-behaved solver of the bunch; I would argue #24415 is already mergeable if we restricted it to LOBPCG.

See these plots in particular: #12794 (comment).

@jjerphan (Member)

@andportnoy: I would encourage you to synchronise with @ivirshup regarding the intersection of your two PRs so that you can be co-authors of this one.

For what remains of your contribution, I would also encourage extracting orthogonal changes into dedicated PRs if possible. Thank you!

@andportnoy (Contributor)

@ivirshup How about I prepare a single commit with the tests that you could cherry pick into this PR? I'll pause my work on my own PR until you get your ARPACK changes in, then will continue with the other solvers.

@andportnoy (Contributor)

@ivirshup Try this on your branch:

git remote add -f -t pca-sparse-test andrey git@github.com:andportnoy/scikit-learn
git cherry-pick andrey/pca-sparse-test

Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
@ivirshup (Contributor, Author)

Great! Thanks! I've added this.

I'm going to read up on the accuracy discussion from the github issue, then get back to you.

At the moment, I'm curious if there is a certain matrix size under which it should just be densified to work around the precision problems.

@andportnoy (Contributor)

At the moment, I'm curious if there is a certain matrix size under which it should just be densified to work around the precision problems.

This might be a good idea. At least with one solver I've seen that the more the matrix is "determined" (i.e. as the ratio m/n increases), the higher the accuracy: #12794 (comment).
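Purely as a hypothetical illustration of such a size-based cutoff (not something implemented in this PR; the threshold value is arbitrary):

from scipy import sparse

DENSIFY_MAX_BYTES = 100 * 2**20  # arbitrary 100 MiB cutoff, for illustration only

def maybe_densify(X):
    # Densify small sparse inputs so the dense code path (and its accuracy)
    # is used; keep large inputs sparse and rely on implicit centering.
    if sparse.issparse(X):
        dense_nbytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
        if dense_nbytes <= DENSIFY_MAX_BYTES:
            return X.toarray()
    return X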

@lorentzenchr (Member)

Making CI green should be the highest priority here.

@ivirshup (Contributor, Author) commented Oct 8, 2023

@ogrisel

Should solver="auto" default to "arpack" when the matrix is sparse now?

I think so, although we might also want to add implicit centering support for the svd_solver="randomized" case, no?

Yeah, I think the conservative option here would be to leave "auto" as is for now, see if other alternatives for PCA on sparse data get merged, compare across those solutions, then pick what auto should do for sparse data.


I believe the main things left to address are some remaining tolerance issues and the test time, which I have a question about (#18689 (comment)).

Am I missing other things?
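For context, a usage sketch of the behavior being discussed (assuming the conservative choice above, where "auto" is left unchanged and users opt into the sparse path by passing svd_solver="arpack" explicitly):

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA

X, _ = fetch_20newsgroups_vectorized(return_X_y=True)  # scipy sparse matrix
pca = PCA(n_components=50, svd_solver="arpack")
X_reduced = pca.fit_transform(X)  # fits on sparse X without densifying it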

@jjerphan (Member) commented Nov 1, 2023

Hi @ivirshup,

We discussed type annotations on Monday and agreed that it is probably not worth maintaining them for scikit-learn (I think the consensus is that they would add maintenance cost to keep the input parameters' validation, the documentation, and the type annotations consistent).

Could you remove the ones you have introduced?

Once this is done, I think we can merge this PR.

@ivirshup (Contributor, Author) commented Nov 2, 2023

Could you remove the ones you have introduced?

Ah, I missed the one I had left! Unless there were other annotations I'm not seeing?

@jjerphan (Member) left a comment

LGTM, modulo the resolution of the last open threads.
I cannot see any annotations left.

Thank you for this high-quality contribution, @ivirshup.

@ogrisel (Member) left a comment

It seems that all review comments have been addressed. I reran:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py -n auto -k test_pca_sparse

locally and everything is green.

@ogrisel ogrisel enabled auto-merge (squash) November 7, 2023 09:40
@ogrisel ogrisel merged commit 2d9fa48 into scikit-learn:main Nov 7, 2023
@ogrisel (Member) commented Nov 7, 2023

Merged, thanks for the PR @ivirshup! Looking forward to the follow-up for the other solvers!

@ivirshup (Contributor, Author) commented Nov 7, 2023

Awesome! Super happy to see this in!

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…8689)

Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>