ENH Allow fitting PCA on sparse X with arpack solvers #18689
Conversation
Currently, `.fit` and `.fit_transform` work. Need to fix `.transform`.
@ivirshup I'd be interested in taking this over and also adding support for ARPACK/PROPACK/LOBPCG and the randomized SVD. I think all of the above accept a `LinearOperator`.
Proof of concept showing that at least the three solvers supported by `scipy.sparse.linalg.svds` work:
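The proof of concept itself is not reproduced in this thread. A minimal sketch of the idea, assuming the implicit centering is expressed as a `scipy.sparse.linalg.LinearOperator` (variable names and sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(100, 20, density=0.1, format="csr", random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()  # column means

# The centered matrix X - 1 @ mu.T as an operator, never materialized:
Xc = LinearOperator(
    shape=X.shape,
    matvec=lambda v: X @ v - mu @ v,           # (X - 1 mu^T) v
    rmatvec=lambda u: X.T @ u - u.sum() * mu,  # (X - 1 mu^T)^T u
    dtype=X.dtype,
)

# All three svds backends accept a LinearOperator
# (the 'propack' backend needs SciPy >= 1.8):
for solver in ("arpack", "lobpcg", "propack"):
    U, s, Vt = svds(Xc, k=5, solver=solver)
    print(solver, np.sort(s)[::-1])
```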
@andportnoy thanks! I believe the other solvers should work as well; I don't think we looked into their reproducibility though. I'm happy to hand this over, but I mainly paused on it since I hadn't heard feedback from the sklearn team. There are also a few other PRs open on this repo which aim to implement something similar.
@ivirshup Cool, I'll start pushing on this. I like the …
What do you mean by reproducibility?
There's #12841, do you know of any others? #12841 has some test cases that could be reused.
Basically all the extra little checks we did during the PR to scanpy. This was mostly "is the random state working right", "is memory behaving as expected", etc.
@ivirshup Do you remember why you only enabled ARPACK here and not the RandomizedSVD? Let me know if the below sounds familiar. I'm running into an issue with passing the `LinearOperator` to `randomized_svd`.
However, it fails. Seems like a limitation of how NumPy dispatches on `LinearOperator` arguments. Relevant StackOverflow: https://stackoverflow.com/questions/67434966/why-scipy-sparse-linalg-linearoperator-has-different-behaviors-with-np-dot-an. Source of the NumPy error message: https://github.com/numpy/numpy/blob/b65f0b7b8ba7e80b65773e06aae22a8369678868/numpy/core/src/umath/ufunc_object.c#L1692-L1702.
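A minimal reproduction of what the dispatch problem presumably looks like (inferred from the linked StackOverflow question, not quoted from this thread):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import aslinearoperator

A = aslinearoperator(sp.random(4, 3, density=0.5, random_state=0))
v = np.ones(3)

print(A @ v)     # matrix-vector product via the operator: works
print(A.dot(v))  # also works
# np.dot does not defer to the operator's matmul; it treats A as an
# opaque object, yielding an object-dtype result or a ufunc TypeError
# rather than the matrix-vector product:
print(np.dot(A, v))
```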
Did you find a related issue, or is this a documented limitation in scipy? If not, let's open an issue on the scipy issue tracker to discuss possible improvements in either numpy or scipy. If it cannot be supported, the error message could at least be improved to be more user-friendly. But computing …
@andportnoy sorry for missing this! I believe it's more that the scipy solvers explicitly support linear operators, while I'm not sure that the randomized solver does.
@ogrisel I went through all SciPy and NumPy issues that mention `LinearOperator` and didn't find anything directly related.
Cool that a more general thing is moving forward! If it's a superset, I'd suggest that it could be nice to merge the …
Got it. I'm happy to prioritize arpack/lobpcg. The main issue so far has been numerical correctness in comparison testing. I would really appreciate it if you and/or @jjerphan could take a look at the recent updates in the original issue. The code I wrote for the `LinearOperator` wrapper ended up pretty much identical to what you have in this PR. I think it would make sense to incorporate your commits into my branch (probably by rebasing) so that your original push for this feature is not lost.
@andportnoy, I would want to reuse your tests here to (a) not have to do something different and (b) not conflict with your PR, so I would add you as a co-author of this PR.
LOBPCG has been the best-behaved solver of the bunch; I would argue #24415 is already mergeable if we restricted it to LOBPCG. See these plots in particular: #12794 (comment).
@andportnoy: I would encourage you to synchronise with @ivirshup regarding the intersection of your two PRs, so that you are co-authors of this one. For what remains of your contribution, I would also encourage extracting orthogonal changes into dedicated PRs if possible. Thank you!
@ivirshup How about I prepare a single commit with the tests, which you could cherry-pick into this PR? I'll pause my work on my own PR until you get your ARPACK changes in, then continue with the other solvers.
@ivirshup Try this on your branch:

    git remote add -f -t pca-sparse-test andrey git@github.com:andportnoy/scikit-learn
    git cherry-pick andrey/pca-sparse-test
Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
Great, thanks! I've added this. I'm going to read up on the accuracy discussion from the GitHub issue, then get back to you. At the moment, I'm curious whether there is a certain matrix size under which the input should just be densified to work around the precision problems.
This might be a good idea. At least with one solver, I've seen that the more the matrix is "determined" (i.e. as the ratio …
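For illustration only, such a size-based fallback might look like the sketch below; the helper and the threshold are hypothetical, not part of this PR:

```python
import scipy.sparse as sp

# Hypothetical guard: densify small inputs, where the memory savings of
# staying sparse are negligible and the dense code path sidesteps the
# implicit centering's precision issues. The threshold is a made-up
# placeholder, not a tuned value.
MAX_DENSE_ELEMENTS = 1_000_000

def maybe_densify(X):
    if sp.issparse(X) and X.shape[0] * X.shape[1] <= MAX_DENSE_ELEMENTS:
        return X.toarray()
    return X
```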
Making CI green should have highest priority here. |
Yeah, I think the conservative option here would be to leave … I believe the main thing left to address is some remaining tolerance issues and the test time, which I have a question about (#18689 (comment)). Am I missing other things?
Hi @ivirshup, we discussed type annotations on Monday and agreed that it is probably not worth maintaining them for scikit-learn (I think the consensus is that keeping the input parameters' validation, the documentation, and the type annotations consistent would add maintenance cost). Could you remove the ones you have introduced? Once this is done, I think we can merge this PR.
Ah, I missed the one I had left! Unless there were other annotations I'm not seeing? |
LGTM, modulo the resolution of the last open threads.
I cannot see any annotation left.
Thank you for this qualitative contribution, @ivirshup.
It seems that all review comments have been addressed. I reran:
    SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest sklearn/decomposition/tests/test_pca.py -n auto -k test_pca_sparse
locally and everything is green.
Merged, thanks for the PR @ivirshup! Looking forward to the follow-up for the other solvers!
Awesome! Super happy to see this in!
ENH Allow fitting PCA on sparse X with arpack solvers (#18689)
Co-authored-by: Andrey Portnoy <aportnoy@fastmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
What does this implement/fix? Explain your changes.
The current `PCA` transformer cannot handle sparse input, as mean centering the data would densify it. This PR uses implicit mean centering to allow fitting and transforming sparse data without densifying the whole data matrix. This can be a huge performance improvement. As an example, I'll compute a PCA on `20newsgroups`:
example_code
Run via `mprof run prof_pca.py`.
This takes a fraction of the time and memory.
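The collapsed `example_code` script above is not reproduced in this thread; a minimal sketch of what such a benchmark might look like (the dataset loader, component count, and solver choice are assumptions):

```python
# prof_pca.py (hypothetical reconstruction) -- profile with:
#   mprof run prof_pca.py
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA

X, _ = fetch_20newsgroups_vectorized(return_X_y=True)  # sparse CSR matrix

# With this PR, fit accepts sparse X directly for the arpack solver;
# previously it required a dense array.
pca = PCA(n_components=50, svd_solver="arpack")
pca.fit(X)
```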
Any other comments?
This is still a work in progress. A few questions I had:

- `sparsefuncs` …
TODO