10000 PCA supports sparse now, docs suggest otherwise. · Issue #28406 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

PCA supports sparse now, docs suggest otherwise. #28406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
koaning opened this issue Feb 12, 2024 · 4 comments · Fixed by #28498
Closed

PCA supports sparse now, docs suggest otherwise. #28406

koaning opened this issue Feb 12, 2024 · 4 comments · Fixed by #28498

Comments

@koaning
Copy link
koaning commented Feb 12, 2024

Describe the issue linked to the documentation

The latest release notes for 1.4 say the following about PCA.

Feature decomposition.PCA now supports scipy.sparse.sparray and scipy.sparse.spmatrix inputs when using the arpack solver. When used on sparse data like datasets.fetch_20newsgroups_vectorized this can lead to speed-ups of 100x (single threaded) and 70x lower memory usage. Based on Alexander Tarashansky’s implementation in scanpy. #18689 by Isaac Virshup and Andrey Portnoy.

However, once you go to the PCA docs it still says this.

Notice that this class does not support sparse input. See TruncatedSVD for an alternative with sparse data.

Suggest a potential alternative/fix

I guess that one sentences can just be removed now? I can whip up a PR if folks agree.

@koaning koaning added Documentation Needs Triage Issue requires triage labels Feb 12, 2024
@glemaitre
Copy link
Member

Indeed, we should correct the documentation.

Another PR to i 8000 mprove the situation is to change the behaviour of "auto" that leads to:

TypeError: PCA only support sparse inputs with the "arpack" solver, while "auto" was passed. See TruncatedSVD for a possible alternative.

In a new PR, we should instead select the "arpack" solver and do not raise any error.

@lamdang2k
Copy link
Contributor

Indeed, we should correct the documentation.

Another PR to improve the situation is to change the behaviour of "auto" that leads to:

TypeError: PCA only support sparse inputs with the "arpack" solver, while "auto" was passed. See TruncatedSVD for a possible alternative.

In a new PR, we should instead select the "arpack" solver and do not raise any error.

I read the documentation and understand that "auto" will become "randomized" or "full" depending on X.shape and n_components. Should I replace "full" with "arpack" to get rid of error raising?

@Charlie-XIAO
Copy link
Contributor
Charlie-XIAO commented Feb 21, 2024
8000

You may understand better if you look directly at the code.

if self._fit_svd_solver == "auto":
# Small problem or n_components == 'mle', just call full PCA
if max(X.shape) <= 500 or n_components == "mle":
self._fit_svd_solver = "full"
elif 1 <= n_components < 0.8 * min(X.shape):
self._fit_svd_solver = "randomized"
# This is also the case of n_components in (0,1)
else:
self._fit_svd_solver = "full"

My understanding is that if the input is sparse you directly select "arpack"; other things are left unchanged.

By the way you may need to refactor a bit to take care of the n_components as there is a special case for arpack.

@lamdang2k
Copy link
Contributor

You may understand better if you look directly at the code.

if self._fit_svd_solver == "auto":
# Small problem or n_components == 'mle', just call full PCA
if max(X.shape) <= 500 or n_components == "mle":
self._fit_svd_solver = "full"
elif 1 <= n_components < 0.8 * min(X.shape):
self._fit_svd_solver = "randomized"
# This is also the case of n_components in (0,1)
else:
self._fit_svd_solver = "full"

My understanding is that if the input is sparse you directly select "arpack"; other things are left unchanged.

By the way you may need to refactor a bit to take care of the n_components as there is a special case for arpack.

Thanks for your clarification. I will do a PR on this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@koaning @adrinjalali @glemaitre @lamdang2k @Charlie-XIAO and others
0