Solve PCA via `np.linalg.eigh(X_centered.T @ X_centered)` instead of `np.linalg.svd(X_centered)` when `X.shape[1]` is small enough. · Issue #27483 · scikit-learn/scikit-learn · GitHub

Closed
ogrisel opened this issue Sep 27, 2023 · 7 comments
Labels: Array API, Enhancement, Performance, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@ogrisel
Member
ogrisel commented Sep 27, 2023

Describe the workflow you want to enable

Assuming that X.shape[0] >> X.shape[1] and X.shape[1] is small enough to materialize the covariance matrix X.T @ X, then using an eigensolver of the covariance matrix is much faster than the SVD of the centered data. See the proof of concept below:

Describe your proposed solution

>>> import numpy as np
>>> X = np.random.randn(int(1e6), 100)
>>> X -= X.mean(axis=0) # does not impact speed but required for PCA
>>> %time U, s, Vt = np.linalg.svd(X, full_matrices=False)
CPU times: user 22.2 s, sys: 1.59 s, total: 23.8 s
Wall time: 8.18 s
>>> %time eigenvals, eigenvecs = np.linalg.eigh(X.T @ X)
CPU times: user 81.1 ms, sys: 172 ms, total: 253 ms
Wall time: 255 ms

That's a 32x speed-up over the "full" solver. But it is also much faster than the truncated randomized solver, even when n_components is quite low:

>>> from sklearn.utils.extmath import randomized_svd
>>> %time U, s, Vt = randomized_svd(X, n_components=2)
CPU times: user 2.76 s, sys: 722 ms, total: 3.48 s
Wall time: 2.45 s
>>> %time U, s, Vt = randomized_svd(X, n_components=5)
CPU times: user 3.68 s, sys: 688 ms, total: 4.36 s
Wall time: 2.67 s
>>> %time U, s, Vt = randomized_svd(X, n_components=10)
CPU times: user 3.55 s, sys: 808 ms, total: 4.35 s
Wall time: 2.81 s
>>> %time U, s, Vt = randomized_svd(X, n_components=50)
CPU times: user 16.3 s, sys: 3.77 s, total: 20.1 s
Wall time: 11.9 s

And the results are the same (up to sign flips, hence the use of np.abs):

>>> np.allclose(s ** 2, eigenvals[::-1])
True
>>> np.allclose(np.abs(eigenvecs[:, ::-1].T), np.abs(Vt))
True

Note that we might need to adapt the svd_flip helper (or even introduce eigen_flip) to do a proper deterministic sign swap for that strategy.
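For illustration, such a helper could look like the following sketch (eigen_flip is a hypothetical name from this issue, not an existing scikit-learn function; the "largest-magnitude entry positive" convention is one possible choice, mirroring what svd_flip does for SVD outputs):

```python
import numpy as np

def eigen_flip(eigenvecs):
    # Hypothetical helper: flip the sign of each eigenvector (column) so
    # that its largest-magnitude entry is positive. This makes the
    # decomposition deterministic regardless of the solver's arbitrary
    # per-vector sign choices.
    max_abs_rows = np.argmax(np.abs(eigenvecs), axis=0)
    signs = np.sign(eigenvecs[max_abs_rows, np.arange(eigenvecs.shape[1])])
    return eigenvecs * signs
```

With this convention, two runs that return eigenvectors with opposite signs produce identical flipped outputs.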

The SVD variant returns U, which is useful for a cheap fit_transform, but when X.shape[0] >> X.shape[1] it is likely that fit_transform() implemented as fit().transform() will still be much faster with eigh on the covariance matrix.
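A minimal sketch of that fit/transform split (hypothetical helper names, not scikit-learn's actual implementation):

```python
import numpy as np

def fit_pca_eigh(X, n_components):
    # Sketch of fit() via eigh on the covariance matrix.
    mean = X.mean(axis=0)
    Xc = X - mean
    eigenvals, eigenvecs = np.linalg.eigh(Xc.T @ Xc)
    # eigh returns eigenvalues in ascending order; PCA wants descending.
    components = eigenvecs[:, ::-1][:, :n_components].T
    explained_variance = eigenvals[::-1][:n_components] / (X.shape[0] - 1)
    return mean, components, explained_variance

def transform_pca_eigh(X, mean, components):
    # transform() only needs the mean and the components, not U.
    return (X - mean) @ components.T
```

On small data this agrees with the SVD route up to sign flips, while only ever decomposing an (n_features, n_features) matrix.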

Assuming we agree to add this strategy to scikit-learn, we have two options:

  • introduce a new svd_solver="covariance_eigh" option, even if we don't actually perform an SVD anymore,
  • alternatively we could make the svd_solver="full" option smart and use eigh on covariance whenever X.shape[0] >> X.shape[1].

In either case we might want to adjust the logic of svd_solver="auto" to use this whenever it makes sense (some benchmarking by adapting benchmarks/bench_kernel_pca_solvers_time_vs_n_components.py and benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py will be needed).

Additional context

It might also be possible to do that for implicitly centered sparse data via a linear operator (as done in #18689) to form the covariance matrix (which is likely to be nearly dense when X.shape[0] >> X.shape[1]).
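The implicit centering relies on the identity (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu), which never requires densifying X itself. A sketch (hypothetical helper, not the linear-operator machinery of #18689):

```python
import numpy as np
from scipy import sparse

def implicitly_centered_covariance(X):
    # Sketch: the Gram matrix of the centered data without densifying X,
    # using (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu).
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0)).ravel()
    gram = X.T @ X
    if sparse.issparse(gram):
        gram = gram.toarray()  # nearly dense anyway when n >> n_features
    return gram - n * np.outer(mu, mu)
```

Only the (n_features, n_features) result is ever dense, which is exactly the regime this issue targets.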

Also note that this should work with the Array API (see array_api.linalg.eigh), and assuming array-api-compat provides a dask adaptor, this would be a very efficient way to do out-of-core (and even distributed) PCA on very large datasets (assuming X.shape[1] is small enough). It might be even more efficient than IncrementalPCA.
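The out-of-core appeal comes from the fact that the Gram matrix and column sums are additive over row chunks, so only (n_features, n_features) state needs to stay in memory. A plain-NumPy sketch of that pattern (hypothetical helper; a dask version would distribute the per-chunk work):

```python
import numpy as np

def pca_covariance_out_of_core(chunks):
    # Sketch: stream row chunks through, keeping only the Gram matrix and
    # column sums in memory, then eigendecompose at the end.
    gram, col_sum, n = None, None, 0
    for chunk in chunks:
        if gram is None:
            gram = np.zeros((chunk.shape[1], chunk.shape[1]))
            col_sum = np.zeros(chunk.shape[1])
        gram += chunk.T @ chunk
        col_sum += chunk.sum(axis=0)
        n += chunk.shape[0]
    mu = col_sum / n
    # (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu)
    cov = gram - n * np.outer(mu, mu)
    eigenvals, eigenvecs = np.linalg.eigh(cov)
    return eigenvals[::-1], eigenvecs[:, ::-1], mu
```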

@ogrisel ogrisel added Enhancement Performance help wanted Moderate Anything that requires some knowledge of conventions and best practices labels Sep 27, 2023
@ogrisel
Member Author
ogrisel commented Sep 27, 2023

I now realize that this is very likely the reason why scikit-learn-intelex (and maybe also cuML) are so much faster than scikit-learn at PCA for n_samples >> n_features.

Here is a screenshot of the profiling trace of this benchmark in the https://speedscope.app interactive viewer:

[screenshot: profiling trace of scikit-learn's PCA]

vs the trace for scikit-learn-intelex:

[screenshot: profiling trace of scikit-learn-intelex's PCA]

@Micky774
Contributor

This sounds good. It has also been applied in FastICA. In that case, to resolve sign differences we ended up using

if self.sign_flip:
    u *= np.sign(u[0])  # make each column's first entry non-negative

A nice benefit is that the memory savings are potentially even more dramatic than the compute savings.

As far as API is concerned, we resolved to add a separate solver='eigh' option that requires opt-in, and moved to include an 'auto' option which would switch between 'eigh' and 'svd' based on a heuristic (n_samples>50*n_features) in a follow-up PR (which I should probably refresh and bring up-to-date). I think a similar approach in allowing for explicit specification, i.e. your svd_solver="covariance_eigh", and updating svd_solver="auto" would be the best approach here.
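The kind of dispatch being described could be sketched as follows (the 50x threshold is the FastICA heuristic mentioned above and the return values are hypothetical names; PCA may well settle on a different rule):

```python
def choose_pca_solver(n_samples, n_features):
    # Sketch of an 'auto' heuristic: prefer the covariance eigensolver
    # when the data is much taller than it is wide.
    if n_samples > 50 * n_features:
        return "covariance_eigh"
    return "full_svd"
```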

@ogrisel
Member Author
ogrisel commented Sep 28, 2023

FYI, I am giving it a shot.

@lorentzenchr
Member

Just note that this doubles the condition number and is therefore less numerically stable (e.g. on matrices X with a large range of singular values). This could be mentioned in the docs.
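A small illustration of the effect (the numbers are illustrative constructions, not from this issue): forming X.T @ X squares the condition number, so roughly half of float64's ~16 significant digits are lost on the smallest eigenvalues.

```python
import numpy as np

# Build X with condition number ~1e6 by prescribing its singular values.
rng = np.random.default_rng(0)
sigmas = np.array([1.0, 1e-3, 1e-6])
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X = U @ np.diag(sigmas) @ V.T

kappa_X = np.linalg.cond(X)           # ~1e6
kappa_gram = np.linalg.cond(X.T @ X)  # ~1e12: the condition number squared
```

With a condition number above ~1e8, the squared spectrum starts pushing the smallest eigenvalues toward float64's rounding floor, which is the instability being warned about.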

@ogrisel
Member Author
ogrisel commented Sep 29, 2023

Just note that this doubles the condition number and is therefore less numerically stable (e.g. on matrices X with a large range of singular values). This could be mentioned in the docs.

Thanks, I'll do that. Let's move the discussion to the PR.

@Micky774 I implemented an update to the "auto" heuristic that seems to be quite robust at picking the most efficient solver. See the benchmark results in the PR: #27491 (comment)

@Micky774
Contributor
Micky774 commented Sep 29, 2023

@Micky774 I implemented an update to the "auto" heuristic that seems to be quite robust at picking the most efficient solver. See the benchmark results in the PR: #27491 (comment)

Thanks for taking the time to develop a good heuristic for this! I expect we can probably use your heuristic in this PR as well; I'll update and test that soon. Admittedly, I was struggling to find a good heuristic that didn't oversimplify, but it looks like you've landed on a good one :D

@ogrisel
Member Author
ogrisel commented May 16, 2024

Implemented in #27491.

@ogrisel ogrisel closed this as completed May 16, 2024