Solve PCA via np.linalg.eigh(X_centered.T @ X_centered) instead of np.linalg.svd(X_centered) when X.shape[1] is small enough. #27483
Comments
I now realize that this is very likely the reason why […]. Here is a screenshot of the profiling trace of this benchmark in the https://speedscope.app interactive viewer, vs the trace for scikit-learn-intelex (screenshots not reproduced in this copy).
This sounds good. It has also been applied in […]:

```python
if self.sign_flip:
    u *= np.sign(u[0])
```

A nice benefit is that the memory savings are potentially even more dramatic than the compute savings. As far as API is concerned, we resolved to add a separate […].
FYI, I am giving it a shot.
Just note that forming `X.T @ X` squares the condition number (doubling it on a log scale) and is therefore less numerically stable, e.g. on matrices `X` with a large range of singular values. This could be mentioned in the docs.
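For concreteness, here is a quick numerical check of that caveat (an editorial sketch, not part of the original thread): the condition number of `X.T @ X` is the square of the condition number of `X`, so an input that is merely ill-conditioned can become catastrophically so once the covariance matrix is formed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a tall matrix with a wide range of singular values (cond ~ 1e6).
n_samples, n_features = 1_000, 10
U, _ = np.linalg.qr(rng.normal(size=(n_samples, n_features)))
V, _ = np.linalg.qr(rng.normal(size=(n_features, n_features)))
X = U @ np.diag(np.logspace(0, -6, n_features)) @ V.T

# Forming the Gram/covariance matrix squares the condition number.
print(f"cond(X)       = {np.linalg.cond(X):.2e}")       # ~1e6
print(f"cond(X.T @ X) = {np.linalg.cond(X.T @ X):.2e}")  # ~1e12
```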
Thanks, I'll do that. Let's move the discussion to the PR. @Micky774 I implemented an update to the "auto" heuristic that seems to be quite robust at picking the most efficient solver. See the benchmark results in the PR: #27491 (comment)
Thanks for taking the time to develop a good heuristic for this! I expect we can probably use your heuristic in this PR as well. I'll update and test that soon. Admittedly I was struggling to find a good heuristic that didn't oversimplify, but it looks like you've landed on a good one :D
Implemented in #27491.
Describe the workflow you want to enable
Assuming that `X.shape[0] >> X.shape[1]` and `X.shape[1]` is small enough to materialize the covariance matrix `X.T @ X`, using an eigensolver on the covariance matrix is much faster than the SVD of the centered data. See the proof of concept below.

Describe your proposed solution
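The original proof-of-concept snippet and its timing output are not reproduced in this copy of the issue. The following is a minimal sketch of the idea, with assumed array shapes and variable names; the speed-up figures quoted next refer to the original benchmark, not to this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100_000, 100  # X.shape[0] >> X.shape[1]
X = rng.normal(size=(n_samples, n_features))
X_centered = X - X.mean(axis=0)

# Baseline: PCA via SVD of the centered data (what svd_solver="full" does).
_, S_svd, Vt_svd = np.linalg.svd(X_centered, full_matrices=False)

# Proposed: eigendecomposition of the small (n_features, n_features) covariance.
cov = X_centered.T @ X_centered
eigvals, eigvecs = np.linalg.eigh(cov)
# eigh returns eigenvalues in ascending order; flip to match the SVD convention.
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
S_eigh = np.sqrt(np.clip(eigvals, 0.0, None))  # singular values of X_centered
Vt_eigh = eigvecs.T                            # principal axes as rows

# Same results up to sign flips of the components, hence the np.abs.
np.testing.assert_allclose(S_svd, S_eigh, rtol=1e-6)
np.testing.assert_allclose(np.abs(Vt_svd), np.abs(Vt_eigh), atol=1e-6)
```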
That's a 32x speed-up of the `"full"` solver. But it's also much faster than the truncated randomized solver, even when `n_components` is quite low. And the results are the same (up to sign flips, hence the use of `np.abs`).

Note that we might need to adapt the `svd_flip` helper (or even introduce an `eigen_flip` helper) to do a proper deterministic sign swap for that strategy; a possible shape for such a helper is sketched below.
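A hypothetical `eigen_flip` could, for instance, adopt the same convention as `sklearn.utils.extmath.svd_flip` (largest-magnitude entry of each component made positive). This is only an illustrative sketch, not the helper that was ultimately adopted:

```python
import numpy as np

def eigen_flip(eigvecs):
    """Sketch of a deterministic sign convention for eigenvectors.

    Each eigenvector (column) is flipped so that its entry of largest
    absolute value is positive, mirroring sklearn.utils.extmath.svd_flip.
    """
    max_abs_rows = np.argmax(np.abs(eigvecs), axis=0)
    signs = np.sign(eigvecs[max_abs_rows, np.arange(eigvecs.shape[1])])
    signs[signs == 0] = 1.0  # never multiply a column by zero
    return eigvecs * signs
```

Whatever convention is chosen, it has to be applied to the components before they are stored so that `fit_transform()` and `fit().transform()` produce consistent signs.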
The SVD variant returns `U`, which is useful for a cheap `fit_transform`, but when `X.shape[0] >> X.shape[1]` and `X.shape[1]` is small, it is likely that `fit_transform()` implemented as `fit().transform()` will still be much faster with `eigh` on the covariance matrix.

Assuming we agree to add this strategy to scikit-learn, we have two options:
- add a new `svd_solver="covariance_eigh"` option, even if we don't actually perform an SVD anymore (see the usage sketch after this list), or
- make the existing `svd_solver="full"` option smart and use `eigh` on the covariance whenever `X.shape[0] >> X.shape[1]`.
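For illustration, the first option would look like this from the user's point of view (assuming a scikit-learn version where #27491 has landed and the proposed solver name is available):

```python
from sklearn.datasets import make_low_rank_matrix
from sklearn.decomposition import PCA

# Tall-and-skinny data: many samples, few features.
X = make_low_rank_matrix(n_samples=100_000, n_features=50, random_state=0)

pca = PCA(n_components=10, svd_solver="covariance_eigh")
X_transformed = pca.fit(X).transform(X)
print(X_transformed.shape)  # (100000, 10)
```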
In either case we might want to adjust the logic of `svd_solver="auto"` to use this strategy whenever it makes sense (some benchmarking by adapting `benchmarks/bench_kernel_pca_solvers_time_vs_n_components.py` and `benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py` will be needed).

Additional context
It might also be possible to do that for implicitly centered sparse data via a linear operator (as done in #18689) to form the covariance matrix (which is likely to be nearly dense when `X.shape[0] >> X.shape[1]`); a sketch of the implicit-centering trick is shown below.
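A minimal sketch of that idea (not the implementation from #18689): thanks to the identity `(X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu)`, the covariance matrix of the centered data can be formed from a sparse `X` without ever densifying it:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X_sparse = sp.random(100_000, 50, density=0.01, format="csr", random_state=0)

n_samples, n_features = X_sparse.shape
mean = np.asarray(X_sparse.mean(axis=0)).ravel()  # dense, shape (n_features,)

# Covariance of the centered data without materializing a dense centered X.
cov = (X_sparse.T @ X_sparse).toarray() - n_samples * np.outer(mean, mean)

eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1][:10]  # keep the top 10 components
components = eigvecs[:, order].T        # shape (n_components, n_features)

# Transform with implicit centering: (X - mean) @ components.T
X_transformed = X_sparse @ components.T - mean @ components.T
```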
Also note that this should work with the Array API (see `array_api.linalg.eigh`), and assuming `array-api-compat` provides a dask adaptor, this would be a very efficient way to do out-of-core (and even distributed) PCA on very large datasets (assuming `X.shape[1]` is small enough). It might be even more efficient than `IncrementalPCA`.
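As a plain-NumPy illustration of why this maps well to out-of-core or distributed execution (only a sketch of the blockwise accumulation idea, not an actual dask or Array API implementation): the mean and the Gram matrix can be accumulated one chunk of rows at a time, so only `(n_features, n_features)` arrays ever need to live in memory.

```python
import numpy as np

def chunked_covariance_pca(chunks, n_components):
    """Single pass over row chunks to get the mean and principal axes."""
    n_samples, sum_x, gram = 0, None, None
    for block in chunks:
        block = np.asarray(block, dtype=np.float64)
        n_samples += block.shape[0]
        sum_x = block.sum(axis=0) if sum_x is None else sum_x + block.sum(axis=0)
        gram = block.T @ block if gram is None else gram + block.T @ block

    mean = sum_x / n_samples
    # Covariance of the centered data: X.T @ X - n * outer(mean, mean).
    cov = gram - n_samples * np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T

# In-memory chunks standing in for blocks streamed from disk or a cluster.
rng = np.random.default_rng(0)
blocks = (rng.normal(size=(10_000, 20)) for _ in range(5))
mean, components = chunked_covariance_pca(blocks, n_components=3)
print(components.shape)  # (3, 20)
```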