Solve PCA via `np.linalg.eigh(X_centered.T @ X_centered)` instead of `np.linalg.svd(X_centered)` when `X.shape[1]` is small enough. · Issue #27483 · scikit-learn/scikit-learn · GitHub

Closed
ogrisel opened this issue Sep 27, 2023 · 7 comments
Labels: Array API, Enhancement, Performance, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@ogrisel
Member
ogrisel commented Sep 27, 2023

Describe the workflow you want to enable

Assuming that X.shape[0] >> X.shape[1] and X.shape[1] is small enough to materialize the covariance matrix X.T @ X, then using an eigensolver of the covariance matrix is much faster than the SVD of the centered data. See the proof of concept below:

Describe your proposed solution

>>> import numpy as np
>>> X = np.random.randn(int(1e6), 100)
>>> X -= X.mean(axis=0) # does not impact speed but required for PCA
>>> %time U, s, Vt = np.linalg.svd(X, full_matrices=False)
CPU times: user 22.2 s, sys: 1.59 s, total: 23.8 s
Wall time: 8.18 s
>>> %time eigenvals, eigenvecs = np.linalg.eigh(X.T @ X)
CPU times: user 81.1 ms, sys: 172 ms, total: 253 ms
Wall time: 255 ms

That's a 32x speed-up over the "full" solver. But it is also much faster than the truncated randomized solver, even when n_components is quite low:

>>> from sklearn.utils.extmath import randomized_svd
>>> %time U, s, Vt = randomized_svd(X, n_components=2)
CPU times: user 2.76 s, sys: 722 ms, total: 3.48 s
Wall time: 2.45 s
>>> %time U, s, Vt = randomized_svd(X, n_components=5)
CPU times: user 3.68 s, sys: 688 ms, total: 4.36 s
Wall time: 2.67 s
>>> %time U, s, Vt = randomized_svd(X, n_components=10)
CPU times: user 3.55 s, sys: 808 ms, total: 4.35 s
Wall time: 2.81 s
>>> %time U, s, Vt = randomized_svd(X, n_components=50)
CPU times: user 16.3 s, sys: 3.77 s, total: 20.1 s
Wall time: 11.9 s

And the results are the same (up to sign flips, hence the use of np.abs):

>>> np.allclose(s ** 2, eigenvals[::-1])
True
>>> np.allclose(np.abs(eigenvecs[:, ::-1].T), np.abs(Vt))
True

Note that we might need to adapt the svd_flip helper (or even introduce eigen_flip) to do a proper deterministic sign swap for that strategy.
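For illustration, such a helper could look like the following sketch (eigen_flip is a hypothetical name from this issue, not an existing scikit-learn function; the "largest-magnitude entry positive" convention is one possible choice, mirroring what svd_flip does for SVD outputs):

```python
import numpy as np

def eigen_flip(eigenvecs):
    # Hypothetical helper: flip the sign of each eigenvector (column) so
    # that its largest-magnitude entry is positive. This makes the
    # decomposition deterministic regardless of the solver's arbitrary
    # per-vector sign choices.
    max_abs_rows = np.argmax(np.abs(eigenvecs), axis=0)
    signs = np.sign(eigenvecs[max_abs_rows, np.arange(eigenvecs.shape[1])])
    return eigenvecs * signs
```

With this convention, two runs that return eigenvectors with opposite signs produce identical flipped outputs.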

The SVD variant returns U, which is useful for a cheap fit_transform, but when X.shape[0] >> X.shape[1] it is likely that fit_transform() implemented as fit().transform() will still be much faster with eigh on the covariance matrix.
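A minimal sketch of that fit/transform split (hypothetical helper names, not scikit-learn's actual implementation):

```python
import numpy as np

def fit_pca_eigh(X, n_components):
    # Sketch of fit() via eigh on the covariance matrix.
    mean = X.mean(axis=0)
    Xc = X - mean
    eigenvals, eigenvecs = np.linalg.eigh(Xc.T @ Xc)
    # eigh returns eigenvalues in ascending order; PCA wants descending.
    components = eigenvecs[:, ::-1][:, :n_components].T
    explained_variance = eigenvals[::-1][:n_components] / (X.shape[0] - 1)
    return mean, components, explained_variance

def transform_pca_eigh(X, mean, components):
    # transform() only needs the mean and the components, not U.
    return (X - mean) @ components.T
```

On small data this agrees with the SVD route up to sign flips, while only ever decomposing an (n_features, n_features) matrix.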

Assuming we agree to add this strategy to scikit-learn, we have two options:

  • introduce a new svd_solver="covariance_eigh" option, even if we don't actually perform an SVD anymore,
  • alternatively we could make the svd_solver="full" option smart and use eigh on covariance whenever X.shape[0] >> X.shape[1].

In either case we might want to adjust the logic of svd_solver="auto" to use this whenever it makes sense (some benchmarking by adapting benchmarks/bench_kernel_pca_solvers_time_vs_n_components.py and benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py will be needed).

Additional context

It might also be possible to do that for implicitly centered sparse data via a linear operator (as done in #18689) to form the covariance matrix (which is likely to be nearly dense when X.shape[0] >> X.shape[1]).
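The implicit centering relies on the identity (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu), which never requires densifying X itself. A sketch (hypothetical helper, not the linear-operator machinery of #18689):

```python
import numpy as np
from scipy import sparse

def implicitly_centered_covariance(X):
    # Sketch: the Gram matrix of the centered data without densifying X,
    # using (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu).
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0)).ravel()
    gram = X.T @ X
    if sparse.issparse(gram):
        gram = gram.toarray()  # nearly dense anyway when n >> n_features
    return gram - n * np.outer(mu, mu)
```

Only the (n_features, n_features) result is ever dense, which is exactly the regime this issue targets.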

Also note that this should work with the Array API (see array_api.linalg.eigh), and assuming array-api-compat provides a dask adaptor, this would be a very efficient way to do out-of-core (and even distributed) PCA on very large datasets (assuming X.shape[1] is small enough). It might be even more efficient than IncrementalPCA.
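The out-of-core appeal comes from the fact that the Gram matrix and column sums are additive over row chunks, so only (n_features, n_features) state needs to stay in memory. A plain-NumPy sketch of that pattern (hypothetical helper; a dask version would distribute the per-chunk work):

```python
import numpy as np

def pca_covariance_out_of_core(chunks):
    # Sketch: stream row chunks through, keeping only the Gram matrix and
    # column sums in memory, then eigendecompose at the end.
    gram, col_sum, n = None, None, 0
    for chunk in chunks:
        if gram is None:
            gram = np.zeros((chunk.shape[1], chunk.shape[1]))
            col_sum = np.zeros(chunk.shape[1])
        gram += chunk.T @ chunk
        col_sum += chunk.sum(axis=0)
        n += chunk.shape[0]
    mu = col_sum / n
    # (X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu)
    cov = gram - n * np.outer(mu, mu)
    eigenvals, eigenvecs = np.linalg.eigh(cov)
    return eigenvals[::-1], eigenvecs[:, ::-1], mu
```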

@ogrisel ogrisel added Enhancement Performance help wanted Moderate Anything that requires some knowledge of conventions and best practices labels Sep 27, 2023
@ogrisel
Member Author
ogrisel commented Sep 27, 2023

I now realize that this is very likely the reason why scikit-learn-intelex (and maybe also cuML) are so much faster than scikit-learn at PCA for n_samples >> n_features.

Here is a screenshot of the profiling trace of this benchmark in the https://speedscope.app interactive viewer:

[screenshot: profiling trace of scikit-learn's PCA]

vs the trace for scikit-learn-intelex:

[screenshot: profiling trace of scikit-learn-intelex's PCA]

@Micky774
Contributor

This sounds good. It has also been applied in FastICA. In that case, to resolve sign differences we ended up using

if self.sign_flip:
    u *= np.sign(u[0])  # make each column's first entry non-negative

A nice benefit is that the memory savings are potentially even more dramatic than the compute savings.

As far as API is concerned, we resolved to add a separate solver='eigh' option that requires opt-in, and moved to include an 'auto' option which would switch between 'eigh' and 'svd' based on a heuristic (n_samples>50*n_features) in a follow-up PR (which I should probably refresh and bring up-to-date). I think a similar approach in allowing for explicit specification, i.e. your svd_solver="covariance_eigh", and updating svd_solver="auto" would be the best approach here.
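The kind of dispatch being described could be sketched as follows (the 50x threshold is the FastICA heuristic mentioned above and the return values are hypothetical names; PCA may well settle on a different rule):

```python
def choose_pca_solver(n_samples, n_features):
    # Sketch of an 'auto' heuristic: prefer the covariance eigensolver
    # when the data is much taller than it is wide.
    if n_samples > 50 * n_features:
        return "covariance_eigh"
    return "full_svd"
```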

@ogrisel
Member Author
ogrisel commented Sep 28, 2023

FYI, I am giving it a shot.

@lorentzenchr
Member

Just note that this doubles the condition number and is therefore less numerically stable (e.g. on matrices X with a large range of singular values). This could be mentioned in the docs.
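A small illustration of the effect (the numbers are illustrative constructions, not from this issue): forming X.T @ X squares the condition number, so roughly half of float64's ~16 significant digits are lost on the smallest eigenvalues.

```python
import numpy as np

# Build X with condition number ~1e6 by prescribing its singular values.
rng = np.random.default_rng(0)
sigmas = np.array([1.0, 1e-3, 1e-6])
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
X = U @ np.diag(sigmas) @ V.T

kappa_X = np.linalg.cond(X)           # ~1e6
kappa_gram = np.linalg.cond(X.T @ X)  # ~1e12: the condition number squared
```

With a condition number above ~1e8, the squared spectrum starts pushing the smallest eigenvalues toward float64's rounding floor, which is the instability being warned about.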

@ogrisel
Member Author
ogrisel commented Sep 29, 2023

Just note that this doubles the condition number and is therefore less numerically stable (e.g. on matrices X with a large range of singular values). This could be mentioned in the docs.

Thanks, I'll do that. Let's move the discussion to the PR.

@Micky774 I implemented an update to the "auto" heuristic that seems to be quite robust at picking the most efficient solver. See the benchmark results in the PR: #27491 (comment)

@Micky774
Contributor
Micky774 commented Sep 29, 2023

@Micky774 I implemented an update to the "auto" heuristic that seems to be quite robust at picking the most efficient solver. See the benchmark results in the PR: #27491 (comment)

Thanks for taking the time to develop a good heuristic for this! I expect we can probably use your heuristic in this PR as well; I'll update and test that soon. Admittedly, I was struggling to find a good heuristic that didn't oversimplify, but it looks like you've landed on a good one :D

@ogrisel
Member Author
ogrisel commented May 16, 2024

Implemented in #27491.

@ogrisel ogrisel closed this as completed May 16, 2024