Description
Describe the workflow you want to enable
Assuming that `X.shape[0] >> X.shape[1]` and that `X.shape[1]` is small enough to materialize the covariance matrix `X.T @ X`, using an eigensolver on the covariance matrix is much faster than the SVD of the centered data. See the proof of concept below:
Describe your proposed solution
```python
>>> import numpy as np
>>> X = np.random.randn(int(1e6), 100)
>>> X -= X.mean(axis=0)  # does not impact speed but required for PCA
>>> %time U, s, Vt = np.linalg.svd(X, full_matrices=False)
CPU times: user 22.2 s, sys: 1.59 s, total: 23.8 s
Wall time: 8.18 s
>>> %time eigenvals, eigenvecs = np.linalg.eigh(X.T @ X)
CPU times: user 81.1 ms, sys: 172 ms, total: 253 ms
Wall time: 255 ms
```
That's a 32x speed-up over the `"full"` solver. But it's also much faster than the truncated randomized solver, even when `n_components` is quite low:
```python
>>> from sklearn.utils.extmath import randomized_svd
>>> %time U, s, Vt = randomized_svd(X, n_components=2)
CPU times: user 2.76 s, sys: 722 ms, total: 3.48 s
Wall time: 2.45 s
>>> %time U, s, Vt = randomized_svd(X, n_components=5)
CPU times: user 3.68 s, sys: 688 ms, total: 4.36 s
Wall time: 2.67 s
>>> %time U, s, Vt = randomized_svd(X, n_components=10)
CPU times: user 3.55 s, sys: 808 ms, total: 4.35 s
Wall time: 2.81 s
>>> %time U, s, Vt = randomized_svd(X, n_components=50)
CPU times: user 16.3 s, sys: 3.77 s, total: 20.1 s
Wall time: 11.9 s
```
And the results are the same (up to sign flips, hence the use of `np.abs`):
```python
>>> np.allclose(s ** 2, eigenvals[::-1])
True
>>> np.allclose(np.abs(eigenvecs[:, ::-1].T), np.abs(Vt))
True
```
Note that we might need to adapt the `svd_flip` helper (or even introduce `eigen_flip`) to do a proper deterministic sign swap for that strategy.
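As a minimal sketch, a hypothetical `eigen_flip` helper could mirror the convention of `svd_flip` and make the largest-magnitude entry of each eigenvector positive (the name and the exact convention are assumptions, not a settled design):

```python
import numpy as np


def eigen_flip(eigenvecs):
    """Hypothetical sign-flip helper for the columns of ``eigenvecs``.

    Flips each eigenvector so that its entry with the largest absolute
    value is positive, making the decomposition deterministic.
    """
    # Row index of the largest-magnitude entry in each column (eigenvector).
    max_abs_rows = np.argmax(np.abs(eigenvecs), axis=0)
    signs = np.sign(eigenvecs[max_abs_rows, np.arange(eigenvecs.shape[1])])
    return eigenvecs * signs
```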
The SVD variant returns `U`, which is useful for a cheap `fit_transform`, but when `X.shape[0] >> X.shape[1]` it is likely that `fit_transform()` implemented as `fit().transform()` will still be much faster with `eigh` on the covariance matrix.
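For illustration, here is a rough NumPy sketch of a covariance-eigh fit followed by a plain projection transform (the variable names and the `n_samples - 1` variance scaling are simplifications for this sketch, not the actual scikit-learn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 50))
n_samples, n_features = X.shape
n_components = 10

# "fit": eigendecomposition of the covariance matrix of the centered data
mean = X.mean(axis=0)
X_centered = X - mean
eigenvals, eigenvecs = np.linalg.eigh(X_centered.T @ X_centered)

# eigh returns ascending eigenvalues: reverse to get the leading components
eigenvals = eigenvals[::-1][:n_components]
components = eigenvecs[:, ::-1][:, :n_components].T  # (n_components, n_features)
explained_variance = eigenvals / (n_samples - 1)

# "transform": a simple projection, no SVD of the full data matrix needed
X_transformed = (X - mean) @ components.T
```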
Assuming we agree to add this strategy to scikit-learn, we have two options:

- introduce a new `svd_solver="covariance_eigh"` option, even if we don't actually perform an SVD anymore,
- alternatively we could make the `svd_solver="full"` option smart and use `eigh` on the covariance matrix whenever `X.shape[0] >> X.shape[1]`.
In either case we might want to adjust the logic of `svd_solver="auto"` to use this whenever it makes sense (some benchmarking by adapting `benchmarks/bench_kernel_pca_solvers_time_vs_n_components.py` and `benchmarks/bench_kernel_pca_solvers_time_vs_n_samples.py` will be needed).
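For reference, a hedged sketch of what such a dispatch rule could look like (the `"covariance_eigh"` name and all threshold values are placeholders pending the benchmarks above, not decided values):

```python
def choose_solver(n_samples, n_features, n_components, svd_solver="auto"):
    # Hypothetical dispatch rule: the thresholds below are placeholders
    # that would have to be calibrated with the adapted benchmarks.
    if svd_solver != "auto":
        return svd_solver
    if n_features <= 1_000 and n_samples >= 10 * n_features:
        # tall-and-skinny data: eigendecomposition of the covariance matrix
        return "covariance_eigh"
    if n_components < 0.8 * min(n_samples, n_features):
        return "randomized"
    return "full"
```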
Additional context
It might also be possible to do that for implicitly centered sparse data via a linear operator (as done in #18689) to form the covariance matrix (which is likely to be nearly dense when `X.shape[0] >> X.shape[1]`).
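A small sketch of the idea, relying on the identity `(X - mu).T @ (X - mu) == X.T @ X - n * outer(mu, mu)` so that only the (n_features, n_features) result ever gets densified (names and shapes here are illustrative only, not the linear-operator machinery of #18689):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(100_000, 100, density=0.01, format="csr", random_state=0)
n_samples = X.shape[0]

mean = np.asarray(X.mean(axis=0)).ravel()

# Gram matrix of the implicitly centered data: X stays sparse throughout,
# only the small (n_features, n_features) covariance is dense.
cov = (X.T @ X).toarray() - n_samples * np.outer(mean, mean)

eigenvals, eigenvecs = np.linalg.eigh(cov)
```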
Also note that this should work with the Array API (see `array_api.linalg.eigh`), and assuming `array-api-compat` provides a dask adaptor, this would be a very efficient way to do out-of-core (and even distributed) PCA on very large datasets (assuming `X.shape[1]` is small enough). It might be even more efficient than `IncrementalPCA`.
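As a rough illustration of why this maps well to out-of-core processing: the Gram matrix and the column sums can be accumulated chunk by chunk, so only the final (n_features, n_features) eigendecomposition has to happen in memory (this is a plain NumPy sketch, not an actual dask / Array API implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 100

# Accumulate X.T @ X and the column sums one chunk at a time.
gram = np.zeros((n_features, n_features))
col_sums = np.zeros(n_features)
n_samples = 0
for _ in range(10):  # stand-in for streaming chunks from disk
    chunk = rng.standard_normal((10_000, n_features))
    gram += chunk.T @ chunk
    col_sums += chunk.sum(axis=0)
    n_samples += chunk.shape[0]

mean = col_sums / n_samples
cov = gram - n_samples * np.outer(mean, mean)
eigenvals, eigenvecs = np.linalg.eigh(cov)
```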