8000 ENH new svd_solver='covariance_eigh' for PCA by ogrisel · Pull Request #27491 · scikit-learn/scikit-learn · GitHub

Merged: 87 commits into scikit-learn:main, Apr 11, 2024

Conversation

ogrisel (Member) commented Sep 28, 2023

Implements the new solver proposed in #27483.

I also took the opportunity of this PR to strengthen the solver equivalence tests.
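For context, the core idea can be sketched in a few lines of numpy. This is a minimal, hypothetical sketch of the strategy, not scikit-learn's actual implementation (which additionally handles sparse input, Array API dispatch, whitening and numerical details): eigendecompose the small (n_features, n_features) covariance matrix instead of running an SVD on the tall data matrix.

```python
import numpy as np

def pca_covariance_eigh(X, n_components):
    """Illustrative covariance-eigh PCA: eigendecompose the small
    (n_features, n_features) covariance matrix instead of computing an
    SVD of the big (n_samples, n_features) data matrix."""
    n_samples = X.shape[0]
    mean = X.mean(axis=0)
    # Gram trick: (X - mean).T @ (X - mean) == X.T @ X - n * outer(mean, mean),
    # so the covariance can be built without materializing centered data.
    cov = (X.T @ X - n_samples * np.outer(mean, mean)) / (n_samples - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first
    components = eigvecs[:, :n_components].T
    explained_variance = eigvals[:n_components]
    return components, explained_variance, mean

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10)) @ rng.normal(size=(10, 10))
components, variance, mean = pca_covariance_eigh(X, n_components=3)
X_transformed = (X - mean) @ components.T
```

This turns the dominant cost into a single `X.T @ X` matmul, which is why the solver shines when n_samples >> n_features.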

Benchmark

Laptop CPU with numpy on macOS (with Apple Accelerate BLAS)

>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=int(1e6), n_features=100).astype(np.float32)
>>> from sklearn.decomposition import PCA
>>> %time PCA(n_components=5, svd_solver="randomized").fit_transform(X)[-1]
CPU times: user 1.67 s, sys: 186 ms, total: 1.85 s
Wall time: 1.1 s
array([ 0.00055511, -0.00047962,  0.0009717 , -0.00128793, -0.00048758],
      dtype=float32)
>>> %time PCA(n_components=5, svd_solver="full").fit_transform(X)[-1]
CPU times: user 11.8 s, sys: 298 ms, total: 12.1 s
Wall time: 3.56 s
array([ 0.00055506, -0.00047963,  0.00097174, -0.00128779, -0.00048765],
      dtype=float32)
>>> %time PCA(n_components=5, svd_solver="covariance_eigh").fit_transform(X)[-1]
CPU times: user 121 ms, sys: 3.3 ms, total: 124 ms
Wall time: 123 ms
array([ 0.00055505, -0.00047963,  0.00097174, -0.0012878 , -0.00048764],
      dtype=float32)

So around a 10x speed-up over the randomized solver (and roughly 30x over the full solver).

Switching to OpenBLAS makes it 2x or so slower than Apple Accelerate but still very good.

Note: to switch the BLAS implementation with conda-forge, there is no need to rebuild scikit-learn or scipy; just run one of the following commands:

mamba install "libblas=*=*accelerate"

or

mamba install "libblas=*=*openblas"
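To double-check which BLAS is actually in use after switching, numpy can report its build configuration (the runtime thread pools can also be inspected with threadpoolctl, which scikit-learn already depends on):

```python
import numpy as np

# Print the BLAS/LAPACK libraries numpy is linked against; after swapping
# the conda-forge libblas build, this should mention accelerate or openblas.
np.show_config()
```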

PyTorch via Array API with MPS device on Apple M1

>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=int(1e6), n_features=100).astype(np.float32)
>>> import torch
>>> X_torch_mps = torch.tensor(X).to("mps")
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> from sklearn.decomposition import PCA
>>> %time PCA(n_components=5, svd_solver="randomized", power_iteration_normalizer="QR").fit_transform(X_torch_mps)[-1]
/Users/ogrisel/code/scikit-learn/sklearn/utils/extmath.py:319: UserWarning: The operator 'aten::linalg_qr.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/miniforge3/conda-bld/pytorch-recipe_1690825982320/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  Q, _ = normalizer(A @ Q)
CPU times: user 3.48 s, sys: 474 ms, total: 3.96 s
Wall time: 3.02 s
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002], device='mps:0')
>>> %time PCA(n_components=5, svd_solver="full").fit_transform(X_torch_mps)[-1]
CPU times: user 11.4 s, sys: 483 ms, total: 11.9 s
Wall time: 3.83 s
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002], device='mps:0')
>>> %time PCA(n_components=5, svd_solver="covariance_eigh").fit_transform(X_torch_mps)[-1]
CPU times: user 11.6 ms, sys: 69.7 ms, total: 81.4 ms
Wall time: 368 ms
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002], device='mps:0')

As the warning says, the MPS backend for PyTorch still lacks some torch.linalg.* kernels and falls back to the CPU, which makes the overall performance a bit slower than just using numpy on the CPU. I can also try using PyTorch directly on the CPU (without the MPS backend):

>>> X_torch_cpu = torch.tensor(X)
>>> %time PCA(n_components=5, svd_solver="randomized", power_iteration_normalizer="QR").fit_transform(X_torch_cpu)[-1]
CPU times: user 3.12 s, sys: 692 ms, total: 3.81 s
Wall time: 1.44 s
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002])
>>> %time PCA(n_components=5, svd_solver="full").fit_transform(X_torch_cpu)[-1]
CPU times: user 11.3 s, sys: 535 ms, total: 11.8 s
Wall time: 3.23 s
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002])
>>> %time PCA(n_components=5, svd_solver="covariance_eigh").fit_transform(X_torch_cpu)[-1]
CPU times: user 198 ms, sys: 18.3 ms, total: 217 ms
Wall time: 112 ms
tensor([ 0.0003, -0.0009,  0.0004, -0.0002, -0.0002])

So we get results similar to numpy on the same host.

Intel Xeon CPU (44 physical cores from 2018 or so) with numpy on Linux (with MKL BLAS)

I get similar (slightly slower) numbers to my macOS laptop for the new solver. Inspecting thread usage with htop (on a 10x larger dataset), I could see that MKL (and OpenBLAS) decide to use only 1 thread for the matrix-matrix multiplication that computes the covariance. This is probably because the matrix is very skinny and concurrent updates by several threads would be detrimental. As a result, a multi-core machine is not very beneficial for this solver on this kind of data (but it is still more than 5x faster than the randomized solver, which does use many cores according to htop, though not 100% of the time).

Switching to OpenBLAS makes it 2x or so slower than MKL but still very good.

Anyway, the performance (with MKL) seems competitive with the scikit-learn-intelex benchmark results we collected a while ago. Note that I cannot reproduce those values with the latest version of scikit-learn-intelex because it seems to use the scikit-learn "randomized" solver instead of its own in the latest version.

PyTorch via Array API with cuda (NVIDIA V100 GPU)

>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=int(1e6), n_features=100).astype(np.float32)
>>> import torch
>>> X_torch_cuda = torch.tensor(X).to("cuda")
>>> import sklearn
>>> sklearn.set_config(array_api_dispatch=True)
>>> from sklearn.decomposition import PCA
>>> %time PCA(n_components=5, svd_solver="randomized", power_iteration_normalizer="QR").fit_transform(X_torch_cuda)[-1]
CPU times: user 63.8 ms, sys: 15.5 ms, total: 79.3 ms
Wall time: 78.3 ms
tensor([ 5.4981e-06,  7.2371e-04,  5.3485e-04, -1.3649e-03, -1.4326e-03],
       device='cuda:0')
>>> %time PCA(n_components=5, svd_solver="full").fit_transform(X_torch_cuda)[-1]
CPU times: user 81.4 ms, sys: 19.7 ms, total: 101 ms
Wall time: 99.8 ms
tensor([ 5.5376e-06,  7.2371e-04,  5.3483e-04, -1.3650e-03, -1.4326e-03],
       device='cuda:0')
>>> %time PCA(n_components=5, svd_solver="covariance_eigh").fit_transform(X_torch_cuda)[-1]
CPU times: user 7.67 ms, sys: 362 µs, total: 8.04 ms
Wall time: 7.21 ms
tensor([ 5.4768e-06,  7.2370e-04,  5.3485e-04, -1.3650e-03, -1.4326e-03],
       device='cuda:0')

So here again we observe a 10x speed-up with the new solver.

We can compare to NVIDIA's cuML from RAPIDS 23.08 on the same data, provided either as a CPU-allocated numpy array or as a GPU-allocated CuPy array:

>>> import numpy as np
>>> from sklearn.datasets import make_low_rank_matrix
>>> X = make_low_rank_matrix(n_samples=int(1e6), n_features=100).astype(np.float32)
>>> from cuml.decomposition import PCA
>>> %time PCA(n_components=5).fit_transform(X)[-1]
CPU times: user 177 ms, sys: 42.5 ms, total: 220 ms
Wall time: 224 ms
array([-0.00171027,  0.00036515, -0.00089102,  0.00108428, -0.00054887],
      dtype=float32)
>>> import cupy
>>> X_cupy = cupy.asarray(X)
>>> %time PCA(n_components=5).fit_transform(X_cupy)[-1]
CPU times: user 25.3 ms, sys: 11.5 ms, total: 36.8 ms
Wall time: 36.8 ms
array([-0.00171027,  0.00036515, -0.00089102,  0.00108428, -0.00054887],
      dtype=float32)

So when the data is pre-allocated on the cuda device, our PyTorch + cuda implementation is 5x faster than cuML's!

PyTorch via Array API on Intel Max GPU

The performance is not good at all, but I suspect that the Intel extension for PyTorch is not ready yet. I opened an issue to track progress upstream.

TODO:

  • add a changelog entry
  • run some benchmarks
  • check that the Array API compliance tests work on cuda, including with the new solver
  • compare the new solver to cuML on GPU
  • fix remaining CI failures
  • improve the tests by also checking consistent output for transform and fit_transform
  • add a new commit to use the new solver when it makes sense by updating the dimension-dependent benchmark scripts (to be discussed next)
  • compare the new solver to scikit-learn-intelex on CPU

Note that I had to change the way we use svd_flip in this estimator to get the same signs for the components across solvers. I think it's better for scikit-learn to be self-consistent (checked in the solver equivalence test) than consistent across versions.
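The sign convention at stake can be illustrated with a small hypothetical helper (mimicking the spirit of sklearn.utils.extmath.svd_flip, not the actual scikit-learn code): make the largest-magnitude entry of each component non-negative so that every solver reports the same, deterministic signs despite the inherent sign ambiguity of eigenvectors and singular vectors.

```python
import numpy as np

def flip_component_signs(components):
    """Hypothetical sign-fixing helper: for each component (row), flip the
    sign so that its largest-magnitude entry becomes non-negative."""
    max_abs_idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), max_abs_idx])
    signs[signs == 0] = 1  # avoid multiplying a zero row by zero
    return components * signs[:, np.newaxis]

comps = np.array([[0.1, -0.9, 0.2],
                  [0.5, 0.3, -0.4]])
flipped = flip_component_signs(comps)
```

With such a convention applied after every solver, components are directly comparable in the equivalence tests regardless of which decomposition routine produced them.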

github-actions bot commented Sep 28, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9842711.

ogrisel (Member, Author) commented Sep 28, 2023

I get a pytest xdist bug on the CI:

_____________________________ ERROR collecting gw0 _____________________________
Different tests were collected between gw1 and gw0. The difference is:
--- gw1

+++ gw0

@@ -4099,12 +4099,12 @@

 decomposition/tests/test_pca.py::test_whitening[randomized-False]
 decomposition/tests/test_pca.py::test_whitening[auto-True]
 decomposition/tests/test_pca.py::test_whitening[auto-False]
+decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-arpack-tall]
+decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-arpack-wide]
 decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-randomized-tall]
 decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-randomized-wide]
 decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-covariance_eigh-tall]
 decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-covariance_eigh-wide]
-decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-arpack-tall]
-decomposition/tests/test_pca.py::test_pca_solver_equivalence[42-arpack-wide]
 decomposition/tests/test_pca.py::test_pca_explained_variance_empirical[full-random-tall]
 decomposition/tests/test_pca.py::test_pca_explained_variance_empirical[full-correlated-tall]
 decomposition/tests/test_pca.py::test_pca_explained_variance_empirical[full-random-wide]
--------- generated xml file: /home/vsts/work/tmp_folder/test-data.xml ---------
=========================== short test summary info ============================
ERROR gw0
================== 1 skipped, 156 warnings, 1 error in 38.74s ==================

I don't remember what the workaround for this is...

EDIT: that was caused by the use of set operations in the parametrization of the test. It's fixed now.

ogrisel marked this pull request as draft on September 28, 2023 at 15:42
ogrisel (Member, Author) commented Sep 28, 2023

The Array API compliance tests pass for the new solver both for PyTorch with MPS on Apple M1 and for PyTorch cuda and CuPy with an NVIDIA GPU!

Will run some quick benchmarks on cuda and post them in the description.

EDIT: done. The performance is very good for cuda. For MPS it's not terrible but not better than the CPU device.

ogrisel (Member, Author) commented Apr 7, 2024

@lorentzenchr I updated the PR to take the latest batch of review comments into account.

> do you prefer the 2nd review approval from me?

Any second review approval will do, when people have time.

ogrisel (Member, Author) commented Apr 7, 2024

I re-ran the benchmark script and the auto strategy is still near-optimally picking the fastest solver as originally reported in #27491 (comment).

glemaitre self-requested a review on April 8, 2024 at 13:04
ogrisel (Member, Author) commented Apr 9, 2024

BTW, it would be interesting to try this new solver with dask in an out-of-core setting where n_samples is very large when #28588 is merged.

lorentzenchr (Member) left a comment:

Review of non-code.

ogrisel and others added 4 commits April 10, 2024 16:52
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
lorentzenchr (Member) left a comment:

LGTM (despite the fact that I'm not that much into PCA 😏)

lorentzenchr (Member) commented Apr 10, 2024

@ogrisel Ping (me/us) when ready to merge.

ogrisel and others added 2 commits April 10, 2024 20:41
ogrisel (Member, Author) commented Apr 10, 2024

> @ogrisel Ping (me/us) when ready to merge.

@lorentzenchr this is ready from my point of view.

ogrisel (Member, Author) commented Apr 10, 2024

I forgot about #27491 (comment). Let me give it a shot.

EDIT: this should be good now.

lorentzenchr (Member) left a comment:

Last review comment 🤞

ogrisel (Member, Author) commented Apr 11, 2024

@lorentzenchr this should be good to go now. For information, I also ran the following experiment to check numerical stability in the n_samples >> n_features > rank setting (compared to the full solver):

In [1]: import numpy as np

In [2]: X_low = np.random.randn(int(1e7), 5)

In [3]: W = np.random.randn(5, 30)

In [4]: X = X_low @ W

In [5]: X += np.random.randn(*X.shape) * 1e-12  # rank 5 up to eps-level noise

In [6]: from sklearn.decomposition import PCA

In [7]: PCA(svd_solver="covariance_eigh").fit(X).explained_variance_
Out[7]: 
array([5.13565104e+01, 4.16859476e+01, 2.90827456e+01, 2.34093563e+01,
       1.54211597e+01, 5.95719954e-14, 5.63161175e-14, 5.02011137e-14,
       4.21787206e-14, 2.90082256e-14, 2.88704010e-14, 1.78452632e-14,
       1.53384409e-14, 1.36259394e-14, 1.14775441e-14, 9.65281151e-15,
       4.52684746e-15, 2.73330102e-15, 2.23056558e-17, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00])

In [8]: PCA(svd_solver="full").fit(X).explained_variance_
Out[8]: 
array([5.13565104e+01, 4.16859476e+01, 2.90827456e+01, 2.34093563e+01,
       1.54211597e+01, 1.00568294e-24, 1.00282867e-24, 1.00269022e-24,
       1.00233412e-24, 1.00208860e-24, 1.00173990e-24, 1.00155317e-24,
       1.00116836e-24, 1.00078966e-24, 1.00051726e-24, 1.00049199e-24,
       1.00025070e-24, 1.00023886e-24, 1.00011534e-24, 9.99640288e-25,
       9.99398032e-25, 9.99344471e-25, 9.99024388e-25, 9.98812971e-25,
       9.98585628e-25, 9.98398337e-25, 9.98292080e-25, 9.97761657e-25,
       9.97621196e-25, 9.97159309e-25])

So I think the new solver is fine: the non-eps-level components of the spectrum are recovered correctly, and the others are at or below eps level for both the covariance_eigh and full solvers.
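The eps-level tail seen for covariance_eigh is expected: forming the Gram/covariance matrix squares the singular values, so anything below roughly sqrt(eps) times the largest singular value drowns in float64 rounding noise. A small numpy illustration of this effect (hypothetical data, not from the PR):

```python
import numpy as np

# Build a 50x3 matrix with known singular values spanning a huge range;
# the smallest (1e-10) lies below sqrt(float64 eps) ~ 1.5e-8.
rng = np.random.default_rng(0)
s = np.array([1.0, 1e-4, 1e-10])
U, _ = np.linalg.qr(rng.normal(size=(50, 3)))
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (U * s) @ V.T

# A direct SVD of X recovers all three singular values accurately.
sv = np.linalg.svd(X, compute_uv=False)

# Eigendecomposing X.T @ X loses the smallest one: its squared value
# (1e-20) sits far below eps times the largest eigenvalue (~2e-16).
evals = np.linalg.eigvalsh(X.T @ X)[::-1]
recovered = np.sqrt(np.clip(evals, 0.0, None))
```

This matches the experiment above: both solvers agree on the genuine part of the spectrum, and only the below-eps tail differs.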

lorentzenchr merged commit 311c6e2 into scikit-learn:main on Apr 11, 2024
30 checks passed
ogrisel deleted the pca-covariance-eigh branch on April 11, 2024 at 16:25

.. versionadded:: 0.18.0

.. versionchanged:: 1.4
A contributor commented:

shouldn't there be 1.5?

A member replied:

Yes. Well spotted. Do you want to open a PR to fix it?

The contributor replied:

Yes I'll do it
