ENH Allow fitting PCA on sparse X with arpack solvers by ivirshup · Pull Request #18689 · scikit-learn/scikit-learn

ENH Allow fitting PCA on sparse X with arpack solvers #18689


Merged
54 commits from ivirshup's pca-sparse branch, merged into main on Nov 7, 2023
Commits (54)
df6232d
Initial commit for PCA on sparse data
ivirshup Oct 27, 2020
14a8954
Allow pca.transform(spmatrix)
ivirshup Oct 27, 2020
ed87334
cleaner implicit centering
ivirshup Mar 26, 2021
ce97d14
Merge branch 'main' into pca-sparse-merge
ivirshup May 23, 2023
f5158a1
Integrate tests from Andrey
ivirshup May 24, 2023
88492fb
Clean up total var, pass tests
ivirshup May 24, 2023
fd4733b
Merge branch 'main' into pca-sparse
ivirshup Aug 18, 2023
6cbeca4
Merge branch 'main' into pca-sparse
ivirshup Aug 18, 2023
b3b1f5c
remove merge artifact
ivirshup Aug 18, 2023
74d7afe
Document test variables
ivirshup Aug 18, 2023
602e3bf
Address comments on scaling and tolerance for the test
ivirshup Aug 18, 2023
0b0d4cd
Document _implicitly_center
ivirshup Aug 18, 2023
f51f6c5
Add tests for erroring on unsupported solver
ivirshup Aug 21, 2023
f731195
Fix rmatmat
ivirshup Aug 21, 2023
80d9541
Make tolerance 1e-7
ivirshup Aug 22, 2023
1b69b48
release note
ivirshup Aug 25, 2023
9143651
Test implicit center
ivirshup Aug 25, 2023
ae0ea16
Merge branch 'main' into pca-sparse
ivirshup Aug 25, 2023
bcfe316
Array tests
ivirshup Aug 25, 2023
5c9627b
Rename _implict_column_offset
ivirshup Aug 25, 2023
858f437
scipy 1.11 compat, https://github.com/scipy/scipy/issues/19134
ivirshup Aug 25, 2023
0db8d54
Finishing touches
ivirshup Aug 25, 2023
a664999
Update release note
ivirshup Aug 25, 2023
757ed37
Merge branch 'main' into pca-sparse
ogrisel Sep 25, 2023
5329041
use toarray
ivirshup Oct 1, 2023
67a40dd
Expanded equality testing (apply comment from Julien)
ivirshup Oct 1, 2023
236478e
Make sure fit, fit_transform, transform are equivalent for pca on spa…
ivirshup Oct 1, 2023
5a15364
More informative error message for supported solvers for sparse PCA
ivirshup Oct 1, 2023
af56492
Remove skipped test case
ivirshup Oct 1, 2023
ce5e8d4
Update usage of random data in implicit centering tests
ivirshup Oct 1, 2023
0ce6d50
Test both matrix multiplication API and LinearOperator API in implici…
ivirshup Oct 1, 2023
a201440
Update implicit_offset docs
ivirshup Oct 1, 2023
05df49c
Remove type annotations for _implicit_column_offset, move to docs
ivirshup Oct 1, 2023
bfa4faa
Remove .conj
ivirshup Oct 1, 2023
9cbd76b
matrix_class -> sparse_container
ivirshup Oct 1, 2023
7c8bece
Update PCA doc strings to indicate sparse arrays now accepted
ivirshup Oct 1, 2023
0ed4d69
Update docs + examples to not recommend against PCA for sparse data
ivirshup Oct 1, 2023
7fe38e8
Add slightly higher tolerance for transforming
ivirshup Oct 1, 2023
e374fe2
Merge branch 'main' into pca-sparse
ivirshup Oct 1, 2023
1bfad54
Merge branch 'main' into pca-sparse
ivirshup Oct 8, 2023
3ca40e2
Remove deprecated attribute access
ivirshup Oct 8, 2023
f10cc33
Fix example usage
ivirshup Oct 8, 2023
0d91e1c
move import
ivirshup Oct 8, 2023
4bcac1f
remove annotations import
ivirshup Oct 8, 2023
e6daeec
Address suggestions to test_pca
ivirshup Oct 8, 2023
bf81430
made change to variance calculation
ivirshup Oct 8, 2023
e920da0
fix linting on test_sparsefuncs.py
ivirshup Oct 8, 2023
c7eda6a
speed ups -> speed-ups
ivirshup Oct 8, 2023
f9c5329
BasePCA.transform docs updated to include support for sparse input
ivirshup Oct 8, 2023
e416879
Remove type hint
ivirshup Oct 8, 2023
a519798
Add TODO to variance calculation for PCA
ivirshup Oct 8, 2023
a647d8d
Dial in tolerances
ivirshup Oct 8, 2023
1ccb9a8
Remove last type annotation
ivirshup Nov 2, 2023
5bf6134
Merge branch 'main' into pca-sparse
ogrisel Nov 7, 2023
6 changes: 1 addition & 5 deletions doc/modules/decomposition.rst
@@ -74,7 +74,7 @@ out-of-core Principal Component Analysis either by:
* Using its ``partial_fit`` method on chunks of data fetched sequentially
from the local hard drive or a network database.

* Calling its fit method on a sparse matrix or a memory mapped file using
* Calling its fit method on a memory mapped file using
``numpy.memmap``.

:class:`IncrementalPCA` only stores estimates of component and noise variances,
@@ -420,10 +420,6 @@ in that the matrix :math:`X` does not need to be centered.
When the columnwise (per-feature) means of :math:`X`
are subtracted from the feature values,
truncated SVD on the resulting matrix is equivalent to PCA.
In practical terms, this means
that the :class:`TruncatedSVD` transformer accepts ``scipy.sparse``
matrices without the need to densify them,
as densifying may fill up memory even for medium-sized document collections.

While the :class:`TruncatedSVD` transformer
works with any feature matrix,
2 changes: 1 addition & 1 deletion doc/modules/manifold.rst
@@ -644,7 +644,7 @@ Barnes-Hut method improves on the exact method where t-SNE complexity is
or less. The 2D case is typical when building visualizations.
* Barnes-Hut only works with dense input data. Sparse data matrices can only be
embedded with the exact method or can be approximated by a dense low rank
projection for instance using :class:`~sklearn.decomposition.TruncatedSVD`
projection for instance using :class:`~sklearn.decomposition.PCA`
* Barnes-Hut is an approximation of the exact method. The approximation is
parameterized with the angle parameter, therefore the angle parameter is
unused when method="exact"
7 changes: 7 additions & 0 deletions doc/whats_new/v1.4.rst
@@ -271,6 +271,13 @@ Changelog
:pr:`26315` and :pr:`27098` by :user:`Mateusz Sokół <mtsokol>`,
:user:`Olivier Grisel <ogrisel>` and :user:`Edoardo Abati <EdAbati>`.

- |Feature| :class:`decomposition.PCA` now supports :class:`scipy.sparse.sparray`
and :class:`scipy.sparse.spmatrix` inputs when using the `arpack` solver.
When used on sparse data like :func:`datasets.fetch_20newsgroups_vectorized`, this
can lead to speed-ups of 100x (single threaded) and 70x lower memory usage.
Based on :user:`Alexander Tarashansky <atarashansky>`'s implementation in
`scanpy <https://github.com/scverse/scanpy>`_.
:pr:`18689` by :user:`Isaac Virshup <ivirshup>` and :user:`Andrey Portnoy <andportnoy>`.

:mod:`sklearn.ensemble`
.......................

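To make the sparse-PCA release note above concrete, here is a minimal usage sketch (editorial, not part of the diff). It assumes only the public API described in the note; the dataset choice is illustrative.

# Minimal sketch: fit PCA directly on sparse input with the "arpack" solver.
# Before this PR, PCA.fit raised a TypeError on scipy.sparse inputs.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import PCA

X, _ = fetch_20newsgroups_vectorized(return_X_y=True)  # scipy.sparse CSR
pca = PCA(n_components=50, svd_solver="arpack")
X_reduced = pca.fit_transform(X)  # X is never densified
print(X_reduced.shape)                      # (n_samples, 50)
print(pca.explained_variance_ratio_.sum())  # variance captured by 50 components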
4 changes: 2 additions & 2 deletions examples/compose/plot_column_transformer.py
@@ -26,7 +26,7 @@

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
@@ -141,7 +141,7 @@ def text_stats(posts):
Pipeline(
[
("tfidf", TfidfVectorizer()),
("best", TruncatedSVD(n_components=50)),
("best", PCA(n_components=50, svd_solver="arpack")),
]
),
1,
13 changes: 10 additions & 3 deletions sklearn/decomposition/_base.py
@@ -12,9 +12,11 @@

import numpy as np
from scipy import linalg
from scipy.sparse import issparse

from ..base import BaseEstimator, ClassNamePrefixFeaturesOutMixin, TransformerMixin
from ..utils._array_api import _add_to_diagonal, device, get_namespace
from ..utils.sparsefuncs import _implicit_column_offset
from ..utils.validation import check_is_fitted


@@ -126,7 +128,7 @@ def transform(self, X):

Parameters
----------
X : array-like of shape (n_samples, n_features)
X : {array-like, sparse matrix} of shape (n_samples, n_features)
New data, where `n_samples` is the number of samples
and `n_features` is the number of features.

@@ -140,9 +142,14 @@

check_is_fitted(self)

X = self._validate_data(X, dtype=[xp.float64, xp.float32], reset=False)
X = self._validate_data(
X, accept_sparse=("csr", "csc"), dtype=[xp.float64, xp.float32], reset=False
)
if self.mean_ is not None:
X = X - self.mean_
if issparse(X):
X = _implicit_column_offset(X, self.mean_)
else:
X = X - self.mean_
X_transformed = X @ self.components_.T
if self.whiten:
X_transformed /= xp.sqrt(self.explained_variance_)
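Editorial note on the transform change above: for sparse input, the mean subtraction is folded into the projection rather than densifying X. A small sketch of the underlying identity, with illustrative stand-ins for the fitted attributes:

# Sketch: (X - mean) @ C.T == X @ C.T - mean @ C.T, so the centered
# projection never needs a dense centered copy of X.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(100, 20, density=0.1, random_state=rng, format="csr")
mean = np.asarray(X.mean(axis=0)).ravel()   # stand-in for pca.mean_
components = rng.standard_normal((5, 20))   # stand-in for pca.components_

dense_path = (X.toarray() - mean) @ components.T      # dense input path
sparse_path = X @ components.T - mean @ components.T  # sparse-friendly path
np.testing.assert_allclose(dense_path, sparse_path, atol=1e-12)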
43 changes: 30 additions & 13 deletions sklearn/decomposition/_pca.py
@@ -26,6 +26,7 @@
from ..utils._param_validation import Interval, RealNotInt, StrOptions
from ..utils.deprecation import deprecated
from ..utils.extmath import fast_logdet, randomized_svd, stable_cumsum, svd_flip
from ..utils.sparsefuncs import _implicit_column_offset, mean_variance_axis
from ..utils.validation import check_is_fitted
from ._base import _BasePCA

@@ -422,7 +423,7 @@ def fit(self, X, y=None):

Parameters
----------
X : array-like of shape (n_samples, n_features)
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training data, where `n_samples` is the number of samples
and `n_features` is the number of features.

@@ -443,7 +444,7 @@ def fit_transform(self, X, y=None):

Parameters
----------
X : array-like of shape (n_samples, n_features)
X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training data, where `n_samples` is the number of samples
and `n_features` is the number of features.

@@ -476,12 +477,12 @@ def _fit(self, X):
"""Dispatch to the right submethod depending on the chosen solver."""
xp, is_array_api_compliant = get_namespace(X)

# Raise an error for sparse input.
# This is more informative than the generic one raised by check_array.
if issparse(X):
# Raise an error for sparse input and unsupported svd_solver
if issparse(X) and self.svd_solver != "arpack":
raise TypeError(
"PCA does not support sparse input. See "
"TruncatedSVD for a possible alternative."
'PCA only supports sparse inputs with the "arpack" solver, while '
f'"{self.svd_solver}" was passed. See TruncatedSVD for a possible'
" alternative."
)
# Raise an error for non-Numpy input and arpack solver.
if self.svd_solver == "arpack" and is_array_api_compliant:
@@ -490,7 +491,11 @@
)

X = self._validate_data(
X, dtype=[xp.float64, xp.float32], ensure_2d=True, copy=self.copy
X,
dtype=[xp.float64, xp.float32],
accept_sparse=("csr", "csc"),
ensure_2d=True,
copy=self.copy,
)

# Handle n_components==None
@@ -622,8 +627,14 @@ def _fit_truncated(self, X, n_components, svd_solver):
random_state = check_random_state(self.random_state)

# Center data
self.mean_ = xp.mean(X, axis=0)
X -= self.mean_
total_var = None
if issparse(X):
self.mean_, var = mean_variance_axis(X, axis=0)
total_var = var.sum() * n_samples / (n_samples - 1) # ddof=1
X = _implicit_column_offset(X, self.mean_)
else:
self.mean_ = xp.mean(X, axis=0)
X -= self.mean_

if svd_solver == "arpack":
v0 = _init_arpack_v0(min(X.shape), random_state)
@@ -655,9 +666,15 @@

# Workaround in-place variance calculation since at the time numpy
# did not have a way to calculate variance in-place.
N = X.shape[0] - 1
X **= 2
total_var = xp.sum(xp.sum(X, axis=0) / N)
#
# TODO: update this code to either:
# * Use the array-api variance calculation, unless memory usage suffers
# * Update sklearn.utils.extmath._incremental_mean_and_var to support array-api
# See: https://github.com/scikit-learn/scikit-learn/pull/18689#discussion_r1335540991
if total_var is None:
N = X.shape[0] - 1
X **= 2
total_var = xp.sum(X) / N

self.explained_variance_ratio_ = self.explained_variance_ / total_var
self.singular_values_ = xp.asarray(S, copy=True) # Store the singular values.
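Editorial note on the sparse branch of _fit_truncated above: mean_variance_axis returns biased (ddof=0) per-column variances, so their sum is rescaled by n_samples / (n_samples - 1) to match the ddof=1 total variance of the dense branch. A sketch of that identity on illustrative data:

# Sketch: rescaling ddof=0 column variances by n / (n - 1) reproduces the
# ddof=1 total variance computed on the densified matrix.
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

rng = np.random.default_rng(0)
X = sp.random(50, 10, density=0.3, random_state=rng, format="csc")
n_samples = X.shape[0]

_, var = mean_variance_axis(X, axis=0)  # ddof=0, computed without densifying
total_var_sparse = var.sum() * n_samples / (n_samples - 1)
total_var_dense = np.var(X.toarray(), axis=0, ddof=1).sum()
np.testing.assert_allclose(total_var_sparse, total_var_dense)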
143 changes: 130 additions & 13 deletions sklearn/decomposition/tests/test_pca.py
@@ -21,11 +21,28 @@
_get_check_estimator_ids,
check_array_api_input_and_values,
)
from sklearn.utils.fixes import CSR_CONTAINERS
from sklearn.utils.fixes import CSC_CONTAINERS, CSR_CONTAINERS

iris = datasets.load_iris()
PCA_SOLVERS = ["full", "arpack", "randomized", "auto"]

# `SPARSE_M` and `SPARSE_N` could be larger, but be aware:
# * SciPy's generation of random sparse matrix can be costly
# * A (SPARSE_M, SPARSE_N) dense array is allocated to compare against
SPARSE_M, SPARSE_N = 1000, 300 # arbitrary
SPARSE_MAX_COMPONENTS = min(SPARSE_M, SPARSE_N)


def _check_fitted_pca_close(pca1, pca2, rtol):
assert_allclose(pca1.components_, pca2.components_, rtol=rtol)
assert_allclose(pca1.explained_variance_, pca2.explained_variance_, rtol=rtol)
assert_allclose(pca1.singular_values_, pca2.singular_values_, rtol=rtol)
assert_allclose(pca1.mean_, pca2.mean_, rtol=rtol)
assert_allclose(pca1.n_components_, pca2.n_components_, rtol=rtol)
assert_allclose(pca1.n_samples_, pca2.n_samples_, rtol=rtol)
assert_allclose(pca1.noise_variance_, pca2.noise_variance_, rtol=rtol)
assert_allclose(pca1.n_features_in_, pca2.n_features_in_, rtol=rtol)


@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
@pytest.mark.parametrize("n_components", range(1, iris.data.shape[1]))
@@ -49,6 +66,118 @@ def test_pca(svd_solver, n_components):
assert_allclose(np.dot(cov, precision), np.eye(X.shape[1]), atol=1e-12)


@pytest.mark.parametrize("density", [0.01, 0.1, 0.30])
@pytest.mark.parametrize("n_components", [1, 2, 10])
@pytest.mark.parametrize("sparse_container", CSR_CONTAINERS + CSC_CONTAINERS)
@pytest.mark.parametrize("svd_solver", ["arpack"])
[Review thread on the svd_solver parametrization]

Member:
Doesn't this work with svd_solver="randomized"?

Member:
I gave it a quick try (by allowing svd_solver="randomized" and updating the tests to try with that solver), and the test does not pass with rtol=1e-7: the differences are way too big, so there is something wrong with the way this solver is implemented. +1 for trying to do this in a follow-up PR instead.

Member:
A strategy to implement this and identify the source of the discrepancy would be to first write tests to check that randomized_range_finder can work as expected on the implicitly centered linear operator, then move on to test equivalence on randomized_svd, and then on PCA(svd_solver="randomized").

Member:
Should we introduce svd_solver="randomized" in another PR?

Contributor (author), replying:
This is what I'd like to do.

This test was originally implemented in #24415 with support for randomized and lobpcg, but there are numerical stability issues. I think it's worth getting just arpack through right now since the performance improvement is so great.

Member:
For the future PR: the problem might be that the default choice of iterated_power is too low for such a strict test to pass. Increasing it locally (in the test) to a larger value such as 30 or 100 might make it work. I haven't tried yet.
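A hedged sketch of the first step suggested in this thread: check that randomized_range_finder captures the range of the implicitly centered operator. _implicit_column_offset is the private helper added by this PR; the sizes and tolerances below are illustrative, not taken from the PR.

# Sketch: compare the range found on the implicitly centered LinearOperator
# with a reconstruction of the explicitly centered dense matrix.
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_range_finder
from sklearn.utils.sparsefuncs import _implicit_column_offset

rng = np.random.default_rng(0)
X = sp.random(200, 50, density=0.1, random_state=rng, format="csr")
mean = np.asarray(X.mean(axis=0)).ravel()

op = _implicit_column_offset(X, mean)  # centered view; X stays sparse
Q = randomized_range_finder(op, size=12, n_iter=10, random_state=0)

Xc = X.toarray() - mean                # explicit centering, for reference
err = np.linalg.norm(Xc - Q @ (Q.T @ Xc)) / np.linalg.norm(Xc)
print(f"relative range-finder error: {err:.3e}")  # small => range captured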

@pytest.mark.parametrize("scale", [1, 10, 100])
def test_pca_sparse(
global_random_seed, svd_solver, sparse_container, n_components, density, scale
):
# Make sure any tolerance changes pass with SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all"
rtol = 5e-07
transform_rtol = 3e-05

random_state = np.random.default_rng(global_random_seed)
X = sparse_container(
sp.sparse.random(
SPARSE_M,
SPARSE_N,
random_state=random_state,
density=density,
)
)
# Scale the data + vary the column means
scale_vector = random_state.random(X.shape[1]) * scale
X = X.multiply(scale_vector)

pca = PCA(
n_components=n_components,
svd_solver=svd_solver,
random_state=global_random_seed,
)
pca.fit(X)

Xd = X.toarray()
pcad = PCA(
n_components=n_components,
svd_solver=svd_solver,
random_state=global_random_seed,
)
pcad.fit(Xd)

# Fitted attributes equality
_check_fitted_pca_close(pca, pcad, rtol=rtol)

# Test transform
X2 = sparse_container(
sp.sparse.random(
SPARSE_M,
SPARSE_N,
random_state=random_state,
density=density,
)
)
X2d = X2.toarray()

assert_allclose(pca.transform(X2), pca.transform(X2d), rtol=transform_rtol)
assert_allclose(pca.transform(X2), pcad.transform(X2d), rtol=transform_rtol)


@pytest.mark.parametrize("sparse_container", CSR_CONTAINERS + CSC_CONTAINERS)
def test_pca_sparse_fit_transform(global_random_seed, sparse_container):
random_state = np.random.default_rng(global_random_seed)
X = sparse_container(
sp.sparse.random(
SPARSE_M,
SPARSE_N,
random_state=random_state,
density=0.01,
)
)
X2 = sparse_container(
sp.sparse.random(
SPARSE_M,
SPARSE_N,
random_state=random_state,
density=0.01,
)
)

pca_fit = PCA(n_components=10, svd_solver="arpack", random_state=global_random_seed)
pca_fit_transform = PCA(
n_components=10, svd_solver="arpack", random_state=global_random_seed
)

pca_fit.fit(X)
transformed_X = pca_fit_transform.fit_transform(X)

_check_fitted_pca_close(pca_fit, pca_fit_transform, rtol=1e-10)
assert_allclose(transformed_X, pca_fit_transform.transform(X), rtol=2e-9)
assert_allclose(transformed_X, pca_fit.transform(X), rtol=2e-9)
assert_allclose(pca_fit.transform(X2), pca_fit_transform.transform(X2), rtol=2e-9)


@pytest.mark.parametrize("svd_solver", ["randomized", "full", "auto"])
@pytest.mark.parametrize("sparse_container", CSR_CONTAINERS + CSC_CONTAINERS)
def test_sparse_pca_solver_error(global_random_seed, svd_solver, sparse_container):
random_state = np.random.RandomState(global_random_seed)
X = sparse_container(
sp.sparse.random(
SPARSE_M,
SPARSE_N,
random_state=random_state,
)
)
pca = PCA(n_components=30, svd_solver=svd_solver)
error_msg_pattern = (
f'PCA only supports sparse inputs with the "arpack" solver, while "{svd_solver}"'
" was passed"
)
with pytest.raises(TypeError, match=error_msg_pattern):
pca.fit(X)


def test_no_empty_slice_warning():
# test if we avoid numpy warnings for computing over empty arrays
n_components = 10
@@ -502,18 +631,6 @@ def test_pca_svd_solver_auto(data, n_components, expected_solver):
assert_allclose(pca_auto.components_, pca_test.components_)


@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
@pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
def test_pca_sparse_input(svd_solver, csr_container):
X = np.random.RandomState(0).rand(5, 4)
X = csr_container(X)
assert sp.sparse.issparse(X)

pca = PCA(n_components=3, svd_solver=svd_solver)
with pytest.raises(TypeError):
pca.fit(X)


@pytest.mark.parametrize("svd_solver", PCA_SOLVERS)
def test_pca_deterministic_output(svd_solver):
rng = np.random.RandomState(0)
28 changes: 28 additions & 0 deletions sklearn/utils/sparsefuncs.py
@@ -10,6 +10,7 @@
# License: BSD 3 clause
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

from ..utils.fixes import _sparse_min_max, _sparse_nan_min_max
from ..utils.validation import _check_sample_weight
@@ -568,3 +569,30 @@ def csc_median_axis_0(X):
median[f_ind] = _get_median(data, nz)

return median


def _implicit_column_offset(X, offset):
"""Create an implicitly offset linear operator.

This is used by PCA on sparse data to avoid densifying the whole data
matrix.

Parameters
----------
X : sparse matrix of shape (n_samples, n_features)
offset : ndarray of shape (n_features,)

Returns
-------
centered : LinearOperator
"""
offset = offset[None, :]
XT = X.T
return LinearOperator(
matvec=lambda x: X @ x - offset @ x,
matmat=lambda x: X @ x - offset @ x,
rmatvec=lambda x: XT @ x - (offset * x.sum()),
rmatmat=lambda x: XT @ x - offset.T @ x.sum(axis=0)[None, :],
dtype=X.dtype,
shape=X.shape,
)
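A quick editorial sketch of how the operator above behaves: all four products agree with the explicitly centered dense matrix while X itself stays sparse (the data and shapes are illustrative):

# Sketch: the returned LinearOperator acts like (X - offset) under matvec,
# matmat, rmatvec and rmatmat, without ever densifying X.
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import _implicit_column_offset

rng = np.random.default_rng(0)
X = sp.random(30, 8, density=0.2, random_state=rng, format="csr")
offset = np.asarray(X.mean(axis=0)).ravel()
Xc = X.toarray() - offset              # explicit dense centering
op = _implicit_column_offset(X, offset)

v, M = rng.standard_normal(8), rng.standard_normal((8, 3))
w, W = rng.standard_normal(30), rng.standard_normal((30, 3))

np.testing.assert_allclose(op @ v, Xc @ v)           # matvec
np.testing.assert_allclose(op @ M, Xc @ M)           # matmat
np.testing.assert_allclose(op.rmatvec(w), Xc.T @ w)  # rmatvec
np.testing.assert_allclose(op.rmatmat(W), Xc.T @ W)  # rmatmat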