FEA Fused sparse-dense support for `PairwiseDistancesReduction` by jjerphan · Pull Request #23585 · scikit-learn/scikit-learn

Conversation

@jjerphan
Member
@jjerphan jjerphan commented Jun 10, 2022

Reference Issues/PRs

Comes after #23515.

Relates to #22587.

What does this implement/fix? Explain your changes.

Add `SparseSparseDatasetsPair`, `SparseDenseDatasetsPair` and `DenseSparseDatasetsPair` to bridge distance computations for pairs of fused sparse-dense datasets.

This adds support for dense-sparse, sparse-dense and sparse-sparse pairs of datasets for a variety of estimators, while at the same time improving performance for the existing sparse-sparse case.
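Conceptually, the dispatch on input sparsity can be sketched in pure Python as below. The class names mirror this PR, but the helper itself is hypothetical: the real dispatch lives in the Cython implementation.

```python
from scipy.sparse import issparse

# Illustrative sketch only: pick a DatasetsPair name from the sparsity
# of X and Y. The actual selection is done in Cython inside
# PairwiseDistancesReduction, not via a helper like this one.
def datasets_pair_name(X, Y):
    if issparse(X) and issparse(Y):
        return "SparseSparseDatasetsPair"
    if issparse(X):
        return "SparseDenseDatasetsPair"
    if issparse(Y):
        return "DenseSparseDatasetsPair"
    return "DenseDenseDatasetsPair"
```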

Note that this does not implement optimisation for the Euclidean and Squared Euclidean distances for reasons explained in #23585 (comment).

TODO

@jjerphan jjerphan changed the title POC Sparse support for PairwiseDistancesReduction MAINT Sparse support for PairwiseDistancesReduction Jun 15, 2022
@jjerphan jjerphan changed the title MAINT Sparse support for PairwiseDistancesReduction MAINT Fused sparse-dense support for PairwiseDistancesReduction Jun 15, 2022
jjerphan and others added 18 commits June 16, 2022 16:49
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of a hack for now.

IMO, it would be better to use a flatiter on a view if possible.
See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan
Member Author
jjerphan commented Sep 12, 2022

I have just added minimal changes to tests and documentation for the user API.

I think we can test combinations of sparse and dense datasets more thoroughly and systematically. This necessitates refactoring the tests, which we might prefer doing in another PR.

I think this PR is mergeable.

Member
@ogrisel ogrisel left a comment


A few comments below but otherwise the diff looks good to me.

However, when experimenting with this PR locally I found quite a significant performance regression in `LocalOutlierFactor` and `Birch` on sparse float32 data (I have not yet systematically tested the others), typically around 1.2x to 1.5x slower.

For instance for Birch:

  • on main:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 41.6 s, sys: 1.15 s, total: 42.7 s
Wall time: 10.9 s
  • on this branch:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 1min 24s, sys: 619 ms, total: 1min 25s
Wall time: 12.6 s

and for LOF:

  • on main:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 25.3 s, sys: 3.21 s, total: 28.5 s
Wall time: 28.5 s
Out[1]: 
array([-2.25244  , -3.163138 , -3.0220103, ..., -3.5005074, -3.367448 ,
       -2.2913156], dtype=float32)
  • on this branch:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 4min 48s, sys: 698 ms, total: 4min 49s
Wall time: 37.5 s
Out[1]: 
array([-2.25244047, -3.16313819, -3.02201071, ..., -3.50050721,
       -3.36744845, -2.29131578])

Also note that the `.negative_outlier_factor_` fitted attribute used to be float32 on `main` while it is upcast to float64 in this branch.

@jjerphan
Member Author
jjerphan commented Sep 15, 2022

The previous back-end is sometimes more performant because it manages to use the same decomposition for chunks of the Squared Euclidean distance matrix as the GEMM-based dense case, namely:

$$ \mathbf D^{\odot 2}_d \left(\mathbf{X}_{c}^{(l)}, \mathbf{Y}_{c}^{(k)}\right) \overset{\text{euclidean case}}{=} \left[\left\Vert \left({\mathbf{X}_c^{(l)}}\right)_{i\cdot}\right\Vert^2_2 \right]_{(i,j)\in [c]\times[c]} + \left[\left\Vert \left({\mathbf{Y}_c^{(k)}}\right)_{j\cdot }\right\Vert^2_2 \right]_{(i, j)\in [c]\times[c]} - 2 \mathbf{X}_c^{(l)} {\mathbf{Y}_c^{(k)}}^\top $$

where the first two terms are (likely) computed once and reused, and the last is computed via `safe_sparse_dot`, e.g. in `euclidean_distances` for the sparse float32 case:

d = -2 * safe_sparse_dot(X_chunk, Y_chunk.T, dense_output=True)
d += XX_chunk
d += YY_chunk
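The decomposition above can be checked numerically with a small self-contained snippet using plain NumPy/SciPy, independent of the scikit-learn internals:

```python
import numpy as np
from scipy import sparse as sp
from scipy.spatial.distance import cdist

X = sp.random(20, 30, density=0.2, format="csr", random_state=0)
Y = sp.random(10, 30, density=0.2, format="csr", random_state=1)

# Squared row norms: the two terms that can be computed once and reused
XX = np.asarray(X.multiply(X).sum(axis=1))    # shape (20, 1)
YY = np.asarray(Y.multiply(Y).sum(axis=1)).T  # shape (1, 10)

# -2 X Y^T via sparse matmul, then add the norm terms by broadcasting
D2 = -2 * np.asarray((X @ Y.T).todense())
D2 += XX
D2 += YY

# Matches the directly computed squared Euclidean distances
assert np.allclose(D2, cdist(X.toarray(), Y.toarray(), "sqeuclidean"))
```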

In this case, safe_sparse_dot calls scipy._csr_array.__matmul__ which ultimately routes to the following C++ routines:

  • csr_matmat for the CSR × CSR case
  • csr_matvecs for the CSR × Dense case
  • csc_matvecs for the Dense × CSR case (seen as transposed result of the CSR × Dense case)
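The routing itself is internal to SciPy, but the resulting behavior can be exercised from Python; this snippet only checks the numerical results and output kinds of the sparse-dense and sparse-sparse products:

```python
import numpy as np
from scipy import sparse as sp

rng = np.random.default_rng(0)
A = sp.random(5, 8, density=0.5, format="csr", random_state=0)  # CSR operand
B = rng.standard_normal((8, 3))                                 # dense operand

# CSR x dense product: yields a dense result
C = np.asarray(A @ B)
assert np.allclose(C, A.toarray() @ B)

# CSR x CSR product: yields a sparse result
S = A @ A.T
assert sp.issparse(S)
assert np.allclose(S.toarray(), A.toarray() @ A.toarray().T)
```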

I think we might be able to use the Euclidean specialisation for the fused {sparse, dense}² cases by extending `GEMMTermComputer` to call the relevant SciPy C++ routines (in which case `GEMMTermComputer` will be renamed, because GEMM won't be called once sparse data is supported).

I would prefer exploring this in another PR and, in the meantime, just deactivate the new implementations proposed in this PR for user-facing APIs that are subject to regressions (by using a `sklearn.config_context` context manager, or by overriding `is_usable_for` to return `False` if `metric="*euclidean"` and at least one sparse dataset is passed, as done in 1d7bcc7).
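A rough pure-Python sketch of such a guard is shown below; the function name and metric strings mirror the discussion, but this is only an illustration, not the actual Cython-side change made in 1d7bcc7.

```python
from scipy.sparse import issparse

# Hypothetical sketch of the proposed guard: refuse the new back-end
# when a (squared) Euclidean metric is combined with sparse input, so
# callers fall back to the previous implementation.
def is_usable_for(X, Y, metric):
    if metric in ("euclidean", "sqeuclidean") and (issparse(X) or issparse(Y)):
        return False
    return True  # other checks (dtypes, metric support, ...) elided
```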

What do you think?

`py-spy` profile
# script.py
from sklearn.cluster import Birch
from scipy import sparse as sp
import numpy as np

X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
Birch().fit(X)

py-spy record --native -o py-spy-`git rev-parse HEAD`.svg -f speedscope -- python ./script.py

On main: py-spy-profile-851d02f2ae0dbf258f7cd2f1214ba47c79d1737a

On this branch: py-spy-profile-1eb5b2c3ce31f96eff0922adbdd236c5f0a5c879

To browse on https://www.speedscope.app/

@ogrisel
Member
ogrisel commented Sep 16, 2022

What do you think?

Thanks for your investigations. I agree with your plan.

Before merging this, what do you think of #23585 (comment) ?

@jjerphan
Member Author

Before merging this, what do you think of #23585 (comment) ?

Oh, I have already accepted this sensible suggestion in fcf15b6. 🙂

Member
@ogrisel ogrisel left a comment


LGTM. I did some manual tests quickly with the updated PR and did not observe any regression.

Contributor
@Micky774 Micky774 left a comment


Sorry for taking so long to look at this. Overall looks great! I just left a minor suggestion for simplifying the tests a bit, and a couple of optional nits.

Edit: Should this also remove the comments here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

and here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

jjerphan and others added 2 commits September 20, 2022 09:56
…and add a space somewhere for proper formatting.

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
See: `BaseDistanceReductionDispatcher.valid_metrics`

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
@ogrisel ogrisel merged commit 60cc5b5 into scikit-learn:main Sep 20, 2022
@ogrisel
Member
ogrisel commented Sep 20, 2022

Merged! Thank you @jjerphan and everybody else!

@jjerphan
Member Author

Thank you for the reviews, @ogrisel, @thomasjpfan and @Micky774!
