FEA Fused sparse-dense support for `PairwiseDistancesReduction` by jjerphan · Pull Request #23585 · scikit-learn/scikit-learn

Conversation

@jjerphan
Member
@jjerphan jjerphan commented Jun 10, 2022

Reference Issues/PRs

Comes after #23515.

Relates to #22587.

What does this implement/fix? Explain your changes.

Add `SparseSparseDatasetsPair`, `SparseDenseDatasetsPair` and `DenseSparseDatasetsPair` to bridge distance computations for pairs of fused sparse-dense datasets.

This adds support for dense-sparse, sparse-dense and sparse-sparse pairs of datasets for a variety of estimators, while at the same time improving performance for the existing sparse-sparse case.
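Conceptually, the dispatch on input sparsity can be sketched in pure Python as below. The class names mirror this PR, but the helper itself is hypothetical: the real dispatch lives in the Cython implementation.

```python
from scipy.sparse import issparse

# Illustrative sketch only: pick a DatasetsPair name from the sparsity
# of X and Y. The actual selection is done in Cython inside
# PairwiseDistancesReduction, not via a helper like this one.
def datasets_pair_name(X, Y):
    if issparse(X) and issparse(Y):
        return "SparseSparseDatasetsPair"
    if issparse(X):
        return "SparseDenseDatasetsPair"
    if issparse(Y):
        return "DenseSparseDatasetsPair"
    return "DenseDenseDatasetsPair"
```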

Note that this does not implement optimisation for the Euclidean and Squared Euclidean distances for reasons explained in #23585 (comment).

TODO

@jjerphan jjerphan changed the title POC Sparse support for PairwiseDistancesReduction MAINT Sparse support for PairwiseDistancesReduction Jun 15, 2022
@jjerphan jjerphan changed the title MAINT Sparse support for PairwiseDistancesReduction MAINT Fused sparse-dense support for PairwiseDistancesReduction Jun 15, 2022
jjerphan and others added 18 commits June 16, 2022 16:49
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of a hack for now.

IMO, it would be better to use a flatiter on a view if possible.
See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan
Member Author
jjerphan commented Sep 12, 2022

I have just added minimal changes to tests and documentation for the user API.

I think we can test combinations of sparse and dense datasets more thoroughly and systematically. This necessitates refactoring the tests, which we might prefer doing in another PR.

I think this PR is mergeable.

Member
@ogrisel ogrisel left a comment


A few comments below but otherwise the diff looks good to me.

However, when experimenting with this PR locally I found quite a significant performance regression in `LocalOutlierFactor` and `Birch` on sparse float32 data (I have not yet systematically tested the others), typically around 1.2x to 1.5x slower.

For instance for Birch:

  • on main:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 41.6 s, sys: 1.15 s, total: 42.7 s
Wall time: 10.9 s
  • on this branch:
In [1]: from sklearn.cluster import Birch
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time Birch().fit(X)
CPU times: user 1min 24s, sys: 619 ms, total: 1min 25s
Wall time: 12.6 s

and for LOF:

  • on main:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 25.3 s, sys: 3.21 s, total: 28.5 s
Wall time: 28.5 s
Out[1]: 
array([-2.25244  , -3.163138 , -3.0220103, ..., -3.5005074, -3.367448 ,
       -2.2913156], dtype=float32)
  • on this branch:
In [1]: from sklearn.neighbors import LocalOutlierFactor
   ...: from scipy import sparse as sp
   ...: import numpy as np
   ...: 
   ...: X = sp.random(50000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
   ...: %time lof = LocalOutlierFactor().fit(X)
   ...: lof.negative_outlier_factor_
CPU times: user 4min 48s, sys: 698 ms, total: 4min 49s
Wall time: 37.5 s
Out[1]: 
array([-2.25244047, -3.16313819, -3.02201071, ..., -3.50050721,
       -3.36744845, -2.29131578])

Also note that the `.negative_outlier_factor_` fitted attribute used to be float32 on `main` while it is upcast to float64 in this branch.

@jjerphan
Member Author
jjerphan commented Sep 15, 2022

The previous back-end is sometimes more performant because it manages to use the same decomposition for chunks of the Squared Euclidean distance matrix as the GEMM-based dense case, namely:

$$ \mathbf D^{\odot 2}_d \left(\mathbf{X}_{c}^{(l)}, \mathbf{Y}_{c}^{(k)}\right) \overset{\text{euclidean case}}{=} \left[\left\Vert \left({\mathbf{X}_c^{(l)}}\right)_{i\cdot}\right\Vert^2_2 \right]_{(i,j)\in [c]\times[c]} + \left[\left\Vert \left({\mathbf{Y}_c^{(k)}}\right)_{j\cdot }\right\Vert^2_2 \right]_{(i, j)\in [c]\times[c]} - 2 \mathbf{X}_c^{(l)} {\mathbf{Y}_c^{(k)}}^\top $$

where the first two terms are (likely) computed once and reused, and the last is computed via `safe_sparse_dot`, e.g. in `euclidean_distances` for the sparse float32 case:

d = -2 * safe_sparse_dot(X_chunk, Y_chunk.T, dense_output=True)
d += XX_chunk
d += YY_chunk
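The decomposition above can be checked numerically with a small self-contained snippet using plain NumPy/SciPy, independent of the scikit-learn internals:

```python
import numpy as np
from scipy import sparse as sp
from scipy.spatial.distance import cdist

X = sp.random(20, 30, density=0.2, format="csr", random_state=0)
Y = sp.random(10, 30, density=0.2, format="csr", random_state=1)

# Squared row norms: the two terms that can be computed once and reused
XX = np.asarray(X.multiply(X).sum(axis=1))    # shape (20, 1)
YY = np.asarray(Y.multiply(Y).sum(axis=1)).T  # shape (1, 10)

# -2 X Y^T via sparse matmul, then add the norm terms by broadcasting
D2 = -2 * np.asarray((X @ Y.T).todense())
D2 += XX
D2 += YY

# Matches the directly computed squared Euclidean distances
assert np.allclose(D2, cdist(X.toarray(), Y.toarray(), "sqeuclidean"))
```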

In this case, safe_sparse_dot calls scipy._csr_array.__matmul__ which ultimately routes to the following C++ routines:

  • csr_matmat for the CSR × CSR case
  • csr_matvecs for the CSR × Dense case
  • csc_matvecs for the Dense × CSR case (seen as transposed result of the CSR × Dense case)
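The routing itself is internal to SciPy, but the resulting behavior can be exercised from Python; this snippet only checks the numerical results and output kinds of the sparse-dense and sparse-sparse products:

```python
import numpy as np
from scipy import sparse as sp

rng = np.random.default_rng(0)
A = sp.random(5, 8, density=0.5, format="csr", random_state=0)  # CSR operand
B = rng.standard_normal((8, 3))                                 # dense operand

# CSR x dense product: yields a dense result
C = np.asarray(A @ B)
assert np.allclose(C, A.toarray() @ B)

# CSR x CSR product: yields a sparse result
S = A @ A.T
assert sp.issparse(S)
assert np.allclose(S.toarray(), A.toarray() @ A.toarray().T)
```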

I think we might be able to use the Euclidean specialisation for the fused {sparse, dense}² cases by extending `GEMMTermComputer` to call the relevant SciPy C++ routines (in which case `GEMMTermComputer` will be renamed, because GEMM won't be called once sparse data is supported).

I would prefer exploring this in another PR and, in the meantime, just deactivate the new implementations proposed in this PR for user-facing APIs that are subject to regressions (by using a `sklearn.config_context` context manager, or by overriding `is_usable_for` to return `False` if `metric="*euclidean"` and at least one sparse dataset is passed, as done in 1d7bcc7).
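A rough pure-Python sketch of such a guard is shown below; the function name and metric strings mirror the discussion, but this is only an illustration, not the actual Cython-side change made in 1d7bcc7.

```python
from scipy.sparse import issparse

# Hypothetical sketch of the proposed guard: refuse the new back-end
# when a (squared) Euclidean metric is combined with sparse input, so
# callers fall back to the previous implementation.
def is_usable_for(X, Y, metric):
    if metric in ("euclidean", "sqeuclidean") and (issparse(X) or issparse(Y)):
        return False
    return True  # other checks (dtypes, metric support, ...) elided
```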

What do you think?

`py-spy` profile
# script.py
from sklearn.cluster import Birch
from scipy import sparse as sp
import numpy as np

X = sp.random(5000, 1000, density=0.01, format="csr", random_state=0).astype(np.float32)
Birch().fit(X)

py-spy record --native -o py-spy-`git rev-parse HEAD`.svg -f speedscope -- python ./script.py

On main: py-spy-profile-851d02f2ae0dbf258f7cd2f1214ba47c79d1737a

On this branch: py-spy-profile-1eb5b2c3ce31f96eff0922adbdd236c5f0a5c879

To browse on https://www.speedscope.app/

@ogrisel
Member
ogrisel commented Sep 16, 2022

What do you think?

Thanks for your investigations. I agree with your plan.

Before merging this, what do you think of #23585 (comment) ?

@jjerphan
Member Author

Before merging this, what do you think of #23585 (comment) ?

Oh, I have already accepted this sensible suggestion in fcf15b6. 🙂

Member
@ogrisel ogrisel left a comment


LGTM. I did some manual tests quickly with the updated PR and did not observe any regression.

Contributor
@Micky774 Micky774 left a comment


Sorry for taking so long to look at this. Overall looks great! I just left a minor suggestion for simplifying the tests a bit, and a couple of optional nits.

Edit: Should this also remove the comments here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

and here:

# TODO: once ArgKmin supports sparse input matrices and 32 bit,
# we won't need to fallback to pairwise_distances_chunked anymore.

jjerphan and others added 2 commits September 20, 2022 09:56
…and add a space somewhere for proper formatting.

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
See: `BaseDistanceReductionDispatcher.valid_metrics`

Co-authored-by: Meekail Zain <Micky774@users.noreply.github.com>
@ogrisel ogrisel merged commit 60cc5b5 into scikit-learn:main Sep 20, 2022
@ogrisel
Member
ogrisel commented Sep 20, 2022

Merged! Thank you @jjerphan and everybody else!

@jjerphan
Member Author

Thank you for the reviews, @ogrisel, @thomasjpfan and @Micky774!
