FEA Add support for float32 on `PairwiseDistancesReduction` using Tem… · scikit-learn/scikit-learn@b7d0171 · GitHub

Commit b7d0171

jjerphan, thomasjpfan and ogrisel authored
FEA Add support for float32 on PairwiseDistancesReduction using Tempita (#23865)
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
1 parent 5f604b7 commit b7d0171

21 files changed, with 719 additions and 227 deletions

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -87,3 +87,13 @@ sklearn/utils/_weight_vector.pxd
 sklearn/linear_model/_sag_fast.pyx
 sklearn/metrics/_dist_metrics.pyx
 sklearn/metrics/_dist_metrics.pxd
+sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd
+sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx
+sklearn/metrics/_pairwise_distances_reduction/_base.pxd
+sklearn/metrics/_pairwise_distances_reduction/_base.pyx
+sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
+sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx
+sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pxd
+sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pyx
+sklearn/metrics/_pairwise_distances_reduction/_radius_neighborhood.pxd
+sklearn/metrics/_pairwise_distances_reduction/_radius_neighborhood.pyx

doc/whats_new/v1.1.rst

Lines changed: 5 additions & 1 deletion
@@ -296,7 +296,11 @@ Changelog
 
   For instance :class:`sklearn.neighbors.NearestNeighbors.kneighbors` and
   :class:`sklearn.neighbors.NearestNeighbors.radius_neighbors`
-  can respectively be up to ×20 and ×5 faster than previously.
+  can respectively be up to ×20 and ×5 faster than previously on a laptop.
+
+  Moreover, implementations of those two algorithms are now suitable
+  for machine with many cores, making them usable for datasets consisting
+  of millions of samples.
 
   :pr:`21987`, :pr:`22064`, :pr:`22065`, :pr:`22288` and :pr:`22320`
   by :user:`Julien Jerphanion <jjerphan>`.

doc/whats_new/v1.2.rst

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,42 @@ Changelog
7373
parameter `base_estimator` is deprecated and will be removed in 1.4.
7474
:pr:`22054` by :user:`Kevin Roice <kevroi>`.
7575

76+
- |Efficiency| Low-level routines for reductions on pairwise distances
77+
for dense float32 datasets have been refactored. The following functions
78+
and estimators now benefit from improved performances in terms of hardware
79+
scalability and speed-ups:
80+
81+
- :func:`sklearn.metrics.pairwise_distances_argmin`
82+
- :func:`sklearn.metrics.pairwise_distances_argmin_min`
83+
- :class:`sklearn.cluster.AffinityPropagation`
84+
- :class:`sklearn.cluster.Birch`
85+
- :class:`sklearn.cluster.MeanShift`
86+
- :class:`sklearn.cluster.OPTICS`
87+
- :class:`sklearn.cluster.SpectralClustering`
88+
- :func:`sklearn.feature_selection.mutual_info_regression`
89+
- :class:`sklearn.neighbors.KNeighborsClassifier`
90+
- :class:`sklearn.neighbors.KNeighborsRegressor`
91+
- :class:`sklearn.neighbors.RadiusNeighborsClassifier`
92+
- :class:`sklearn.neighbors.RadiusNeighborsRegressor`
93+
- :class:`sklearn.neighbors.LocalOutlierFactor`
94+
- :class:`sklearn.neighbors.NearestNeighbors`
95+
- :class:`sklearn.manifold.Isomap`
96+
- :class:`sklearn.manifold.LocallyLinearEmbedding`
97+
- :class:`sklearn.manifold.TSNE`
98+
- :func:`sklearn.manifold.trustworthiness`
99+
- :class:`sklearn.semi_supervised.LabelPropagation`
100+
- :class:`sklearn.semi_supervised.LabelSpreading`
101+
102+
For instance :class:`sklearn.neighbors.NearestNeighbors.kneighbors` and
103+
:class:`sklearn.neighbors.NearestNeighbors.radius_neighbors`
104+
can respectively be up to ×20 and ×5 faster than previously on a laptop.
105+
106+
Moreover, implementations of those two algorithms are now suitable
107+
for machine with many cores, making them usable for datasets consisting
108+
of millions of samples.
109+
110+
:pr:`23865` by :user:`Julien Jerphanion <jjerphan>`.
111+
76112
:mod:`sklearn.cluster`
77113
......................
78114

setup.cfg

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -67,6 +67,17 @@ ignore =
6767
sklearn/utils/_weight_vector.pxd
6868
sklearn/metrics/_dist_metrics.pyx
6969
sklearn/metrics/_dist_metrics.pxd
70+
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd
71+
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx
72+
sklearn/metrics/_pairwise_distances_reduction/_base.pxd
73+
sklearn/metrics/_pairwise_distances_reduction/_base.pyx
74+
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
75+
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx
76+
sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pxd
77+
sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pyx
78+
sklearn/metrics/_pairwise_distances_reduction/_radius_neighborhood.pxd
79+
sklearn/metrics/_pairwise_distances_reduction/_radius_neighborhood.pyx
80+
7081

7182
[codespell]
7283
skip = ./.git,./.mypy_cache,./doc/themes/scikit-learn-modern/static/js,./sklearn/feature_extraction/_stop_words.py,./doc/_build,./doc/auto_examples,./doc/modules/generated

sklearn/manifold/tests/test_t_sne.py

Lines changed: 43 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1048,32 +1048,54 @@ def test_gradient_bh_multithread_match_sequential():
10481048
assert_allclose(grad_multithread, grad_multithread)
10491049

10501050

1051-
def test_tsne_with_different_distance_metrics():
1051+
@pytest.mark.parametrize(
1052+
"metric, dist_func",
1053+
[("manhattan", manhattan_distances), ("cosine", cosine_distances)],
1054+
)
1055+
@pytest.mark.parametrize("method", ["barnes_hut", "exact"])
1056+
def test_tsne_with_different_distance_metrics(metric, dist_func, method):
10521057
"""Make sure that TSNE works for different distance metrics"""
1058+
1059+
if method == "barnes_hut" and metric == "manhattan":
1060+
# The distances computed by `manhattan_distances` differ slightly from those
1061+
# computed internally by NearestNeighbors via the PairwiseDistancesReduction
1062+
# Cython code-based. This in turns causes T-SNE to converge to a different
1063+
# solution but this should not impact the qualitative results as both
1064+
# methods.
1065+
# NOTE: it's probably not valid from a mathematical point of view to use the
1066+
# Manhattan distance for T-SNE...
1067+
# TODO: re-enable this test if/when `manhattan_distances` is refactored to
1068+
# reuse the same underlying Cython code NearestNeighbors.
1069+
# For reference, see:
1070+
# https://github.com/scikit-learn/scikit-learn/pull/23865/files#r925721573
1071+
pytest.xfail(
1072+
"Distance computations are different for method == 'barnes_hut' and metric"
1073+
" == 'manhattan', but this is expected."
1074+
)
1075+
10531076
random_state = check_random_state(0)
10541077
n_components_original = 3
10551078
n_components_embedding = 2
10561079
X = random_state.randn(50, n_components_original).astype(np.float32)
1057-
metrics = ["manhattan", "cosine"]
1058-
dist_funcs = [manhattan_distances, cosine_distances]
1059-
for metric, dist_func in zip(metrics, dist_funcs):
1060-
X_transformed_tsne = TSNE(
1061-
metric=metric,
1062-
n_components=n_components_embedding,
1063-
random_state=0,
1064-
n_iter=300,
1065-
init="random",
1066-
learning_rate="auto",
1067-
).fit_transform(X)
1068-
X_transformed_tsne_precomputed = TSNE(
1069-
metric="precomputed",
1070-
n_components=n_components_embedding,
1071-
random_state=0,
1072-
n_iter=300,
1073-
init="random",
1074-
learning_rate="auto",
1075-
).fit_transform(dist_func(X))
1076-
assert_array_equal(X_transformed_tsne, X_transformed_tsne_precomputed)
1080+
X_transformed_tsne = TSNE(
1081+
metric=metric,
1082+
method=method,
1083+
n_components=n_components_embedding,
1084+
random_state=0,
1085+
n_iter=300,
1086+
init="random",
1087+
learning_rate="auto",
1088+
).fit_transform(X)
1089+
X_transformed_tsne_precomputed = TSNE(
1090+
metric="precomputed",
1091+
method=method,
1092+
n_components=n_components_embedding,
1093+
random_state=0,
1094+
n_iter=300,
1095+
init="random",
1096+
learning_rate="auto",
1097+
).fit_transform(dist_func(X))
1098+
assert_array_equal(X_transformed_tsne, X_transformed_tsne_precomputed)
10771099

10781100

10791101
# TODO: Remove in 1.2
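
Note on the test change above: stacking two `pytest.mark.parametrize` decorators makes pytest generate one test per combination of their values, so the former two-iteration loop becomes four independent cases, one of which is marked as an expected failure. The following is a minimal stdlib-only sketch of that enumeration; it does not call pytest, and the helper name `expected_outcome` is hypothetical.

```python
from itertools import product

# Parameter values copied from the two decorators in the diff.
metrics = ["manhattan", "cosine"]
methods = ["barnes_hut", "exact"]


def expected_outcome(metric, method):
    """Mirror the xfail guard at the top of the rewritten test body."""
    if method == "barnes_hut" and metric == "manhattan":
        return "xfail"
    return "run"


# Stacked parametrize decorators yield the Cartesian product of their values.
cases = {
    (metric, method): expected_outcome(metric, method)
    for metric, method in product(metrics, methods)
}

for (metric, method), outcome in cases.items():
    print(f"metric={metric!r:12} method={method!r:13} -> {outcome}")
```

Because each combination is now a separate test, a failure for one metric or method no longer hides the result of the others, which the original `for` loop could not guarantee.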

sklearn/metrics/_dist_metrics.pxd.tp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -72,7 +72,7 @@ cdef inline DTYPE_t euclidean_rdist_to_dist{{name_suffix}}(const {{INPUT_DTYPE_t
7272

7373

7474
######################################################################
75-
# DistanceMetric base class
75+
# DistanceMetric{{name_suffix}} base class
7676
cdef class DistanceMetric{{name_suffix}}:
7777
# The following attributes are required for a few of the subclasses.
7878
# we must define them here so that cython's limited polymorphism will work.
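
Note on the `{{name_suffix}}` placeholders: in a `.tp` template, each placeholder is substituted once per entry of a value list, so a single class definition yields one generated class per dtype. The sketch below imitates that expansion with plain string replacement; it is a stand-in, not the Tempita engine, and the helper `expand` is hypothetical. The suffix and type values are borrowed from the `_argkmin` template added later in this commit; the values actually used by `_dist_metrics.pxd.tp` are not shown in this hunk.

```python
# Values borrowed from the `_argkmin` template in this commit (assumption:
# `_dist_metrics.pxd.tp` may use different suffixes, which this hunk hides).
SUFFIX_VALUES = [("64", "cnp.float64_t"), ("32", "cnp.float32_t")]

# Illustrative one-line template; the trailing comment is not from the source.
TEMPLATE = "cdef class DistanceMetric{{name_suffix}}:  # operates on {{INPUT_DTYPE_t}}"


def expand(template, **values):
    """Replace each {{name}} placeholder with its value (plain str.replace)."""
    out = template
    for name, value in values.items():
        out = out.replace("{{" + name + "}}", value)
    return out


# One generated line per (suffix, C type) pair, as a template loop would emit.
generated = [
    expand(TEMPLATE, name_suffix=suffix, INPUT_DTYPE_t=ctype)
    for suffix, ctype in SUFFIX_VALUES
]

for line in generated:
    print(line)
```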

sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd

Lines changed: 0 additions & 33 deletions
This file was deleted.
Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
{{py:
2+
3+
implementation_specific_values = [
4+
# Values are the following ones:
5+
#
6+
# name_suffix, INPUT_DTYPE_t, INPUT_DTYPE
7+
#
8+
# We also use the float64 dtype and C-type names as defined in
9+
# `sklearn.utils._typedefs` to maintain consistency.
10+
#
11+
('64', 'cnp.float64_t', 'np.float64'),
12+
('32', 'cnp.float32_t', 'np.float32')
13+
]
14+
15+
}}
16+
17+
cimport numpy as cnp
18+
from ...utils._typedefs cimport ITYPE_t, DTYPE_t
19+
20+
cnp.import_array()
21+
22+
{{for name_suffix, INPUT_DTYPE_t, INPUT_DTYPE in implementation_specific_values}}
23+
24+
from ._base cimport PairwiseDistancesReduction{{name_suffix}}
25+
from ._gemm_term_computer cimport GEMMTermComputer{{name_suffix}}
26+
27+
cdef class PairwiseDistancesArgKmin{{name_suffix}}(PairwiseDistancesReduction{{name_suffix}}):
28+
"""{{name_suffix}}bit implementation of PairwiseDistancesArgKmin."""
29+
30+
cdef:
31+
ITYPE_t k
32+
33+
ITYPE_t[:, ::1] argkmin_indices
34+
DTYPE_t[:, ::1] argkmin_distances
35+
36+
# Used as array of pointers to private datastructures used in threads.
37+
DTYPE_t ** heaps_r_distances_chunks
38+
ITYPE_t ** heaps_indices_chunks
39+
40+
41+
cdef class FastEuclideanPairwiseDistancesArgKmin{{name_suffix}}(PairwiseDistancesArgKmin{{name_suffix}}):
42+
"""EuclideanDistance-specialized {{name_suffix}}bit implementation for PairwiseDistancesArgKmin."""
43+
cdef:
44+
GEMMTermComputer{{name_suffix}} gemm_term_computer
45+
const DTYPE_t[::1] X_norm_squared
46+
const DTYPE_t[::1] Y_norm_squared
47+
48+
bint use_squared_distances
49+
50+
{{endfor}}
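
Note on the Euclidean specialization: the attribute names `X_norm_squared`, `Y_norm_squared` and `gemm_term_computer` suggest the usual decomposition ||x - y||² = ||x||² - 2·x·y + ||y||², where the cross term for all pairs is a matrix product. This reading is an interpretation of the declarations, not something the hunk states. The sketch below illustrates the decomposition in pure Python under that assumption; the function names and the `squared` flag (loosely modelled on `use_squared_distances`) are hypothetical, and nothing here reflects the Cython implementation.

```python
import math


def squared_norms(rows):
    """Squared Euclidean norm of each row, computed once and reused."""
    return [sum(v * v for v in row) for row in rows]


def euclidean_via_norms(X, Y, squared=False):
    """Pairwise distances using ||x||^2 - 2 x.y + ||y||^2 (illustrative)."""
    X_norm_squared = squared_norms(X)
    Y_norm_squared = squared_norms(Y)
    distances = []
    for i, x in enumerate(X):
        row = []
        for j, y in enumerate(Y):
            # Cross term; over all (i, j) pairs this is the matrix product X @ Y.T.
            dot = sum(a * b for a, b in zip(x, y))
            d2 = X_norm_squared[i] - 2.0 * dot + Y_norm_squared[j]
            # Rounding can make d2 slightly negative, so clip at zero.
            d2 = max(0.0, d2)
            row.append(d2 if squared else math.sqrt(d2))
        distances.append(row)
    return distances


X = [[0.0, 0.0], [3.0, 4.0]]
Y = [[0.0, 0.0], [1.0, 1.0]]
print(euclidean_via_norms(X, Y, squared=True))
```

Keeping squared distances while ranking neighbors avoids the square root until the final result, since the ordering of squared and non-squared distances is the same.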
