ENH Introduce `PairwiseDistancesReduction` and `PairwiseDistancesArgK… · scikit-learn/scikit-learn@6a16763 · GitHub

Commit 6a16763

jjerphan, thomasjpfan, lorentzenchr, ogrisel, and jeremiedbb authored
ENH Introduce PairwiseDistancesReduction and PairwiseDistancesArgKmin (feature branch) (#22134)
* MAINT Introduce Pairwise Distances Reductions private submodule (#22064)
* MAINT Introduce FastEuclideanPairwiseArgKmin (#22065)
* fixup! Merge branch 'main' into pairwise-distances-argkmin
  Remove duplicated Bunch
* MAINT Plug `PairwiseDistancesArgKmin` as a back-end (#22288)
* Forward pairwise_dist_chunk_size in the configuration
* Flip finalized results for PairwiseDistancesArgKmin
  The previous would have made the code more complex by introducing some boilerplate for the interface plugs. Having it this way actually simplifies the code. This also removes the haversine branch for test_pairwise_distances_argkmin
* Plug PairwiseDistancesArgKmin as a back-end
* Adapt test accordingly
* Add whats_new entry
* Change input validation order for kneighbors
* Remove duplicated test_neighbors_distance_metric_deprecation
* Adapt the documentation
* Add mahalanobis case to test fixtures
* Correct whats_new entry
* CLN Remove unneeded private metric attribute
  This was needed when 'fast_sqeuclidean' and 'fast_euclidean' were present to choose the best implementation based on the user specification. Those metric have been removed since then, making this attribute useless.
* TST Assert FutureWarning instead of DeprecationWarning in test_neighbors_metrics
* MAINT Add use_pairwise_dist_activate to scikit-learn config
* TST Add a test for the 'brute' backends' results' consistency
  Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
* fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config
* fixup! fixup! MAINT Add use_pairwise_dist_activate to scikit-learn config
* TST Filter FutureWarning for WMinkowskiDistance
* MAINT pin numpydoc in arm for now (#22292)
* fixup! TST Filter FutureWarning for WMinkowskiDistance
* Revert keywords arguments removal for the GEMM trick for 'euclidean'
* MAINT pin max numpydoc for now (#22286)
* Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS
* fixup! Add 'haversine' to CDIST_PAIRWISE_DISTANCES_REDUCTION_COMMON_METRICS
* Apply suggestions from code review
* MAINT Document some config parameters for maintenance
  Also rename one of them.
* FIX Support and test one of 'sqeuclidean' specification
  Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FIX Various typos fix and correct haversine
  'haversine' is not supported by cdist.
* Directly use get_config
* CLN Apply comments from review
* Motivate swapped returned values
* TST Remove mahalanobis from test fixtures
* MNT Add comment regaduction functions' signatures
* TST Complete test for `pairwise_distance_{argmin,argmin_min}` (#22371)
* DOC Add sub-pull requests to the whats_new entry
* DOC place comment inside functions
* DOC move up whatsnew entry

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
1 parent f5f01a3 commit 6a16763

17 files changed: +2062 -60 lines changed

doc/whats_new/v1.1.rst

Lines changed: 29 additions & 0 deletions
@@ -65,6 +65,35 @@ Changelog
     :pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
     where 123456 is the *pull request* number, not the issue number.

+- |Efficiency| Low-level routines for reductions on pairwise distances
+  for dense float64 datasets have been refactored. The following functions
+  and estimators now benefit from improved performance, in particular on
+  multi-core machines:
+  - :func:`sklearn.metrics.pairwise_distances_argmin`
+  - :func:`sklearn.metrics.pairwise_distances_argmin_min`
+  - :class:`sklearn.cluster.AffinityPropagation`
+  - :class:`sklearn.cluster.Birch`
+  - :class:`sklearn.cluster.MeanShift`
+  - :class:`sklearn.cluster.OPTICS`
+  - :class:`sklearn.cluster.SpectralClustering`
+  - :func:`sklearn.feature_selection.mutual_info_regression`
+  - :class:`sklearn.neighbors.KNeighborsClassifier`
+  - :class:`sklearn.neighbors.KNeighborsRegressor`
+  - :class:`sklearn.neighbors.LocalOutlierFactor`
+  - :class:`sklearn.neighbors.NearestNeighbors`
+  - :class:`sklearn.manifold.Isomap`
+  - :class:`sklearn.manifold.LocallyLinearEmbedding`
+  - :class:`sklearn.manifold.TSNE`
+  - :func:`sklearn.manifold.trustworthiness`
+  - :class:`sklearn.semi_supervised.LabelPropagation`
+  - :class:`sklearn.semi_supervised.LabelSpreading`
+
+  For instance :meth:`sklearn.neighbors.NearestNeighbors.kneighbors`
+  can be up to ×20 faster than in previous versions.
+
+  :pr:`21987`, :pr:`22064`, :pr:`22065` and :pr:`22288`
+  by :user:`Julien Jerphanion <jjerphan>`
+
 - |Enhancement| All scikit-learn models now generate a more informative
   error message when some input contains unexpected `NaN` or infinite values.
   In particular the message contains the input name ("X", "y" or
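
As a rough illustration of what a "reduction on pairwise distances" computes, the sketch below finds, for each row of X, the indices of its k nearest rows of Y, handling X one chunk at a time so that only one block of distances exists at once. It is a pure-Python stand-in written for this note, not the Cython code added by the commit: `argkmin_chunked`, `squared_euclidean` and their parameters are hypothetical names, and squared Euclidean distance is assumed as the metric.

```python
import heapq


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def argkmin_chunked(X, Y, k, chunk_size=256):
    """For each row of X, return the indices of its k nearest rows of Y.

    Hypothetical helper: X is processed in chunks of `chunk_size` rows, and
    each chunk's block of distances is reduced to k indices per row before
    the next chunk is touched.
    """
    if not 1 <= k <= len(Y):
        raise ValueError("k must be between 1 and len(Y)")
    result = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        # Block of distances for this chunk only: len(chunk) x len(Y).
        block = [[squared_euclidean(x, y) for y in Y] for x in chunk]
        for row in block:
            # Reduce the row to the indices of its k smallest distances,
            # closest first.
            nearest = heapq.nsmallest(k, range(len(Y)), key=row.__getitem__)
            result.append(nearest)
    return result


X = [[0.0, 0.0], [5.0, 5.0], [1.0, 0.0]]
Y = [[0.0, 1.0], [4.0, 4.0], [2.0, 0.0]]
neighbors = argkmin_chunked(X, Y, k=2, chunk_size=2)
print(neighbors)
```

The chunk size only changes how much intermediate data is held at a time; the returned indices are the same for any positive `chunk_size`.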

sklearn/_config.py

Lines changed: 63 additions & 2 deletions
@@ -9,6 +9,10 @@
     "working_memory": int(os.environ.get("SKLEARN_WORKING_MEMORY", 1024)),
     "print_changed_only": True,
     "display": "text",
+    "pairwise_dist_chunk_size": int(
+        os.environ.get("SKLEARN_PAIRWISE_DIST_CHUNK_SIZE", 256)
+    ),
+    "enable_cython_pairwise_dist": True,
 }
 _threadlocal = threading.local()

@@ -40,7 +44,12 @@ def get_config():


 def set_config(
-    assume_finite=None, working_memory=None, print_changed_only=None, display=None
+    assume_finite=None,
+    working_memory=None,
+    print_changed_only=None,
+    display=None,
+    pairwise_dist_chunk_size=None,
+    enable_cython_pairwise_dist=None,
 ):
     """Set global scikit-learn configuration
@@ -80,6 +89,26 @@ def set_config(

         .. versionadded:: 0.23

+    pairwise_dist_chunk_size : int, default=None
+        The number of row vectors per chunk for PairwiseDistancesReduction.
+        Default is 256 (suitable for most modern laptops' caches and architectures).
+
+        Intended for easier benchmarking and testing of scikit-learn internals.
+        End users are not expected to benefit from customizing this configuration
+        setting.
+
+        .. versionadded:: 1.1
+
+    enable_cython_pairwise_dist : bool, default=None
+        Use PairwiseDistancesReduction when possible.
+        Default is True.
+
+        Intended for easier benchmarking and testing of scikit-learn internals.
+        End users are not expected to benefit from customizing this configuration
+        setting.
+
+        .. versionadded:: 1.1
+
     See Also
     --------
     config_context : Context manager for global scikit-learn configuration.
@@ -95,11 +124,21 @@ def set_config(
         local_config["print_changed_only"] = print_changed_only
     if display is not None:
         local_config["display"] = display
+    if pairwise_dist_chunk_size is not None:
+        local_config["pairwise_dist_chunk_size"] = pairwise_dist_chunk_size
+    if enable_cython_pairwise_dist is not None:
+        local_config["enable_cython_pairwise_dist"] = enable_cython_pairwise_dist


 @contextmanager
 def config_context(
-    *, assume_finite=None, working_memory=None, print_changed_only=None, display=None
+    *,
+    assume_finite=None,
+    working_memory=None,
+    print_changed_only=None,
+    display=None,
+    pairwise_dist_chunk_size=None,
+    enable_cython_pairwise_dist=None,
 ):
     """Context manager for global scikit-learn configuration.
@@ -138,6 +177,26 @@ def config_context(

         .. versionadded:: 0.23

+    pairwise_dist_chunk_size : int, default=None
+        The number of vectors per chunk for PairwiseDistancesReduction.
+        Default is 256 (suitable for most modern laptops' caches and architectures).
+
+        Intended for easier benchmarking and testing of scikit-learn internals.
+        End users are not expected to benefit from customizing this configuration
+        setting.
+
+        .. versionadded:: 1.1
+
+    enable_cython_pairwise_dist : bool, default=None
+        Use PairwiseDistancesReduction when possible.
+        Default is True.
+
+        Intended for easier benchmarking and testing of scikit-learn internals.
+        End users are not expected to benefit from customizing this configuration
+        setting.
+
+        .. versionadded:: 1.1
+
     Yields
     ------
     None.
@@ -171,6 +230,8 @@ def config_context(
         working_memory=working_memory,
         print_changed_only=print_changed_only,
         display=display,
+        pairwise_dist_chunk_size=pairwise_dist_chunk_size,
+        enable_cython_pairwise_dist=enable_cython_pairwise_dist,
     )

     try:
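
The hunks above follow one pattern: a default read from an environment variable, a setter that ignores arguments left at `None`, and a context manager that restores the previous values on exit. The following standalone sketch reproduces that pattern for the two new settings only, using nothing but the standard library. It is a simplified model for illustration, not the contents of `sklearn/_config.py`; names not visible in the diff (such as `_global_config` and `_get_threadlocal_config`) are assumptions.

```python
import os
import threading
from contextlib import contextmanager

# Defaults; the chunk size can be overridden through an environment variable.
_global_config = {
    "pairwise_dist_chunk_size": int(
        os.environ.get("SKLEARN_PAIRWISE_DIST_CHUNK_SIZE", 256)
    ),
    "enable_cython_pairwise_dist": True,
}
_threadlocal = threading.local()


def _get_threadlocal_config():
    """Return this thread's settings, copying the defaults on first use."""
    if not hasattr(_threadlocal, "config"):
        _threadlocal.config = _global_config.copy()
    return _threadlocal.config


def get_config():
    """Return a copy so callers cannot mutate the live settings."""
    return _get_threadlocal_config().copy()


def set_config(pairwise_dist_chunk_size=None, enable_cython_pairwise_dist=None):
    """Update only the settings that were passed explicitly (not None)."""
    local_config = _get_threadlocal_config()
    if pairwise_dist_chunk_size is not None:
        local_config["pairwise_dist_chunk_size"] = pairwise_dist_chunk_size
    if enable_cython_pairwise_dist is not None:
        local_config["enable_cython_pairwise_dist"] = enable_cython_pairwise_dist


@contextmanager
def config_context(
    *, pairwise_dist_chunk_size=None, enable_cython_pairwise_dist=None
):
    """Apply settings inside a `with` block, then restore the previous ones."""
    old_config = get_config()
    set_config(
        pairwise_dist_chunk_size=pairwise_dist_chunk_size,
        enable_cython_pairwise_dist=enable_cython_pairwise_dist,
    )
    try:
        yield
    finally:
        set_config(**old_config)


before = get_config()["pairwise_dist_chunk_size"]
with config_context(pairwise_dist_chunk_size=64):
    inside = get_config()["pairwise_dist_chunk_size"]
after = get_config()["pairwise_dist_chunk_size"]
print(before, inside, after)
```

Using `None` as the "not provided" marker is what lets a caller change one setting without restating the others, which is why every new keyword in the diff defaults to `None` rather than to its effective default.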

sklearn/metrics/_dist_metrics.pxd

Lines changed: 21 additions & 0 deletions
@@ -64,3 +64,24 @@ cdef class DistanceMetric:
     cdef DTYPE_t _rdist_to_dist(self, DTYPE_t rdist) nogil except -1

     cdef DTYPE_t _dist_to_rdist(self, DTYPE_t dist) nogil except -1
+
+
+######################################################################
+# DatasetsPair base class
+cdef class DatasetsPair:
+    cdef DistanceMetric distance_metric
+
+    cdef ITYPE_t n_samples_X(self) nogil
+
+    cdef ITYPE_t n_samples_Y(self) nogil
+
+    cdef DTYPE_t dist(self, ITYPE_t i, ITYPE_t j) nogil
+
+    cdef DTYPE_t surrogate_dist(self, ITYPE_t i, ITYPE_t j) nogil
+
+
+cdef class DenseDenseDatasetsPair(DatasetsPair):
+    cdef:
+        const DTYPE_t[:, ::1] X
+        const DTYPE_t[:, ::1] Y
+        ITYPE_t d
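
The declarations above give only signatures, so here is a pure-Python analogue of the same interface to show how the four methods could fit together. It is a sketch under assumptions, not a translation of the Cython implementation: the Euclidean metric is hard-coded instead of delegating to a `DistanceMetric`, `d` is assumed to be the number of features, `surrogate_dist` is assumed to be a cheaper quantity that ranks pairs in the same order as `dist` (here the squared distance), and `DenseDenseEuclideanPair` is a hypothetical name.

```python
import math


class DatasetsPair:
    """Pure-Python analogue of the declared interface (illustrative only)."""

    def n_samples_X(self):
        raise NotImplementedError

    def n_samples_Y(self):
        raise NotImplementedError

    def dist(self, i, j):
        raise NotImplementedError

    def surrogate_dist(self, i, j):
        # Fall back to the true distance when no cheaper surrogate exists.
        return self.dist(i, j)


class DenseDenseEuclideanPair(DatasetsPair):
    """Hypothetical pair of dense datasets under the Euclidean metric."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.d = len(X[0])  # assumed meaning of `d`: number of features

    def n_samples_X(self):
        return len(self.X)

    def n_samples_Y(self):
        return len(self.Y)

    def surrogate_dist(self, i, j):
        # Squared Euclidean distance: no square root, same ordering.
        return sum((self.X[i][t] - self.Y[j][t]) ** 2 for t in range(self.d))

    def dist(self, i, j):
        return math.sqrt(self.surrogate_dist(i, j))


pair = DenseDenseEuclideanPair(
    X=[[0.0, 0.0], [3.0, 4.0]],
    Y=[[0.0, 0.0], [6.0, 8.0]],
)
# Ranking the rows of Y against X[0] gives the same order either way.
n_y = pair.n_samples_Y()
by_surrogate = sorted(range(n_y), key=lambda j: pair.surrogate_dist(0, j))
by_dist = sorted(range(n_y), key=lambda j: pair.dist(0, j))
print(by_surrogate == by_dist, pair.dist(1, 1))
```

Addressing samples by index pairs `(i, j)` keeps the reduction code independent of how X and Y are stored, which is presumably why the base class exposes only sample counts and the two distance accessors.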
