PERF Implement `PairwiseDistancesReduction` backend for `KNeighbors.predict_proba` by Micky774 · Pull Request #24076 · scikit-learn/scikit-learn · GitHub
PERF Implement PairwiseDistancesReduction backend for KNeighbors.predict_proba #24076

Merged · 73 commits · Mar 14, 2023
Commits (73)
8a99217
Partial implementation, still broken
Micky774 Jul 29, 2022
c071338
Major update, mostly working
Micky774 Jul 31, 2022
cbe5347
Merge branch 'main' into pwd_kncp
Micky774 Aug 1, 2022
a995220
Completed X-parallel implementation
Micky774 Aug 1, 2022
052e59b
Improved documentation
Micky774 Aug 1, 2022
fc8d495
First batch of review feedback
Micky774 Aug 4, 2022
6334581
Created inline helper function for weighted mode
Micky774 Aug 4, 2022
6f45f79
Merge branch 'main' into pwd_kncp
Micky774 Aug 4, 2022
db31c78
Multioutput support with probabilities output refactor
Micky774 Aug 5, 2022
b96cab4
Merge branch 'main' into pwd_kncp
Micky774 Aug 5, 2022
59f60ed
Code simplification and cleanup
Micky774 Aug 5, 2022
a343b58
Code simplification
Micky774 Aug 9, 2022
e419f96
Merge branch 'main' into pwd_kncp
Micky774 Aug 9, 2022
c01d37f
Update sklearn/metrics/_pairwise_distances_reduction/_argkminlabels.pyx
Micky774 Aug 9, 2022
2b128ea
Merge branch 'main' into pwd_kncp
Micky774 Aug 10, 2022
efaa053
Merge branch 'pwd_kncp' of https://github.com/Micky774/scikit-learn i…
Micky774 Aug 10, 2022
7aff02b
Fixed implementation
Micky774 Aug 10, 2022
91a5b25
Removed extraneous extern statements
Micky774 Aug 10, 2022
2fdbd35
Merge branch 'main' into pwd_kncp
Micky774 Aug 11, 2022
941d42c
Updated to adopt templating for float32
Micky774 Aug 11, 2022
57ffa8b
Removed now-generated file
Micky774 Aug 11, 2022
883152e
Merge branch 'main' into pwd_kncp
Micky774 Aug 24, 2022
003e1bf
Merge branch 'main' into pwd_kncp
Micky774 Aug 24, 2022
0354d73
Merge branch 'main' into pwd_kncp
Micky774 Oct 4, 2022
6070a3d
Merge branch 'main' into pwd_kncp
jjerphan Nov 16, 2022
c5ce440
Merge branch 'main' into pwd_kncp
Micky774 Dec 7, 2022
7b6eae7
Added Euclidean specialization
Micky774 Dec 7, 2022
8dcd3c9
Updated typing for `labels` and associated variables
Micky774 Dec 8, 2022
3a2564a
Merge branch 'main' into pwd_kncp
Micky774 Dec 8, 2022
5717e34
Merge branch 'main' into pwd_kncp
jjerphan Dec 8, 2022
f556084
Reduced labels/indices from int64 to int32
Micky774 Dec 8, 2022
9c35e6e
Merge branch 'pwd_kncp' of https://github.com/Micky774/scikit-learn i…
Micky774 Dec 8, 2022
73012f4
Merge branch 'main' into pwd_kncp
Micky774 Dec 8, 2022
f47bab2
Revert "Reduced labels/indices from int64 to int32"
Micky774 Dec 9, 2022
9c2b7b4
Updated metric resolution and fixed dtype
Micky774 Dec 11, 2022
7a8fc6c
Corrected dtype of probability array
Micky774 Dec 14, 2022
e20146d
Apply suggestions from code review
Micky774 Jan 29, 2023
0053299
Updated with remaining feedback
Micky774 Jan 29, 2023
3fc5733
Reverted for multi-output
Micky774 Jan 29, 2023
47a5e43
Changed `is_usable_for` to disable in euclidean case
Micky774 Jan 29, 2023
dcdeb4e
Merge branch 'main' into pwd_kncp
Micky774 Jan 29, 2023
67ce048
TST Improve implementations and test coverage
jjerphan Feb 2, 2023
08b79d8
Merge branch 'main' into pwd_kncp
Micky774 Feb 3, 2023
bb17a45
Updated with feedback
Micky774 Feb 4, 2023
4d41a55
Merge pull request #9 from jjerphan/pwd_kncp
Micky774 Feb 4, 2023
ac0bb5d
Updated with `unique_labels` argument
Micky774 Feb 4, 2023
df7f8c1
Corrected `labels` dtype in private test on `ArgKminLabels` directly
Micky774 Feb 5, 2023
446e156
Moved casting to `compute` method
Micky774 Feb 6, 2023
e5c9890
Rename ArgKminLabels to ArgKminClassMode
jjerphan Feb 8, 2023
52452d0
Use self._fit_method instead of self.algorithm
jjerphan Feb 8, 2023
caf8504
Remove the old comment
jjerphan Feb 8, 2023
f321fe5
Merge pull request #10 from jjerphan/pwd_kncp
Micky774 Feb 8, 2023
8a4b5a3
Fix typo in module import
ogrisel Feb 8, 2023
e48a5a4
FIX need to call check_is_fitted in the predict method
ogrisel Feb 8, 2023
7536888
Implemented feedback and removed extraneous label mapping
Micky774 Feb 8, 2023
bb5a09b
Removed secondary validation
Micky774 Feb 8, 2023
15bf712
Added changelog
Micky774 Feb 9, 2023
95239d9
Merge branch 'main' into pwd_kncp
Micky774 Feb 13, 2023
025116a
Updated variable names per feedback
Micky774 Feb 13, 2023
1d40084
Merge branch 'main' into pwd_kncp
Micky774 Feb 24, 2023
9de740b
Removed extraneous sort
Micky774 Feb 24, 2023
11b3106
Update doc/whats_new/v1.3.rst
Micky774 Feb 24, 2023
3bc9e2b
Fixed inconsistency between strategies
Micky774 Feb 26, 2023
ca438ee
Altered strategy in response to benchmarks
Micky774 Feb 26, 2023
8f4b371
Fixed when validation occurs, avoiding accidental double validation
Micky774 Feb 27, 2023
a1370fd
Merge branch 'main' into pwd_kncp
Micky774 Feb 27, 2023
6551d86
Apply suggestions from code review
Micky774 Mar 10, 2023
f52468d
Updated with review feedback
Micky774 Mar 10, 2023
e403681
Update sklearn/neighbors/_classification.py
Micky774 Mar 10, 2023
25d142f
Updated .gitignore
Micky774 Mar 10, 2023
1c0d1e5
Merge branch 'main' into pwd_kncp
Micky774 Mar 12, 2023
6f4433d
Updated to reconcile new name
Micky774 Mar 12, 2023
d13793d
Qualify cdef methods with "noexcept nogil"
jjerphan Mar 13, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -90,6 +90,7 @@ sklearn/metrics/_dist_metrics.pyx
sklearn/metrics/_dist_metrics.pxd
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx
sklearn/metrics/_pairwise_distances_reduction/_argkmin_classmode.pyx
sklearn/metrics/_pairwise_distances_reduction/_base.pxd
sklearn/metrics/_pairwise_distances_reduction/_base.pyx
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
5 changes: 5 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -292,6 +292,11 @@ Changelog
dissimilarity is not a metric and cannot be supported by the BallTree.
:pr:`25417` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Enhancement| The performance of :meth:`neighbors.KNeighborsClassifier.predict`
  and of :meth:`neighbors.KNeighborsClassifier.predict_proba` has been improved
  when `n_neighbors` is large and `algorithm="brute"` with non-Euclidean metrics.
  :pr:`24076` by :user:`Meekail Zain <micky774>` and :user:`Julien Jerphanion <jjerphan>`.

:mod:`sklearn.neural_network`
.............................

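
For context on the changelog entry above, here is a minimal usage sketch of the configuration this PR accelerates (the dataset shapes, metric, and n_neighbors are illustrative only): a brute-force KNeighborsClassifier with a non-Euclidean metric, whose predict and predict_proba dispatch to the new ArgKminClassMode backend when it is usable.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(10_000, 50)
y_train = rng.randint(0, 3, size=10_000)
X_test = rng.rand(1_000, 50)

# Brute force + non-Euclidean metric + large n_neighbors is the code path
# targeted by the new backend.
clf = KNeighborsClassifier(
    n_neighbors=100, algorithm="brute", metric="manhattan", weights="distance"
)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # fused argkmin + class-mode reduction
assert proba.shape == (1_000, 3)
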
1 change: 1 addition & 0 deletions setup.cfg
@@ -90,6 +90,7 @@ ignore =
sklearn/metrics/_dist_metrics.pxd
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx
sklearn/metrics/_pairwise_distances_reduction/_argkmin_classmode.pyx
sklearn/metrics/_pairwise_distances_reduction/_base.pxd
sklearn/metrics/_pairwise_distances_reduction/_base.pyx
sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd
6 changes: 6 additions & 0 deletions setup.py
@@ -277,6 +277,12 @@ def check_package_status(package, min_version):
"include_np": True,
"extra_compile_args": ["-std=c++11"],
},
{
"sources": ["_argkmin_classmode.pyx.tp"],
"language": "c++",
"include_np": True,
"extra_compile_args": ["-std=c++11"],
},
{
"sources": ["_radius_neighbors.pyx.tp", "_radius_neighbors.pxd.tp"],
"language": "c++",
2 changes: 2 additions & 0 deletions sklearn/metrics/_pairwise_distances_reduction/__init__.py
@@ -90,12 +90,14 @@
BaseDistancesReductionDispatcher,
ArgKmin,
RadiusNeighbors,
ArgKminClassMode,
sqeuclidean_row_norms,
)

__all__ = [
"BaseDistancesReductionDispatcher",
"ArgKmin",
"RadiusNeighbors",
"ArgKminClassMode",
"sqeuclidean_row_norms",
]
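
The re-export above makes the new dispatcher importable alongside ArgKmin and RadiusNeighbors. As a rough sketch against this private API (subject to change), and assuming ArgKminClassMode follows the same is_usable_for(X, Y, metric) convention as the other dispatchers, one can probe whether the backend would be used for a given problem:

import numpy as np
from sklearn.metrics._pairwise_distances_reduction import ArgKminClassMode

rng = np.random.RandomState(0)
X = rng.rand(1_000, 20)  # query set (float64, C-contiguous)
Y = rng.rand(5_000, 20)  # training set

# Expected to be True for a supported non-Euclidean metric; per the commit
# "Changed `is_usable_for` to disable in euclidean case", it returns False for
# Euclidean distances, and the classifier then falls back to the previous path.
print(ArgKminClassMode.is_usable_for(X, Y, metric="manhattan"))
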
sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx.tp
@@ -178,7 +178,6 @@ cdef class ArgKmin{{name_suffix}}(BaseDistancesReduction{{name_suffix}}):
self.heaps_r_distances_chunks[thread_num] = &self.argkmin_distances[X_start, 0]
self.heaps_indices_chunks[thread_num] = &self.argkmin_indices[X_start, 0]

@final
cdef void _parallel_on_X_prange_iter_finalize(
self,
ITYPE_t thread_num,
sklearn/metrics/_pairwise_distances_reduction/_argkmin_classmode.pyx.tp (new file)
@@ -0,0 +1,208 @@
{{py:

implementation_specific_values = [
# Values are the following ones:
#
# name_suffix, INPUT_DTYPE_t, INPUT_DTYPE
#
# We also use the float64 dtype and C-type names as defined in
# `sklearn.utils._typedefs` to maintain consistency.
#
('64', 'DTYPE_t', 'DTYPE'),
('32', 'cnp.float32_t', 'np.float32')
]

}}

from cython cimport floating, integral
from cython.parallel cimport parallel, prange
from libcpp.map cimport map as cpp_map, pair as cpp_pair
from libc.stdlib cimport free

cimport numpy as cnp

cnp.import_array()

from ...utils._typedefs cimport ITYPE_t, DTYPE_t
from ...utils._typedefs import ITYPE, DTYPE
import numpy as np
from scipy.sparse import issparse
from sklearn.utils.fixes import threadpool_limits

cpdef enum WeightingStrategy:
uniform = 0
# TODO: Implement the following options, most likely in
# `weighted_histogram_mode`
distance = 1
callable = 2

{{for name_suffix, INPUT_DTYPE_t, INPUT_DTYPE in implementation_specific_values}}
from ._argkmin cimport ArgKmin{{name_suffix}}
from ._datasets_pair cimport DatasetsPair{{name_suffix}}

cdef class ArgKminClassMode{{name_suffix}}(ArgKmin{{name_suffix}}):
"""
{{name_suffix}}bit implementation of ArgKminClassMode.
"""
cdef:
const ITYPE_t[:] class_membership,
const ITYPE_t[:] unique_labels
DTYPE_t[:, :] class_scores
cpp_map[ITYPE_t, ITYPE_t] labels_to_index
WeightingStrategy weight_type

@classmethod
def compute(
cls,
X,
Y,
ITYPE_t k,
weights,
class_membership,
unique_labels,
str metric="euclidean",
chunk_size=None,
dict metric_kwargs=None,
str strategy=None,
):
"""Compute the argkmin reduction with class_membership.

        This classmethod is responsible for introspecting the argument values to
        dispatch to the most appropriate implementation of
        :class:`ArgKminClassMode{{name_suffix}}`.

        This allows entirely decoupling the API from the implementation details
        whilst maintaining RAII: all temporarily allocated data structures
        necessary for the concrete implementation are therefore freed when this
        classmethod returns.

        No instance should directly be created outside of this class method.
"""
# Use a generic implementation that handles most scipy
# metrics by computing the distances between 2 vectors at a time.
pda = ArgKminClassMode{{name_suffix}}(
datasets_pair=DatasetsPair{{name_suffix}}.get_for(X, Y, metric, metric_kwargs),
k=k,
chunk_size=chunk_size,
strategy=strategy,
weights=weights,
class_membership=class_membership,
unique_labels=unique_labels,
)

# Limit the number of threads in second level of nested parallelism for BLAS
# to avoid threads over-subscription (in GEMM for instance).
with threadpool_limits(limits=1, user_api="blas"):
if pda.execute_in_parallel_on_Y:
pda._parallel_on_Y()
else:
pda._parallel_on_X()

return pda._finalize_results()

def __init__(
self,
DatasetsPair{{name_suffix}} datasets_pair,
const ITYPE_t[:] class_membership,
const ITYPE_t[:] unique_labels,
chunk_size=None,
strategy=None,
ITYPE_t k=1,
weights=None,
):
super().__init__(
datasets_pair=datasets_pair,
chunk_size=chunk_size,
strategy=strategy,
k=k,
)

if weights == "uniform":
self.weight_type = WeightingStrategy.uniform
elif weights == "distance":
self.weight_type = WeightingStrategy.distance
else:
self.weight_type = WeightingStrategy.callable
self.class_membership = class_membership

self.unique_labels = unique_labels

        # Buffer storing the (possibly weighted) class votes of each query
        # sample, used to build the histogram for the one-pass weighted mode.
        # Its columns follow the order of `unique_labels`.
self.class_scores = np.zeros(
(self.n_samples_X, unique_labels.shape[0]), dtype=DTYPE,
)

def _finalize_results(self):
probabilities = np.asarray(self.class_scores)
probabilities /= probabilities.sum(axis=1, keepdims=True)
return probabilities

cdef inline void weighted_histogram_mode(
self,
ITYPE_t sample_index,
ITYPE_t* indices,
DTYPE_t* distances,
) noexcept nogil:
cdef:
            ITYPE_t neighbor_rank, neighbor_idx, neighbor_class_idx
DTYPE_t score_incr = 1
# TODO: Implement other WeightingStrategy values
bint use_distance_weighting = (
self.weight_type == WeightingStrategy.distance
)

        # Iterate through the sample's k nearest neighbours.
        for neighbor_rank in range(self.k):
            # Absolute index of the neighbor_rank-th nearest neighbour,
            # in the range [0, n_samples_Y).
            # TODO: inspect whether it is worth permuting this condition
            # and the for-loop above for improved branching.
if use_distance_weighting:
score_incr = 1 / distances[neighbor_rank]
neighbor_idx = indices[neighbor_rank]
neighbor_class_idx = self.class_membership[neighbor_idx]
self.class_scores[sample_index][neighbor_class_idx] += score_incr
return

cdef void _parallel_on_X_prange_iter_finalize(
self,
ITYPE_t thread_num,
ITYPE_t X_start,
ITYPE_t X_end,
) noexcept nogil:
cdef:
ITYPE_t idx, sample_index
for idx in range(X_end - X_start):
# One-pass top-one weighted mode
# Compute the absolute index in [0, n_samples_X)
sample_index = X_start + idx
self.weighted_histogram_mode(
sample_index,
&self.heaps_indices_chunks[thread_num][idx * self.k],
&self.heaps_r_distances_chunks[thread_num][idx * self.k],
)
return

cdef void _parallel_on_Y_finalize(
self,
) noexcept nogil:
cdef:
ITYPE_t sample_index, thread_num

with nogil, parallel(num_threads=self.chunks_n_threads):
# Deallocating temporary datastructures
for thread_num in prange(self.chunks_n_threads, schedule='static'):
free(self.heaps_r_distances_chunks[thread_num])
free(self.heaps_indices_chunks[thread_num])

for sample_index in prange(self.n_samples_X, schedule='static'):
self.weighted_histogram_mode(
sample_index,
&self.argkmin_indices[sample_index][0],
&self.argkmin_distances[sample_index][0],
)
return

{{endfor}}
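
To make the reduction above easier to follow, here is a rough pure-NumPy sketch (illustrative names and shapes, not part of the PR) of what weighted_histogram_mode followed by _finalize_results computes for each query sample, given the k-nearest-neighbour indices and distances already produced by the parent ArgKmin reduction. The Cython version fuses this with the chunked distance computations and runs it in parallel instead of materialising the argkmin result first.

import numpy as np

def argkmin_class_mode_reference(neigh_indices, neigh_distances,
                                 class_membership, n_classes, weights="uniform"):
    # neigh_indices   : (n_queries, k) indices into the training set
    # neigh_distances : (n_queries, k) corresponding distances
    # class_membership: (n_train,) integer class index of each training sample
    n_queries, k = neigh_indices.shape
    class_scores = np.zeros((n_queries, n_classes), dtype=np.float64)
    for sample_index in range(n_queries):
        for neighbor_rank in range(k):
            # Uniform voting adds 1 per neighbour; distance weighting adds 1/d,
            # mirroring `score_incr` in `weighted_histogram_mode`.
            score_incr = 1.0
            if weights == "distance":
                score_incr = 1.0 / neigh_distances[sample_index, neighbor_rank]
            neighbor_idx = neigh_indices[sample_index, neighbor_rank]
            class_scores[sample_index, class_membership[neighbor_idx]] += score_incr
    # `_finalize_results`: normalise the accumulated votes into probabilities.
    return class_scores / class_scores.sum(axis=1, keepdims=True)

The columns of the returned array follow the order of `unique_labels`, matching the `class_scores` buffer allocated in `__init__`.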