knn predict unreasonably slow b/c of use of scipy.stats.mode #13783

amueller · 2019-05-03T21:17:21Z

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()

xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time
yy = np.linspace(y_min, y_max, 100)

X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = knn.predict(X_grid)

predict spends all its time in unique within stats.mode, not in the distance calculation, because mode runs unique for every row.
I'm pretty sure we can replace the call to mode by building a CSR matrix and then taking the argmax.
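A minimal sketch of the CSR idea, not the scikit-learn implementation: it assumes the neighbor labels are already encoded as integers in `[0, n_classes)` and stored in an `(n_queries, k)` array. The names `mode_via_csr` and `neigh_labels` are hypothetical. Each neighbor casts one vote at `(query, label)`, and building the CSR matrix sums duplicate entries into per-class counts.

```python
import numpy as np
from scipy import sparse


def mode_via_csr(neigh_labels, n_classes):
    """Row-wise most frequent label of an (n_queries, k) integer array.

    Hypothetical helper: labels are assumed to lie in [0, n_classes).
    """
    n_queries, k = neigh_labels.shape
    # One vote per neighbor, placed at (query index, label).
    rows = np.repeat(np.arange(n_queries), k)
    data = np.ones(n_queries * k, dtype=np.intp)
    # Duplicate (row, col) entries are summed when the CSR matrix is built,
    # so each row ends up holding the vote count per class.
    votes = sparse.csr_matrix(
        (data, (rows, neigh_labels.ravel())), shape=(n_queries, n_classes)
    )
    # The count matrix is only n_queries x n_classes, so densifying is cheap.
    # argmax returns the smallest label on ties.
    return votes.toarray().argmax(axis=1)


neigh_labels = np.array([[0, 1, 1], [2, 2, 0], [1, 0, 0]])
predicted = mode_via_csr(neigh_labels, n_classes=3)
```

There is no per-row call to unique here; the counting happens in a single vectorized construction.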

How much is it worth optimizing this? I feel KNN should be fast in low dimensions and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me ;)

aditya1702 · 2019-05-04T01:20:47Z

@amueller Do you mean something like this?

max(top_k, key = list(top_k).count)

jnothman · 2019-05-04T08:22:17Z

That isn't going to apply to every row, and involves n_classes passes over each. Basically because we know the set of class labels, we shouldn't need to be doing unique. Yes, we could construct a CSR sparse matrix and sum_duplicates and get the argmax. Or we could just run bincount and argmax for each row. The question is if it speeds things up enough to be worth the code. It might also be possible to use np.add.at to effectively do bincount in 2d...?
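The two NumPy-only alternatives mentioned here can be sketched as follows. This is an illustration under the assumption that labels are integers in `[0, n_classes)`; the function names are hypothetical and neither is taken from scikit-learn.

```python
import numpy as np


def mode_bincount_rows(neigh_labels, n_classes):
    """Run bincount + argmax once per row (Python-level loop over rows)."""
    return np.array(
        [np.bincount(row, minlength=n_classes).argmax() for row in neigh_labels]
    )


def mode_add_at(neigh_labels, n_classes):
    """Use np.add.at as a 2-D bincount: one unbuffered scatter-add."""
    n_queries, _ = neigh_labels.shape
    counts = np.zeros((n_queries, n_classes), dtype=np.intp)
    # rows has shape (n_queries, 1) and broadcasts against (n_queries, k).
    rows = np.arange(n_queries)[:, None]
    # add.at accumulates repeated indices, unlike counts[rows, labels] += 1.
    np.add.at(counts, (rows, neigh_labels), 1)
    return counts.argmax(axis=1)


neigh_labels = np.array([[0, 1, 1], [2, 2, 0], [1, 0, 0]])
by_rows = mode_bincount_rows(neigh_labels, n_classes=3)
by_add_at = mode_add_at(neigh_labels, n_classes=3)
```

Both variants rely on knowing the set of class labels up front, which is what makes unique unnecessary. Which one is faster would need to be measured.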

amueller · 2019-05-21T16:57:22Z

Pretty sure doing a CSR construction would speed it up by several orders of magnitude.

aditya1702 · 2019-05-27T13:54:51Z

@amueller @jnothman Is this something I can try?

jnothman · 2019-05-27T14:46:57Z

You're welcome to submit a pull request!

aditya1702 · 2019-05-28T13:04:02Z

Cool! Will try it out

rth · 2019-08-01T16:26:25Z

Proposed a WIP solution in #14543

jnothman · 2019-08-02T04:47:50Z

At #9597 (comment), @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.

rth · 2019-08-06T12:40:55Z

> At #9597 (comment), @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.

Yes, I'm happy with #9597 and using the argmax as well. Will try to make some benchmarks.
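A benchmark of the kind discussed could look roughly like the sketch below. The query set, its size, and the helper name `predict_via_proba` are assumptions made for illustration; the timings depend on the installed scikit-learn and SciPy versions, so no numbers are claimed here.

```python
import timeit

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Same toy model as in the original report.
X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm="kd_tree").fit(X, y)

# Hypothetical query set: random points inside the bounding box of X.
rng = np.random.RandomState(0)
X_query = rng.uniform(X.min(axis=0), X.max(axis=0), size=(20000, 2))


def predict_via_proba(model, X):
    """Predict by taking the argmax of predict_proba, mapped to class labels."""
    proba = model.predict_proba(X)
    return model.classes_[np.argmax(proba, axis=1)]


t_predict = timeit.timeit(lambda: knn.predict(X_query), number=3)
t_proba = timeit.timeit(lambda: predict_via_proba(knn, X_query), number=3)
print(f"predict: {t_predict:.3f}s, argmax(predict_proba): {t_proba:.3f}s")

# Sanity check that both paths agree before comparing their speed.
same = np.array_equal(knn.predict(X_query), predict_via_proba(knn, X_query))
```

With two classes and the default of five neighbors there are no ties, so the two paths should return identical labels.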

PaleNeutron · 2020-08-26T05:13:36Z

I found that predict_proba's speed is acceptable (at least 10 times faster than predict). Until there is an official fix, maybe we could use a custom function like this instead:

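import numpy as np

def predict(knn, X):
    # Workaround: pick the most probable class instead of calling knn.predict.
    pro_y = knn.predict_proba(X)
    # argmax yields column indices; map them to labels via classes_ (single-output case).
    y = knn.classes_[np.argmax(pro_y, axis=1)]
    return y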

rth · 2020-08-26T06:41:15Z

Yes, I need to finish #14543 to fix it

amueller added Enhancement help wanted Performance labels May 21, 2019

rth mentioned this issue Aug 1, 2019

WIP PERF Faster KNeighborsClassifier.predict #14543

Closed

2 tasks

webber26232 mentioned this issue Aug 7, 2019

[MRG+1] Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier #9597

Merged

cmarmo removed the help wanted label Aug 23, 2020

cmarmo added the module:neighbors label Feb 25, 2022

thomasjpfan moved this to Todo📬 in Quansight's scikit-learn Project Board Jun 1, 2022

thomasjpfan added this to Quansight's scikit-learn Project Board Jun 1, 2022

Micky774 mentioned this issue Jun 21, 2022

ENH Improve performance of KNeighborsClassifier.predict #23721

Closed

1 task

Micky774 moved this from Todo📬 to In Progress🏗 in Quansight's scikit-learn Project Board Jun 21, 2022

Micky774 mentioned this issue Aug 1, 2022

PERF Implement PairwiseDistancesReduction backend for KNeighbors.predict_proba #24076

Merged

4 tasks

ogrisel closed this as completed in #24076 Mar 14, 2023

github-project-automation bot moved this from In Progress🏗 to Done🚀 in Quansight's scikit-learn Project Board Mar 14, 2023
