10000 knn predict unreasonably slow b/c of use of scipy.stats.mode · Issue #13783 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

knn predict unreasonably slow b/c of use of scipy.stats.mode #13783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amueller opened this issue May 3, 2019 · 11 comments · Fixed by #24076
Closed

knn predict unreasonably slow b/c of use of scipy.stats.mode #13783

amueller opened this issue May 3, 2019 · 11 comments · Fixed by #24076

Comments

@amueller
Copy link
Member
amueller commented May 3, 2019
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()

xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time                                          
yy = np.linspace(y_min, y_max, 100)                                          

X1, X2 = np.meshgrid(xx, yy)                                                  
X_grid = np.c_[X1.ravel(), X2.ravel()]                                        
decision_values = knn.predict(X_grid)

spends all it's time in unique within stats.mode, not within the distance calculation. mode runs unique for every row.
I'm pretty sure we can replace the call to mode by some call to making a csr matrix and then argmax.

How much is it worth optimizing this? I feel KNN should be fast in low dimensions and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me ;)

@aditya1702
Copy link
Contributor

@amueller Do you mean something like this

max(top_k, key = list(top_k).count)

@jnothman
Copy link
Member
jnothman commented May 4, 2019 via email

@amueller
Copy link
Member Author

Pretty sure doing a CSR construction would speed it up by several orders of magnitude.

@aditya1702
Copy link
Contributor

@amueller @jnothman Is this something I can try?

@jnothman
Copy link
Member
jnothman commented May 27, 2019 via email

@aditya1702
Copy link
Contributor

Cool! Will try it out

@rth
Copy link
Member
rth commented Aug 1, 2019

Proposed a WIP solution in #14543

@jnothman
Copy link
Member
jnothman commented Aug 2, 2019

At #9597 (comment), @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.

@rth
Copy link
Member
rth commented Aug 6, 2019

At #9597 (comment), @TomDLT pointed out that argmax of predict_proba is faster than the current predict implementation. Any proposal here should compare to using that approach (not yet implemented there) and avoiding mode altogether.

Yes, I'm happy with #9597 and using the argmax as well. Will try to make some benchmarks.

@PaleNeutron
Copy link
PaleNeutron commented Aug 26, 2020

I found that predict_proba's speed is acceptable (at least 10 times faster), maybe we could use a custom function instead before official fix.

def predict(knn, X):
    pro_y = knn.predict_proba(X)
    y = np.argmax(pro_y, axis=1)
    return y

@rth
Copy link
Member
rth commented Aug 26, 2020

Yes, I need to finish #14543 to fix it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
0