8000 [MRG+1] Parallel radius neighbors by recamshak · Pull Request #10887 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] Parallel radius neighbors #10887

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 16, 2018

Conversation

recamshak
Copy link
Contributor
@recamshak recamshak commented Mar 28, 2018

Reference Issues/PRs

Fixes #8003

What does this implement/fix? Explain your changes.

This makes RadiusNeighborsMixin.radius_neighbors honor the n_jobs argument and split the queries among processors. This also makes query_radius GIL-free so that it is actually faster than single thread.

TODO:

  • fix memory leak by making the indices and distances array own the data
  • fix Windows 32bits issue
  • write tests
  • run some benchmark

Any other comments?

@jnothman
Copy link
Member
jnothman commented Mar 28, 2018 via email

@recamshak
Copy link
Contributor Author

I haven't done a proper benchmark yet but I saw much better runtime on my laptop. I'll do a benchmark today and post the results here.

@recamshak
Copy link
Contributor Author

I ran 10 times the following benchmark on a Google Cloud instance with 64 vCPUs:

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.externals.joblib import cpu_count
import time

d = make_blobs(100000, 100)[0]
nn = NearestNeighbors().fit(d)

for n_jobs in range(1, cpu_count() + 1):
    nn.n_jobs = n_jobs
    start = time.time()
    nn.radius_neighbors()
    end = time.time()
    print('{},{}'.format(n_jobs, end - start))

Although the scaling is not linear it definitely runs faster:

image

@jnothman
Copy link
Member
jnothman commented Mar 29, 2018 via email

@recamshak
Copy link
Contributor Author

One thing that was not obvious to me but made a huge difference was that in _query_radius_single doing this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        with gil:
+                            raise ValueError("Fatal: count out of range. "
+                                             "This should never happen.")

instead of this:

-                        raise ValueError("Fatal: count out of range. "
-                                         "This should never happen.")
+                        return -1

have a completely different scaling behavior.

In both case the function is nogil and I expected that in the first case the GIL would be acquired only if the with gil: statement is reached. But as explained here any function with a with gil: statement will have to acquire the GIL on return, regardless of whether with gil: was reached or not.

@recamshak recamshak changed the title [WIP] Parallel radius neighbors [MRG] Parallel radius neighbors Apr 2, 2018
@recamshak
Copy link
Contributor Author

@jnothman Thank you for your comments. I think this is ready for review.

@jnothman
Copy link
Member
jnothman commented Apr 2, 2018 via email

@TomDLT
Copy link
Member
TomDLT commented Apr 5, 2018

Nice work ! The code seems reasonable, though I am a novice in C memory allocation.

As a general comment, your code could benefit from more comments, especially near non-standard function like np.PyArray_SimpleNewFromData, np.PyArray_UpdateFlags, or memcpy.

For instance, you could add comments with the equivalent python expressions:
# equivalent to: distances[i] = np_dist_arr[:counts[i]].copy()

@recamshak
Copy link
Contributor Author

@TomDLT Thank you for your comment. I added some comments as you suggested.

Copy link
Member
@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Disclaimer: I am a novice in C memory allocation.

Could you add an entry in doc/whats_new/v0.20.rst?

@TomDLT TomDLT changed the title [MRG] Parallel radius neighbors [MRG+1] Parallel radius neighbors Apr 13, 2018
@recamshak
Copy link
Contributor Author

@TomDLT thank you for the review. I updated the release notes.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work, thanks

@jnothman
Copy link
Member

Merge when green

@jnothman jnothman merged commit 4335199 into scikit-learn:master Apr 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DBSCAN seems not to use multiple processors (n_jobs argument ignored)
3 participants
0