-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Updated dbscan(): added documentation #8039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updated dbscan() documentation about missing functionality in n_jobs parameter.
Fixed trailing whitespace at line 77.
Fixed whitespace again.
Fixed whitespace at line 100.
nearest neighbor is parallelised in the brute case. |
Added "Setting algorithm="brute" uses multiple cores, but may cause a slow down instead." at line 103.
@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', | |||
|
|||
Note about n_jobs: this does not seem to use multiple processors. Calls | |||
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour | |||
has not yet been parrallelized. | |||
has not yet been parrallelized. Setting algorithm="brute" uses multiple | |||
cores, but may cause a slow down instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all parallelism may cause a slow down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed wording to reflect this: changed "may" to "will", removed "instead".
@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', | |||
|
|||
Note about n_jobs: this does not seem to use multiple processors. Calls | |||
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour | |||
has not yet been parrallelized. | |||
has not yet been parrallelized. Setting algorithm="brute" uses multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only think we ever intended to parallelise the nearest neighbors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you mean: "I don't think we ever intended to parallelize the nearest neighbors"?
Removed "yet".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I meant what I said: I don't think we had any intentions to parallelise dbscan beyond its nn queries.
Updated comments in response to latest feedback.
Note about n_jobs: this does not seem to use multiple processors. Calls | ||
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour | ||
has not been parrallelized. Setting algorithm="brute" uses multiple | ||
cores, but will cause a slow down. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of :
Note about ``n_jobs``: when ``algorithm="brute"``, the ``n_jobs`` parameter is used to compute a brute force multiple processors approach which can cause a slow down. For a faster computation we recommend to use other NearestNeighbors algorithms by choosing another value for the parameter ``algorithm``. Nevertheless, these NearestNeighbors methods are not parrellelized in scikit-learn and will not use the ``n_jobs`` parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for your feedback. Yup, sounds good, I've made the changes.
@@ -74,6 +74,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', | |||
n_jobs : int, optional (default = 1) | |||
The number of parallel jobs to run for neighbors search. | |||
If ``-1``, then the number of jobs is set to the number of CPU cores. | |||
Currently may not work as expected; might not use multiple processors. | |||
See notes below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to directly add the note information here.
Thanks @Don86. I've done some suggestions, tell me what you think ? |
Rewrite in response to feedback.
@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', | |||
n_jobs : int, optional (default = 1) | |||
The number of parallel jobs to run for neighbors search. | |||
If ``-1``, then the number of jobs is set to the number of CPU cores. | |||
Currently may not work as expected. When ``algorithm="brute"``, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "Parallelism is only currently used when algorithm="brute", though other algorithms may remain faster."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for your input. I tinkered with the comments a bit to assimilate your feedback, but still left the whole explanation there so that users will have a better idea of what's going on.
Updated in response to feedback Fixed whitespace
Changed line 151 to "Returns 1 for inliers and -1 for anomalies/outliers." to match main description; peviously mismatched.
@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', | |||
n_jobs : int, optional (default = 1) | |||
The number of parallel jobs to run for neighbors search. | |||
If ``-1``, then the number of jobs is set to the number of CPU cores. | |||
Parrallelism currently only used when ``algorithm="brute"``, where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not finding this clear and succinct enough, which is why I suggested a specific wording.
@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None): | |||
Returns | |||
------- | |||
is_inlier : array, shape (n_samples,) | |||
Returns 1 for anomalies/outliers and -1 for inliers. | |||
Returns 1 for inliers and -1 for anomalies/outliers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't belong here
Rolled back some changes which don't belong in this branch.
@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None): | |||
Returns | |||
------- | |||
is_inlier : array, shape (n_samples,) | |||
Returns 1 for inliers and -1 for anomalies/outliers. | |||
Returns 1 for anomalies/outliers and -1 for inliers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it related to dbscan ? If it is not, can you open another PR.
BTW, it seems that this confusion is done in a lot of places in the LOF file (ex : l138 ).
Can you check the entire file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the my mistake.
I had a look to the diff not the PR.
@jnothman is this still relevant? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really relevant anymore. It is now only ineffective when metric='precomputed'
Issue #8003 DBSCAN seems not to use multiple processors (n_jobs argument ignored).
Added documentation in dbscan() to alert users to missing functionality, and some information about why this functionality isn't always there.
Updated dbscan() documentation about missing functionality in n_jobs parameter.