[MRG] Updated dbscan(): added documentation #8039
Don86 wants to merge 10 commits into scikit-learn:master from Don86:feature
Conversation
Updated dbscan() documentation about missing functionality in n_jobs parameter.
Fixed trailing whitespace at line 77.
Fixed whitespace again.
Fixed whitespace at line 100.
nearest neighbor is parallelised in the brute case.
Added "Setting algorithm="brute" uses multiple cores, but may cause a slow down instead." at line 103.
sklearn/cluster/dbscan_.py (outdated):

      NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
    - has not yet been parrallelized.
    + has not yet been parrallelized. Setting algorithm="brute" uses multiple
    + cores, but may cause a slow down instead.
all parallelism may cause a slow down
Changed wording to reflect this: changed "may" to "will", removed "instead".
sklearn/cluster/dbscan_.py (outdated):

      Note about n_jobs: this does not seem to use multiple processors. Calls
      NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
    - has not yet been parrallelized.
    + has not yet been parrallelized. Setting algorithm="brute" uses multiple
I only think we ever intended to parallelise the nearest neighbors.
I assume you mean: "I don't think we ever intended to parallelize the nearest neighbors"?
Removed "yet".
No, I meant what I said: I don't think we had any intentions to parallelise dbscan beyond its nn queries.
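The point above, that parallelism lives only in the nearest-neighbour queries, can be illustrated with a small sketch (the dataset, `eps`, and `min_samples` below are illustrative, not taken from the PR):

```python
# Hedged sketch: DBSCAN forwards n_jobs to its nearest-neighbour search;
# the clustering step itself is serial, so the labels are identical either way.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [2.0, 2.0], [2.0, 3.0],
              [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]])

labels_serial = DBSCAN(eps=3, min_samples=2, n_jobs=1).fit_predict(X)
labels_parallel = DBSCAN(eps=3, min_samples=2, n_jobs=-1).fit_predict(X)

# n_jobs changes only how neighbours are queried, never the clustering result.
print(labels_serial)  # -> [ 0  0  0  1  1 -1]
assert (labels_serial == labels_parallel).all()
```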
Updated comments in response to latest feedback.
sklearn/cluster/dbscan_.py (outdated):

      Note about n_jobs: this does not seem to use multiple processors. Calls
      NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
      has not been parrallelized. Setting algorithm="brute" uses multiple
      cores, but will cause a slow down.
What do you think of:
Note about ``n_jobs``: when ``algorithm="brute"``, the ``n_jobs`` parameter is used to compute a brute force multiple processors approach which can cause a slow down. For a faster computation we recommend using other NearestNeighbors algorithms by choosing another value for the parameter ``algorithm``. Nevertheless, these NearestNeighbors methods are not parallelised in scikit-learn and will not use the ``n_jobs`` parameter.
Hi, thanks for your feedback. Yup, sounds good, I've made the changes.
sklearn/cluster/dbscan_.py (outdated):

      The number of parallel jobs to run for neighbors search.
      If ``-1``, then the number of jobs is set to the number of CPU cores.
      Currently may not work as expected; might not use multiple processors.
      See notes below.
Maybe it's better to directly add the note information here.
Thanks @Don86. I've made some suggestions, tell me what you think?
Rewrite in response to feedback.
sklearn/cluster/dbscan_.py (outdated):

      n_jobs : int, optional (default = 1)
      The number of parallel jobs to run for neighbors search.
      If ``-1``, then the number of jobs is set to the number of CPU cores.
      Currently may not work as expected. When ``algorithm="brute"``, the
How about "Parallelism is only currently used when algorithm="brute", though other algorithms may remain faster."
Hi, thanks for your input. I tinkered with the comments a bit to assimilate your feedback, but still left the whole explanation there so that users will have a better idea of what's going on.
Updated in response to feedback. Fixed whitespace.
Changed line 151 to "Returns 1 for inliers and -1 for anomalies/outliers." to match the main description; previously mismatched.
      n_jobs : int, optional (default = 1)
      The number of parallel jobs to run for neighbors search.
      If ``-1``, then the number of jobs is set to the number of CPU cores.
      Parrallelism currently only used when ``algorithm="brute"``, where
I'm really not finding this clear and succinct enough, which is why I suggested a specific wording.
sklearn/neighbors/lof.py (outdated):

      -------
      is_inlier : array, shape (n_samples,)
    - Returns 1 for anomalies/outliers and -1 for inliers.
    + Returns 1 for inliers and -1 for anomalies/outliers.
Rolled back some changes which don't belong in this branch.
      -------
      is_inlier : array, shape (n_samples,)
    - Returns 1 for inliers and -1 for anomalies/outliers.
    + Returns 1 for anomalies/outliers and -1 for inliers.
Is it related to dbscan? If it is not, can you open another PR?
BTW, it seems that this confusion appears in a lot of places in the LOF file (e.g. l138).
Can you check the entire file?
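The docstring fix under discussion matches the usual scikit-learn outlier-detection convention, which can be checked directly (the data and `n_neighbors` below are illustrative):

```python
# fit_predict returns +1 for inliers and -1 for outliers, matching the
# corrected wording "Returns 1 for inliers and -1 for anomalies/outliers".
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[-1.1], [0.2], [101.1], [0.3]])
is_inlier = LocalOutlierFactor(n_neighbors=2).fit_predict(X)
print(is_inlier)  # -> [ 1  1 -1  1]
```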
Sorry for my mistake.
I had a look at the diff, not the PR.
@jnothman is this still relevant?
Not really relevant anymore. It is now only ineffective when ``metric='precomputed'``.
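The ``metric='precomputed'`` case can be sketched as follows: DBSCAN is handed a ready-made distance matrix, so there is no neighbour search left for `n_jobs` to parallelise (data and parameters below are illustrative):

```python
# Hedged sketch: DBSCAN on a precomputed distance matrix. The distances are
# computed up front, so the n_jobs setting has nothing left to speed up.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

X = np.array([[1.0, 2.0], [2.0, 2.0], [2.0, 3.0],
              [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]])
D = pairwise_distances(X)  # square (n_samples, n_samples) distance matrix

labels = DBSCAN(eps=3, min_samples=2, metric='precomputed',
                n_jobs=-1).fit_predict(D)
print(labels)  # -> [ 0  0  0  1  1 -1]
```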
Issue #8003 DBSCAN seems not to use multiple processors (n_jobs argument ignored).
Added documentation in dbscan() to alert users to missing functionality, and some information about why this functionality isn't always there.