[MRG] Updated dbscan(): added documentation #8039

Don86 · 2016-12-12T05:27:03Z

Issue #8003: DBSCAN seems not to use multiple processors (n_jobs argument ignored).
Added documentation in dbscan() to alert users to the missing functionality, with some information about why it isn't always available.

Updated dbscan() documentation about missing functionality in n_jobs parameter.

Fixed trailing whitespace at line 77.

Fixed whitespace again.

Fixed whitespace at line 100.

jnothman · 2016-12-12T12:11:53Z

Nearest neighbor search is parallelised in the brute case.

Added "Setting algorithm="brute" uses multiple cores, but may cause a slow down instead." at line 103.

jnothman · 2016-12-13T02:41:04Z

sklearn/cluster/dbscan_.py

@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',

    Note about n_jobs: this does not seem to use multiple processors. Calls
    NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
-    has not yet been parrallelized.
+    has not yet been parrallelized. Setting algorithm="brute" uses multiple
+    cores, but may cause a slow down instead.


All parallelism may cause a slowdown.

Changed wording to reflect this: changed "may" to "will", removed "instead".

jnothman · 2016-12-13T02:41:26Z

sklearn/cluster/dbscan_.py

@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',

    Note about n_jobs: this does not seem to use multiple processors. Calls
    NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
-    has not yet been parrallelized.
+    has not yet been parrallelized. Setting algorithm="brute" uses multiple


I only think we ever intended to parallelise the nearest neighbors.

I assume you mean: "I don't think we ever intended to parallelize the nearest neighbors"?
Removed "yet".

No, I meant what I said: I don't think we had any intentions to parallelise dbscan beyond its nn queries.

Updated comments in response to latest feedback.

tguillemot · 2016-12-13T09:50:03Z

sklearn/cluster/dbscan_.py

+    Note about n_jobs: this does not seem to use multiple processors. Calls
+    NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
+    has not been parrallelized. Setting algorithm="brute" uses multiple
+    cores, but will cause a slow down.


What do you think of:

Note about ``n_jobs``: when ``algorithm="brute"``, the ``n_jobs`` parameter is used to run a brute-force search across multiple processors, which can cause a slowdown. For faster computation we recommend using another NearestNeighbors algorithm by choosing a different value for the ``algorithm`` parameter. However, these NearestNeighbors methods are not parallelized in scikit-learn and will not use the ``n_jobs`` parameter.

Hi, thanks for your feedback. Yup, sounds good, I've made the changes.
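For anyone following along, here is a minimal sketch of the behaviour being documented, assuming a scikit-learn install with the `DBSCAN` estimator. The data, `eps`, and `min_samples` values are arbitrary and chosen only for illustration. It shows that passing `n_jobs` together with `algorithm="brute"` should only change how the neighbor queries are run, not the resulting labels; it makes no claim about which setting is faster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data (an assumption, not from the PR): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
])

# Brute-force neighbor search, once with a single job and once with two.
labels_serial = DBSCAN(
    eps=0.5, min_samples=5, algorithm="brute", n_jobs=1
).fit_predict(X)
labels_parallel = DBSCAN(
    eps=0.5, min_samples=5, algorithm="brute", n_jobs=2
).fit_predict(X)

# n_jobs affects scheduling of the neighbor queries, not the clustering itself.
same_labels = np.array_equal(labels_serial, labels_parallel)
print(same_labels)
```

To judge whether the parallel brute-force search actually helps on a given dataset, timing both calls (for example with `time.perf_counter`) against a tree-based `algorithm` value is probably the most reliable check.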

tguillemot · 2016-12-13T09:54:50Z

sklearn/cluster/dbscan_.py

@@ -74,6 +74,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run for neighbors search.
        If ``-1``, then the number of jobs is set to the number of CPU cores.
+        Currently may not work as expected; might not use multiple processors.
+        See notes below.


Maybe it's better to directly add the note information here.

tguillemot · 2016-12-13T09:56:03Z

Thanks @Don86. I've made some suggestions; tell me what you think.

Rewrite in response to feedback.

jnothman · 2016-12-14T10:20:34Z

sklearn/cluster/dbscan_.py

@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run for neighbors search.
        If ``-1``, then the number of jobs is set to the number of CPU cores.
+        Currently may not work as expected. When ``algorithm="brute"``, the


How about "Parallelism is only currently used when algorithm="brute", though other algorithms may remain faster."

Hi, thanks for your input. I tinkered with the comments a bit to assimilate your feedback, but still left the whole explanation there so that users will have a better idea of what's going on.

Updated in response to feedback. Fixed whitespace.

Changed line 151 to "Returns 1 for inliers and -1 for anomalies/outliers." to match main description; previously mismatched.

jnothman · 2016-12-15T07:54:27Z

sklearn/cluster/dbscan_.py

@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
    n_jobs : int, optional (default = 1)
        The number of parallel jobs to run for neighbors search.
        If ``-1``, then the number of jobs is set to the number of CPU cores.
+        Parrallelism currently only used when ``algorithm="brute"``, where


I'm really not finding this clear and succinct enough, which is why I suggested a specific wording.

jnothman · 2016-12-15T07:54:41Z

sklearn/neighbors/lof.py

@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None):
        Returns
        -------
        is_inlier : array, shape (n_samples,)
-            Returns 1 for anomalies/outliers and -1 for inliers.
+            Returns 1 for inliers and -1 for anomalies/outliers.


Doesn't belong here

Rolled back some changes which don't belong in this branch.

tguillemot · 2016-12-16T08:55:27Z

sklearn/neighbors/lof.py

@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None):
        Returns
        -------
        is_inlier : array, shape (n_samples,)
-            Returns 1 for inliers and -1 for anomalies/outliers.
+            Returns 1 for anomalies/outliers and -1 for inliers.


Is it related to dbscan? If it is not, can you open another PR?
BTW, it seems that this confusion occurs in a lot of places in the LOF file (e.g. l138).
Can you check the entire file?

Sorry for my mistake.
I had looked at the diff, not the PR.
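Since the sign convention caused some back and forth, here is a small sketch for checking it directly, assuming a scikit-learn install that provides `LocalOutlierFactor`. The data is made up for illustration: a cluster of points plus one far-away point that should be flagged as an outlier.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data (an assumption): 50 clustered points and one distant point.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
outlier = np.array([[10.0, 10.0]])
X = np.vstack([inliers, outlier])

# fit_predict returns one label per sample: 1 for inliers, -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
is_inlier = lof.fit_predict(X)

# The last row is the distant point, so its label shows the outlier sign.
print(is_inlier[-1])
```

This matches the "1 for inliers and -1 for anomalies/outliers" wording in the diff above.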

amueller · 2018-09-27T19:57:44Z

@jnothman is this still relevant?

jnothman

Not really relevant anymore. It is now only ineffective when metric='precomputed'
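A possible workaround for the `metric='precomputed'` case, sketched under the assumption that the full distance matrix fits in memory: since DBSCAN receives the distances ready-made, any parallelism has to happen when the matrix is built, for example with `sklearn.metrics.pairwise_distances` and its own `n_jobs` argument. The data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Illustrative data (an assumption): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
])

# Parallelise the distance computation explicitly, outside of DBSCAN.
D = pairwise_distances(X, metric="euclidean", n_jobs=2)

# DBSCAN then clusters from the precomputed matrix.
labels_precomputed = DBSCAN(
    eps=0.5, min_samples=5, metric="precomputed"
).fit_predict(D)

# For comparison: let DBSCAN compute Euclidean neighborhoods itself.
labels_direct = DBSCAN(
    eps=0.5, min_samples=5, metric="euclidean"
).fit_predict(X)

print(np.array_equal(labels_precomputed, labels_direct))
```

The trade-off is memory: the dense matrix grows quadratically with the number of samples, so this is only practical for moderately sized datasets.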

Don86 added 4 commits December 12, 2016 16:24

Updated dbscan(): added documentation

268f69f

Updated dbscan() documentation about missing functionality in n_jobs parameter.

Updated dbscan(): fixed whitespace

7dca7f3

Fixed trailing whitespace at line 77.

Updated dbscan(): fixed whitespace (again)

8bd4435

Fixed whitespace again.

Update dbscan(): Take 3

5865eec

Fixed whitespace at line 100.

Update dbscan(): Addition for "brute" algorithm

6161ae7

Added "Setting algorithm="brute" uses multiple cores, but may cause a slow down instead." at line 103.

jnothman reviewed Dec 13, 2016

Updated dbscan(): update 5

da280a5

Updated comments in response to latest feedback.

tguillemot suggested changes Dec 13, 2016

Update dbscan()

e141dce

Rewrite in response to feedback.

jnothman reviewed Dec 14, 2016

Don86 added 2 commits December 15, 2016 12:43

Updated dbscan(): Update 8

8648694

Updated in response to feedback. Fixed whitespace.

Update fit_predict(): documentation

176979e

Changed line 151 to "Returns 1 for inliers and -1 for anomalies/outliers." to match main description; previously mismatched.

Don86 mentioned this pull request Dec 15, 2016

the contradiction in docstring in LocalOutlierFactor class #8048

Closed

jnothman reviewed Dec 15, 2016

Update fit_predict()

1b8c8c2

Rolled back some changes which don't belong in this branch.

tguillemot reviewed Dec 16, 2016

amueller mentioned this pull request Dec 16, 2016

DBSCAN seems not to use multiple processors (n_jobs argument ignored) #8003

Closed

jnothman reviewed Sep 28, 2018

jnothman closed this Sep 28, 2018
