8000 [MRG] Updated dbscan(): added documentation by Don86 · Pull Request #8039 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] Updated dbscan(): added documentation #8039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
Closed

[MRG] Updated dbscan(): added documentation #8039

wants to merge 10 commits into from

Conversation

Don86
Copy link
Contributor
@Don86 Don86 commented Dec 12, 2016

Issue #8003 DBSCAN seems not to use multiple processors (n_jobs argument ignored).
Added documentation in dbscan() to alert users to missing functionality, and some information about why this functionality isn't always there.

Updated dbscan() documentation about missing functionality in n_jobs parameter.

Updated dbscan() documentation about missing functionality in n_jobs parameter.
Fixed trailing whitespace at line 77.
Fixed whitespace at line 100.
@jnothman
Copy link
Member

nearest neighbor is parallelised in the brute case.

Added "Setting algorithm="brute" uses multiple cores, but may cause a slow down instead." at line 103.
@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',

Note about n_jobs: this does not seem to use multiple processors. Calls
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
has not yet been parrallelized.
has not yet been parrallelized. Setting algorithm="brute" uses multiple
cores, but may cause a slow down instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all parallelism may cause a slow down

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed wording to reflect this: changed "may" to "will", removed "instead".

@@ -100,7 +100,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',

Note about n_jobs: this does not seem to use multiple processors. Calls
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
has not yet been parrallelized.
has not yet been parrallelized. Setting algorithm="brute" uses multiple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only think we ever intended to parallelise the nearest neighbors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean: "I don't think we ever intended to parallelize the nearest neighbors"?
Removed "yet".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant what I said: I don't think we had any intentions to parallelise dbscan beyond its nn queries.

Updated comments in response to latest feedback.
Note about n_jobs: this does not seem to use multiple processors. Calls
NearestNeighbour, passing the n_jobs parameter to it, but NearestNeighbour
has not been parrallelized. Setting algorithm="brute" uses multiple
cores, but will cause a slow down.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of :

Note about ``n_jobs``: when ``algorithm="brute"``, the ``n_jobs`` parameter is used to compute a brute force multiple processors approach which can cause a slow down. For a faster computation we recommend to use other NearestNeighbors algorithms by choosing another value for the parameter ``algorithm``. Nevertheless, these NearestNeighbors methods are not parrellelized in scikit-learn and will not use the ``n_jobs`` parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for your feedback. Yup, sounds good, I've made the changes.

@@ -74,6 +74,8 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
n_jobs : int, optional (default = 1)
The number of parallel jobs to run for neighbors search.
If ``-1``, then the number of jobs is set to the number of CPU cores.
Currently may not work as expected; might not use multiple processors.
See notes below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to directly add the note information here.

8000
@tguillemot
Copy link
Contributor

Thanks @Don86. I've done some suggestions, tell me what you think ?

Rewrite in response to feedback.
@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
n_jobs : int, optional (default = 1)
The number of parallel jobs to run for neighbors search.
If ``-1``, then the number of jobs is set to the number of CPU cores.
Currently may not work as expected. When ``algorithm="brute"``, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "Parallelism is only currently used when algorithm="brute", though other algorithms may remain faster."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for your input. I tinkered with the comments a bit to assimilate your feedback, but still left the whole explanation there so that users will have a better idea of what's going on.

Updated in response to feedback
Fixed whitespace
Changed line 151 to "Returns 1 for inliers and -1 for anomalies/outliers." to match main description; peviously mismatched.
@@ -74,6 +74,13 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
n_jobs : int, optional (default = 1)
The number of parallel jobs to run for neighbors search.
If ``-1``, then the number of jobs is set to the number of CPU cores.
Parrallelism currently only used when ``algorithm="brute"``, where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not finding this clear and succinct enough, which is why I suggested a specific wording.

@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None):
Returns
-------
is_inlier : array, shape (n_samples,)
Returns 1 for anomalies/outliers and -1 for inliers.
Returns 1 for inliers and -1 for anomalies/outliers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't belong here

Rolled back some changes which don't belong in this branch.
@@ -148,7 +148,7 @@ def fit_predict(self, X, y=None):
Returns
-------
is_inlier : array, shape (n_samples,)
Returns 1 for inliers and -1 for anomalies/outliers.
Returns 1 for anomalies/outliers and -1 for inliers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it related to dbscan ? If it is not, can you open another PR.
BTW, it seems that this confusion is done in a lot of places in the LOF file (ex : l138 ).
Can you check the entire file ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the my mistake.
I had a look to the diff not the PR.

@amueller
Copy lin 9E88 k
Member

@jnothman is this still relevant?

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really relevant anymore. It is now only ineffective when metric='precomputed'

@jnothman jnothman closed this Sep 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0