DOC merge dbscan, hdbcan, optics gallery examples into one example #31102
I quite like the new example. It tells a very nice story.
Maybe @lucyleeow could have a look?
import numpy as np


def cluster_colours(labels):
A short docstring for these functions, explaining what they do, would be nice.
# DBSCAN
# -------
- # DBSCAN
- # -------
+ # DBSCAN
+ # ------
They need to be the same length, otherwise Sphinx complains.
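In case it helps catch the remaining ones, here is a minimal sketch of a checker for `# `-prefixed comment blocks. The function name and the set of underline characters are assumptions made for illustration; it flags any length mismatch, following the convention asked for above.

```python
import re

# Characters treated as section underlines in this sketch (an assumption,
# covering the ones that appear in the example under review).
UNDERLINE_CHARS = "=-+~^"


def find_mismatched_underlines(source: str):
    """Return (line_number, title, underline) for every heading whose
    underline length differs from the length of the title above it."""
    # Drop the leading "# " that sphinx-gallery comment blocks carry.
    lines = [re.sub(r"^#\s?", "", line).rstrip() for line in source.splitlines()]
    problems = []
    for i in range(1, len(lines)):
        title, underline = lines[i - 1], lines[i]
        is_underline = (
            len(underline) >= 3
            and underline[0] in UNDERLINE_CHARS
            and set(underline) == {underline[0]}
        )
        # Skip pairs where the "title" is itself a rule of the same character.
        if is_underline and title and set(title) != set(underline):
            if len(title) != len(underline):
                problems.append((i + 1, title, underline))
    return problems
```

Calling `find_mismatched_underlines(open(path).read())` on the example file would list each offending heading with its 1-based line number.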
# Tuning DBSCAN
# ++++++++++++++
- # Tuning DBSCAN
- # ++++++++++++++
+ # Tuning DBSCAN
+ # +++++++++++++
# default values for `eps` and `min_samples` they usually will not provide good
# results, as these sensitive to the shape of the dataset you are working with.
# Larger values of `min_samples` yield more robustness to noise, but
There have been discussions about providing an "auto" version for `eps`, which would sample distances from the dataset and then decide on a reasonable default. If you're interested in pushing that forward, it could be a separate PR.
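To make the idea concrete, here is a rough sketch of what such a heuristic could look like. This is not an existing scikit-learn feature: `suggest_eps`, its parameters, and the choice of a quantile of k-th neighbor distances are all assumptions for illustration, written with the standard library only.

```python
import math
import random


def suggest_eps(points, min_samples=5, sample_size=200, quantile=0.9, seed=0):
    """Hypothetical heuristic: pick a quantile of sampled k-th neighbor distances."""
    n = len(points)
    rng = random.Random(seed)
    indices = range(n) if n <= sample_size else rng.sample(range(n), sample_size)
    # min_samples counts the point itself, so look at min_samples - 1 others.
    k = max(1, min_samples - 1)
    kth_distances = []
    for i in indices:
        # Brute-force distances from the sampled point to every other point.
        distances = sorted(
            math.dist(points[i], points[j]) for j in range(n) if j != i
        )
        kth_distances.append(distances[k - 1])
    kth_distances.sort()
    position = min(len(kth_distances) - 1, int(quantile * len(kth_distances)))
    return kth_distances[position]
```

The brute-force distance loop keeps the sketch dependency-free; a real implementation would more likely rely on a nearest-neighbors index.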
# plot. However, finding good values of `eps` and `min_samples` is not always
# as easy as this. It may require specialized knowledge of the dataset
# one is working with. While standardizing the dataset with
# `StandardScaler` may help with this problem, great care must be taken for
- # `StandardScaler` may help with this problem, great care must be taken for
+ # :class:`~sklearn.preprocessing.StandardScaler` may help with this problem, great care must be taken for
And of course trim the now-long line.
# Tuning HDBSCAN
# ++++++++++++++++
- # Tuning HDBSCAN
- # ++++++++++++++++
+ # Tuning HDBSCAN
+ # ++++++++++++++
# generally tuned to larger values as needed. Smaller values will likely to
# lead to results with fewer points labeled as noise. However values which too
# small will lead to false sub-clusters being picked up and preferred. Larger
- # generally tuned to larger values as needed. Smaller values will likely to
- # lead to results with fewer points labeled as noise. However values which too
- # small will lead to false sub-clusters being picked up and preferred. Larger
+ # generally tuned to larger values as needed. Smaller values will likely
+ # lead to results with fewer points labeled as noise. However too small
+ # values will lead to false sub-clusters being picked up and preferred. Larger
# clustering of all points across all values of DBSCAN’s `eps` parameter. We
# can efficiently obtain these DBSCAN like clusterings efficiently without
# fully recomputing intermediate values required for an HDBSCAN fit, such as
# core-distances, mutual-reachability, and the minimum spanning tree This is
- # core-distances, mutual-reachability, and the minimum spanning tree This is
+ # core-distances, mutual-reachability, and the minimum spanning tree. This is
# OPTICS
# -------
- # OPTICS
- # -------
+ # OPTICS
+ # ------
if k == -1:
    continue
- if k == -1:
-     continue
+ if k == -1:
+     # skip points labeled as outliers
+     continue
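For context on why the comment is worth having: the density-based estimators in scikit-learn assign the label `-1` to noise. A small stand-alone sketch of the same pattern follows; `group_by_cluster` and `NOISE_LABEL` are hypothetical names for illustration, not part of the example under review.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label given to noise points by DBSCAN, HDBSCAN and OPTICS


def group_by_cluster(points, labels):
    """Map each cluster label to its points, leaving noise out."""
    clusters = defaultdict(list)
    for point, k in zip(points, labels):
        if k == NOISE_LABEL:
            # skip points labeled as outliers
            continue
        clusters[k].append(point)
    return dict(clusters)
```

Naming the constant is one way to make the intent obvious even without the inline comment.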
This is a great improvement to the previous 3 examples, and it's nice that we can reduce the number of examples.
My only comment is that it is on the longer side, though I have no suggestions as to how to make it shorter. I've tried to remove redundant sentences (e.g., things that are explained in the docstring of the model).
I've only got minor grammar nits.
I noticed we use both 'labelled' (UK spelling) and 'labeled' (US spelling); maybe we could pick one and make it consistent?
The other thing to note is that the deleted examples may have been referenced in the docs / docstrings. I would search for the titles of the deleted examples to find references to them. You can replace them with a reference to the new example. I think this should be sphx_glr_auto_examples_cluster_plot_dbscan_hdbscan_optics.py
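If a plain text search is awkward, a small script along these lines could do the lookup. It is only a sketch: `find_references` is a hypothetical helper, and the search strings you pass in are whatever identifiers or titles the deleted examples actually used.

```python
from pathlib import Path


def find_references(root, needles, suffixes=(".py", ".rst")):
    """Yield (path, line_number, line) for every line mentioning a needle."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(needle in line for needle in needles):
                yield path, number, line.strip()
```

Running it over the docs and source directories with the old example names as `needles` would print every place that still needs to point at the new example.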
=======================================================
Demo of DBSCAN, HDBSCAN, OPTICS clustering algorithms
=======================================================
The over/under-lining should be the same length as the title
=======================================================
.. currentmodule:: sklearn

In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICs clustering
- In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICs clustering
+ In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICS clustering
Also, I would be inclined to spell out what the acronyms stand for at the start, where you first mention them, instead of later on.
# To gain some intuition about how DBSCAN works, we
# need to discuss the concept of core points. We will also discuss neighborhoods.
- # To gain some intuition about how DBSCAN works, we
- # need to discuss the concept of core points. We will also discuss neighborhoods.
+ # To gain some intuition about how DBSCAN works, we
+ # need to discuss the concept of core points and neighborhoods.
# need to discuss the concept of core points. We will also discuss neighborhoods.
# Generally speaking, when we talk about the neighborhood of a point, we are
# referring to a collection of points in the dataset that are within a certain
# distance from it. The size of the neighborhood will depend on the context. A
"The size of the neighborhood will depend on the context"
Not sure exactly what context means here?
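One possible reading, offered as an assumption rather than what the author meant, is that "context" refers to the chosen `eps` radius. Under that reading, the two concepts can be sketched as below; the helper names are hypothetical, and the neighborhood count includes the point itself, as `min_samples` does in scikit-learn's DBSCAN.

```python
import math


def eps_neighborhood(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    center = points[index]
    return [j for j, q in enumerate(points) if math.dist(center, q) <= eps]


def core_point_indices(points, eps, min_samples):
    """Indices of points whose eps-neighborhood holds at least min_samples points."""
    return [
        i
        for i in range(len(points))
        if len(eps_neighborhood(points, i, eps)) >= min_samples
    ]
```

If that is the intended meaning, the sentence could simply say that the size of the neighborhood is set by `eps`.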
can be viewed as generalizations of DBSCAN, and efficiently extract DBSCAN
clusterings from the results of running these algorithms.

We start by defining helper functions to visualize a dataset and the resulting
Nit, as we are mostly talking about the plot function and we use the singular below in "use this function".
- We start by defining helper functions to visualize a dataset and the resulting
+ We start by defining a helper function to visualize a dataset and the resulting
# The Xi clustering method uses the steep slopes within the reachability
# plot to determine which points are in the same cluster. One can specify
# this by setting the `xi` parameter. More details on this parameter
# may be found the :ref:`User Guide <OPTICS>`.
- # may be found the :ref:`User Guide <OPTICS>`.
+ # may be found the :ref:`User Guide <optics>`.
# clustering. The user can choose which method by setting the `cluster_method`
# parameter to either `xi` (default) or `dbscan`.
- # clustering. The user can choose which method by setting the `cluster_method`
- # parameter to either `xi` (default) or `dbscan`.
+ # clustering.
# `cluster_optics_dbscan()`. This function takes the reachability, ordering,
# and core distances produced from running OPTICS, and an `eps` parameter
# analogous to its counterpart in DBSCAN. It produces a clustering with results
# similar if one were to run DBSCAN on the same dataset with similar settings
- # similar if one were to run DBSCAN on the same dataset with similar settings
+ # similar to if one were to run DBSCAN on the same dataset with similar settings
# for `eps` and `min_samples`. The runtime is efficient, it is linear in the
# number of samples whereas running DBSCAN from scratch is quadratic in its
- # for `eps` and `min_samples`. The runtime is efficient, it is linear in the
- # number of samples whereas running DBSCAN from scratch is quadratic in its
+ # for `eps` and `min_samples`. The runtime is efficient, it is linear to the
+ # number of samples whereas running DBSCAN from scratch is quadratic in its
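For reference, a minimal sketch of the extraction step this paragraph describes, assuming scikit-learn and NumPy are installed. The toy data, `min_samples=5`, and the `eps` values are illustrative choices and are not taken from the example under review.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Two well-separated Gaussian blobs (toy data, chosen for illustration).
rng = np.random.RandomState(0)
X = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    ]
)

# Fit OPTICS once; the fitted attributes are reused for every eps below.
optics = OPTICS(min_samples=5).fit(X)

labels_by_eps = {}
for eps in (0.3, 0.5, 1.0):
    # Extract a DBSCAN-like clustering without refitting the model.
    labels_by_eps[eps] = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
```

Each entry of `labels_by_eps` is an array of cluster labels, with `-1` marking noise, so the same fit can be compared across several `eps` values.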
for idx, eps in enumerate(eps_values):
    plot_optics_dbscan(optics, eps, axes[idx])
It would be nice to have a very short summary of the 3 methods here.
Thanks for the comments and your time. I will address them in the next few working days.
Reference Issues/PRs
Fixes #29962
What does this implement/fix? Explain your changes.
I have done my best to merge the three user gallery examples in the above issue into one readable and cohesive document.
Any other comments?
I am relatively new to writing technical documents and to the ideas behind these clustering algorithms, and would appreciate any feedback on how I can improve this document. I also think there is a UI bug with the bookmarks on the right-hand side when viewing the example in a desktop web browser: the section titles are not highlighted at the correct time, and HDBSCAN sections are highlighted as the user scrolls through the OPTICS section. I'm not sure how to fix that, so I would like some advice on how to proceed on that part.