DOC merge dbscan, hdbcan, optics gallery examples into one example by daustria · Pull Request #31102 · scikit-learn/scikit-learn · GitHub

DOC merge dbscan, hdbcan, optics gallery examples into one example #31102

Open: daustria wants to merge 1 commit into base: main

daustria
@daustria daustria commented Mar 28, 2025

Reference Issues/PRs

Fixes #29962

What does this implement/fix? Explain your changes.

I have done my best to merge the three user gallery examples in the above issue into one readable and cohesive document.

Any other comments?

I am relatively new to writing technical documents and to the ideas behind these clustering algorithms, and would appreciate any feedback on how I can improve this document. I also think there is a UI bug with the bookmarks on the right-hand side when viewing the example in a desktop web browser. The section titles are not highlighted at the correct time: HDBSCAN sections are highlighted as the user scrolls through the OPTICS section. I'm not sure how to fix that, so I would like some advice on how to proceed on that part.


✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e9ae604.

@daustria daustria changed the title from "merge dbscan, hdbcan, optics gallery examples into one example" to "DOC merge dbscan, hdbcan, optics gallery examples into one example" on Mar 31, 2025
Member
@adrinjalali adrinjalali left a comment

I quite like the new example. It tells a very nice story.

Maybe @lucyleeow could have a look?

import numpy as np


def cluster_colours(labels):
Member

a little docstring for these functions to explain what they try to do would be nice

Comment on lines +70 to +71
# DBSCAN
# -------
Member

Suggested change
- # DBSCAN
- # -------
+ # DBSCAN
+ # ------

They need to be the same length, otherwise sphinx complains
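
The same-length convention requested here can be checked mechanically before pushing. The following is only a rough sketch, not part of the PR or of any scikit-learn tooling: `find_mismatched_underlines` is a hypothetical helper that strips the leading `# ` used in gallery comment blocks and reports any adornment line whose length differs from the title above it.

```python
import re

# A line made of one repeated adornment character, e.g. "------" or "+++++".
ADORNMENT = re.compile(r"^([=\-+~^#*])\1*$")


def find_mismatched_underlines(source):
    """Return (line_number, title, underline) for every length mismatch.

    Hypothetical helper: it only inspects a title line followed directly by
    an adornment line and ignores overlines.
    """
    lines = [ln[2:] if ln.startswith("# ") else ln for ln in source.splitlines()]
    problems = []
    for i in range(1, len(lines)):
        title, underline = lines[i - 1].rstrip(), lines[i].rstrip()
        # Skip blank lines and lines that are themselves adornments.
        if not title or ADORNMENT.match(title):
            continue
        if ADORNMENT.match(underline) and len(underline) != len(title):
            problems.append((i + 1, title, underline))
    return problems


example = "# DBSCAN\n# " + "-" * 7 + "\n# Tuning DBSCAN\n# " + "+" * 13 + "\n"
for line_number, title, underline in find_mismatched_underlines(example):
    print(f"line {line_number}: {title!r} has underline of length {len(underline)}")
```

Because it works on plain text, a check like this can flag ordinary code lines that happen to precede a comment made of dashes, so its output needs a quick manual look.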

Comment on lines +131 to +132
# Tuning DBSCAN
# ++++++++++++++
Member

Suggested change
- # Tuning DBSCAN
- # ++++++++++++++
+ # Tuning DBSCAN
+ # +++++++++++++

Comment on lines +134 to +136
# default values for `eps` and `min_samples` they usually will not provide good
# results, as these sensitive to the shape of the dataset you are working with.
# Larger values of `min_samples` yield more robustness to noise, but
Member

there have been discussions about providing an "auto" option for `eps`, which would sample distances from the dataset and then decide on a reasonable default. In case you're interested in pushing that forward, it could be a separate PR.
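
To make the idea concrete, here is one possible heuristic, sketched under stated assumptions: it is not an existing scikit-learn feature and not necessarily what the discussed "auto" option would do. The hypothetical `estimate_eps` function takes each sample's distance to its `min_samples`-th nearest neighbor (counting the sample itself) and returns a quantile of those distances; the `quantile` default is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors


def estimate_eps(X, min_samples=5, quantile=0.9):
    """Guess an `eps` value from k-nearest-neighbor distances (illustrative only)."""
    neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
    # Querying the training data returns each point as its own first neighbor,
    # so the last column is the distance to the min_samples-th point counting itself.
    distances, _ = neighbors.kneighbors(X)
    return float(np.quantile(distances[:, -1], quantile))


X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
print(f"suggested eps: {estimate_eps(X, min_samples=10):.3f}")
```

A lower `quantile` gives a smaller `eps` and therefore more points labeled as noise; the value still has to be sanity-checked against a plot of the resulting clusters.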

# plot. However, finding good values of `eps` and `min_samples` is not always
# as easy as this. It may require specialized knowledge of the dataset
# one is working with. While standardizing the dataset with
# `StandardScaler` may help with this problem, great care must be taken for
Member
@adrinjalali adrinjalali May 5, 2025

Suggested change
- # `StandardScaler` may help with this problem, great care must be taken for
+ # :class:`~sklearn.preprocessing.StandardScaler` may help with this problem, great care must be taken for

and of course to trim the now long line
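
For readers unfamiliar with the pattern under discussion, a minimal sketch of standardizing before DBSCAN follows. The dataset, the stretching factor, and the `eps` and `min_samples` values are illustrative assumptions, not values taken from the PR.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Stretch the second feature so the two axes live on very different scales.
X_stretched = X * np.array([1.0, 100.0])

# After scaling, `eps` is a distance in standardized units, not original units.
pipeline = make_pipeline(StandardScaler(), DBSCAN(eps=0.3, min_samples=10))
labels = pipeline.fit_predict(X_stretched)

n_clusters = len(set(labels) - {-1})
print(f"clusters: {n_clusters}, noise points: {int(np.sum(labels == -1))}")
```

Keeping the scaler and the clusterer in one pipeline means any `eps` chosen by inspection always refers to the same scaled space.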

Comment on lines +252 to +253
# Tuning HDBSCAN
# ++++++++++++++++
Member

Suggested change
- # Tuning HDBSCAN
- # ++++++++++++++++
+ # Tuning HDBSCAN
+ # ++++++++++++++

Comment on lines +263 to +265
# generally tuned to larger values as needed. Smaller values will likely to
# lead to results with fewer points labeled as noise. However values which too
# small will lead to false sub-clusters being picked up and preferred. Larger
Member

Suggested change
- # generally tuned to larger values as needed. Smaller values will likely to
- # lead to results with fewer points labeled as noise. However values which too
- # small will lead to false sub-clusters being picked up and preferred. Larger
+ # generally tuned to larger values as needed. Smaller values will likely
+ # lead to results with fewer points labeled as noise. However too small
+ # values will lead to false sub-clusters being picked up and preferred. Larger

# clustering of all points across all values of DBSCAN’s `eps` parameter. We
# can efficiently obtain these DBSCAN like clusterings efficiently without
# fully recomputing intermediate values required for an HDBSCAN fit, such as
# core-distances, mutual-reachability, and the minimum spanning tree This is
Member

Suggested change
- # core-distances, mutual-reachability, and the minimum spanning tree This is
+ # core-distances, mutual-reachability, and the minimum spanning tree. This is

Comment on lines +334 to +335
# OPTICS
# -------
Member

Suggested change
- # OPTICS
- # -------
+ # OPTICS
+ # ------

Comment on lines +382 to +383
if k == -1:
    continue
Member

Suggested change
- if k == -1:
-     continue
+ if k == -1:
+     # skip points labeled as outliers
+     continue

Member
@lucyleeow lucyleeow left a comment

This is a great improvement to the previous 3 examples, and it's nice that we can reduce the number of examples.

My only comment is that it is on the longer side, though I have no suggestions as to how to make it shorter. I've tried to remove redundant sentences (e.g., things that are explained in the docstring of the model).

I've only got minor grammar nits.
I noticed we used both 'labelled' (UK spelling) and 'labeled' (US spelling), maybe we could pick one and make it consistent?

The other thing to note is that the deleted examples may have been referenced in the docs / docstrings. I would search for the titles of the deleted examples to find references to them. You can replace them with a reference to the new example. I think this should be sphx_glr_auto_examples_cluster_plot_dbscan_hdbscan_optics.py
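
One way to carry out the search suggested here is a small script run from the repository root. This is a sketch under assumptions: the three old reference names in `OLD_REFS` are guessed from the usual sphinx-gallery naming pattern and should be checked against the files actually deleted by the PR, and `find_stale_references` is a hypothetical helper.

```python
from pathlib import Path

# Assumed names of the references to the three deleted examples.
OLD_REFS = (
    "sphx_glr_auto_examples_cluster_plot_dbscan.py",
    "sphx_glr_auto_examples_cluster_plot_hdbscan.py",
    "sphx_glr_auto_examples_cluster_plot_optics.py",
)


def find_stale_references(root, suffixes=(".py", ".rst")):
    """Return (path, line_number, line) for every line mentioning an old reference."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if any(ref in line for ref in OLD_REFS):
                hits.append((str(path), line_number, line.strip()))
    return hits


if __name__ == "__main__":
    for path, line_number, line in find_stale_references("."):
        print(f"{path}:{line_number}: {line}")
```

Each hit can then be pointed at the new example's reference instead.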

Comment on lines +3 to +5
=======================================================
Demo of DBSCAN, HDBSCAN, OPTICS clustering algorithms
=======================================================
Member

The over/under-lining should be the same length as the title

=======================================================
.. currentmodule:: sklearn

In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICs clustering
Member

Suggested change
- In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICs clustering
+ In this demo we will take a look at DBSCAN, HDBSCAN, and OPTICS clustering

Also I may be inclined to give what the acronyms stand for at the start, when you first mention the acronyms, instead of later on.

Comment on lines +76 to +77
# To gain some intuition about how DBSCAN works, we
# need to discuss the concept of core points. We will also discuss neighborhoods.
Member

Suggested change
- # To gain some intuition about how DBSCAN works, we
- # need to discuss the concept of core points. We will also discuss neighborhoods.
+ # To gain some intuition about how DBSCAN works, we
+ # need to discuss the concept of core points and neighborhoods.

# need to discuss the concept of core points. We will also discuss neighborhoods.
# Generally speaking, when we talk about the neighborhood of a point, we are
# referring to a collection of points in the dataset that are within a certain
# distance from it. The size of the neighborhood will depend on the context. A
Member

"The size of the neighborhood will depend on the context"

Not sure exactly what context means here?

can be viewed as generalizations of DBSCAN, and efficiently extract DBSCAN
clusterings from the results of running these algorithms.

We start by defining helper functions to visualize a dataset and the resulting
Member

nit, as we are mostly talking about the plot function and we use the singular below in "use this function".

Suggested change
- We start by defining helper functions to visualize a dataset and the resulting
+ We start by defining a helper function to visualize a dataset and the resulting

# The Xi clustering method uses the steep slopes within the reachability
# plot to determine which points are in the same cluster. One can specify
# this by setting the `xi` parameter. More details on this parameter
# may be found the :ref:`User Guide <OPTICS>`.
Member

Suggested change
- # may be found the :ref:`User Guide <OPTICS>`.
+ # may be found the :ref:`User Guide <optics>`.
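
For context on the `xi` parameter mentioned in the quoted lines, here is a short sketch that fits `OPTICS` with a few `xi` values and counts the resulting clusters. The dataset and the `xi` values are illustrative assumptions, not taken from the PR.

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

clusters_per_xi = {}
for xi in (0.01, 0.05, 0.2):
    # `xi` sets the minimum steepness on the reachability plot that counts
    # as a cluster boundary.
    labels = OPTICS(min_samples=10, xi=xi, cluster_method="xi").fit_predict(X)
    clusters_per_xi[xi] = len(set(labels) - {-1})

for xi, n_clusters in clusters_per_xi.items():
    print(f"xi={xi}: {n_clusters} clusters")
```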

Comment on lines +400 to +401
# clustering. The user can choose which method by setting the `cluster_method`
# parameter to either `xi` (default) or `dbscan`.
Member

Suggested change
- # clustering. The user can choose which method by setting the `cluster_method`
- # parameter to either `xi` (default) or `dbscan`.
+ # clustering.

# `cluster_optics_dbscan()`. This function takes the reachability, ordering,
# and core distances produced from running OPTICS, and an `eps` parameter
# analogous to its counterpart in DBSCAN. It produces a clustering with results
# similar if one were to run DBSCAN on the same dataset with similar settings
Member

Suggested change
- # similar if one were to run DBSCAN on the same dataset with similar settings
+ # similar to if one were to run DBSCAN on the same dataset with similar settings

Comment on lines +423 to +424
# for `eps` and `min_samples`. The runtime is efficient, it is linear in the
# number of samples whereas running DBSCAN from scratch is quadratic in its
Member

Suggested change
- # for `eps` and `min_samples`. The runtime is efficient, it is linear in the
- # number of samples whereas running DBSCAN from scratch is quadratic in its
+ # for `eps` and `min_samples`. The runtime is efficient, it is linear to the
+ # number of samples whereas running DBSCAN from scratch is quadratic in its
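
The extraction step discussed in this thread can be sketched as follows: `OPTICS` is fitted once, and `cluster_optics_dbscan` is then called with the fitted `reachability_`, `core_distances_`, and `ordering_` attributes for each `eps`. The dataset and `eps` values are assumptions for illustration; the PR's own `plot_optics_dbscan` helper is not reproduced here.

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

# Fit OPTICS once; the attributes below are reused for every `eps`.
optics = OPTICS(min_samples=10).fit(X)

labels_per_eps = {}
for eps in (0.5, 2.0):
    labels_per_eps[eps] = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )

for eps, labels in labels_per_eps.items():
    print(f"eps={eps}: {len(set(labels) - {-1})} clusters")
```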



for idx, eps in enumerate(eps_values):
    plot_optics_dbscan(optics, eps, axes[idx])
Member

It would be nice to have a very short summary of the 3 methods here.

@daustria
Author
daustria commented May 14, 2025

thanks for the comments and your time, i will address them in the next few work days
edit: some things in life came up, might be a while longer, hopefully this month

Development

Successfully merging this pull request may close these issues.

DOC merging the examples related to OPTICS, DBSCAN, and HDBSCAN
3 participants