DOC Improve OPTICS doc by qinhanmin2014 · Pull Request #13866 · scikit-learn/scikit-learn · GitHub

DOC Improve OPTICS doc #13866

Merged (2 commits) on May 15, 2019
62 changes: 35 additions & 27 deletions sklearn/cluster/optics_.py
@@ -7,6 +7,7 @@
Authors: Shane Grigsby <refuge@rocktalus.com>
Adrin Jalali <adrinjalali@gmail.com>
Erich Schubert <erich@debian.org>
+Hanmin Qin <qinhanmin2005@sina.com>
License: BSD 3 clause
"""

@@ -23,13 +24,15 @@
class OPTICS(BaseEstimator, ClusterMixin):
"""Estimate clustering structure from vector array

-OPTICS: Ordering Points To Identify the Clustering Structure Closely
+OPTICS (Ordering Points To Identify the Clustering Structure), closely
related to DBSCAN, finds core sample of high density and expands clusters
from them [1]_. Unlike DBSCAN, keeps cluster hierarchy for a variable
neighborhood radius. Better suited for usage on large datasets than the
current sklearn implementation of DBSCAN.

-Clusters are then extracted using a DBSCAN like method [1]_.
+Clusters are then extracted using a DBSCAN-like method
+(cluster_method = 'dbscan') or an automatic
+technique proposed in [1]_ (cluster_method = 'xi').

This implementation deviates from the original OPTICS by first performing
k-nearest-neighborhood searches on all points to identify core sizes, then
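
Note on the DBSCAN-like extraction named in this hunk: the following is a minimal NumPy sketch, assuming a fitted model provides a reachability array, a core-distance array, and a visiting order. The function name `dbscan_like_labels` and the toy values are hypothetical and do not come from the PR. The sketch only illustrates the idea (a point whose reachability exceeds `eps` opens a new cluster if it is still a core point at `eps`, and is noise otherwise); it is not presented as the library's exact code.

```python
import numpy as np


def dbscan_like_labels(reachability, core_distances, ordering, eps):
    """Cut a reachability ordering at `eps` (illustrative sketch only)."""
    reachability = np.asarray(reachability, dtype=float)
    core_distances = np.asarray(core_distances, dtype=float)
    ordering = np.asarray(ordering, dtype=int)

    far_reach = reachability > eps      # not reachable from earlier points at eps
    near_core = core_distances <= eps   # still a core point at eps

    labels = np.zeros(len(reachability), dtype=int)
    # Walking the ordering, each "far but core" point opens a new cluster.
    labels[ordering] = np.cumsum(far_reach[ordering] & near_core[ordering]) - 1
    # Points that are far from everything and not core are marked as noise.
    labels[far_reach & ~near_core] = -1
    return labels


# Hypothetical toy values: two dense groups followed by one isolated point.
reachability = [np.inf, 0.1, 0.1, 5.0, 0.1, 0.1, 9.0]
core_distances = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 9.0]
ordering = [0, 1, 2, 3, 4, 5, 6]
labels = dbscan_like_labels(reachability, core_distances, ordering, eps=0.5)
```

With these toy values the first three points share one label, the next three share another, and the isolated last point is labelled `-1`.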
@@ -49,22 +52,21 @@ class OPTICS(BaseEstimator, ClusterMixin):
2).

max_eps : float, optional (default=np.inf)
-The maximum distance between two samples for them to be considered
-as in the same neighborhood. Default value of ``np.inf`` will identify
-clusters across all scales; reducing ``max_eps`` will result in
-shorter run times.
+The maximum distance between two samples for one to be considered as
+in the neighborhood of the other. Default value of ``np.inf`` will
+identify clusters across all scales; reducing ``max_eps`` will result
+in shorter run times.

metric : string or callable, optional (default='minkowski')
-metric to use for distance computation. Any metric from scikit-learn
+Metric to use for distance computation. Any metric from scikit-learn
or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each
pair of instances (rows) and the resulting value recorded. The callable
should take two arrays as input and return one value indicating the
distance between them. This works for Scipy's metrics, but is less
-efficient than passing the metric name as a string.
-
-Distance matrices are not supported.
+efficient than passing the metric name as a string. If metric is
+"precomputed", X is assumed to be a distance matrix and must be square.

Valid values for metric are:
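
Note on callable metrics and precomputed input: the sketch below, assuming only NumPy, shows what "called on each pair of instances (rows) and the resulting value recorded" amounts to, and what a square distance matrix suitable for `metric="precomputed"` looks like. The helper names `manhattan` and `pairwise_from_callable` are hypothetical and are not part of scikit-learn.

```python
import numpy as np


def manhattan(u, v):
    """Hypothetical callable metric: takes two 1-D arrays, returns one value."""
    return float(np.abs(u - v).sum())


def pairwise_from_callable(X, metric):
    """Build a square (n_samples, n_samples) distance matrix row pair by row pair."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    D = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            # The callable is evaluated once per unordered pair of rows.
            D[i, j] = D[j, i] = metric(X[i], X[j])
    return D


X = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]
D = pairwise_from_callable(X, manhattan)
```

The Python-level double loop is also a reminder of why the docstring calls a callable less efficient than a metric name passed as a string.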

@@ -94,9 +96,9 @@ class OPTICS(BaseEstimator, ClusterMixin):
reachability and ordering. Possible values are "xi" and "dbscan".

eps : float, optional (default=None)
-The maximum distance between two samples for them to be considered
-as in the same neighborhood. By default it assumes the same value as
-``max_eps``.
+The maximum distance between two samples for one to be considered as
+in the neighborhood of the other. By default it assumes the same value
+as ``max_eps``.
Used only when ``cluster_method='dbscan'``.

xi : float, between 0 and 1, optional (default=0.05)
@@ -219,8 +221,10 @@ def fit(self, X, y=None):

Parameters
----------
-X : array, shape (n_samples, n_features)
-The data.
+X : array, shape (n_samples, n_features), or (n_samples, n_samples) \
+if metric='precomputed'.
Review comment (Member):
I prefer this indented... I don't see any reason to have it flush left.

Suggested change
-if metric='precomputed'.
+    if metric='precomputed'.

Reply (Member, Author):
See #5078, we do similar things in some latest classes, e.g., ColumnTransformer

Reply (Member):
If I had time to pull this into an ancient discussion of this at numpydoc, I would... Thanks

+A feature array, or array of distances between samples if
+metric='precomputed'.

y : ignored
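
Note on `fit` with a distance matrix: a minimal usage sketch, assuming a scikit-learn release that includes this change. The synthetic two-blob data, the random seed, and `min_samples=5` are arbitrary choices made for illustration and are not taken from the PR.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import pairwise_distances

# Arbitrary synthetic data: two well-separated 2-D blobs of 20 points each.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Square (n_samples, n_samples) matrix of pairwise distances.
D = pairwise_distances(X, metric="euclidean")

# With metric="precomputed", fit receives the distance matrix instead of features.
model = OPTICS(min_samples=5, metric="precomputed").fit(D)
labels = model.labels_
```

The shape passed to `fit` is `(n_samples, n_samples)` here, which is the second case the updated `X` description documents.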

@@ -332,31 +336,32 @@ def compute_optics_graph(X, min_samples, max_eps, metric, p, metric_params,

Parameters
----------
-X : array, shape (n_samples, n_features)
-The data.
+X : array, shape (n_samples, n_features), or (n_samples, n_samples) \
+if metric='precomputed'.
Review comment (Member):
Suggested change
-if metric='precomputed'.
+    if metric='precomputed'.

+A feature array, or array of distances between samples if
+metric='precomputed'

min_samples : int (default=5)
The number of samples in a neighborhood for a point to be considered
as a core point. Expressed as an absolute number or a fraction of the
number of samples (rounded to be at least 2).

max_eps : float, optional (default=np.inf)
-The maximum distance between two samples for them to be considered
-as in the same neighborhood. Default value of "np.inf" will identify
-clusters across all scales; reducing `max_eps` will result in
-shorter run times.
+The maximum distance between two samples for one to be considered as
+in the neighborhood of the other. Default value of ``np.inf`` will
+identify clusters across all scales; reducing ``max_eps`` will result
+in shorter run times.

metric : string or callable, optional (default='minkowski')
-metric to use for distance computation. Any metric from scikit-learn
+Metric to use for distance computation. Any metric from scikit-learn
or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each
pair of instances (rows) and the resulting value recorded. The callable
should take two arrays as input and return one value indicating the
distance between them. This works for Scipy's metrics, but is less
-efficient than passing the metric name as a string.
-
-Distance matrices are not supported.
+efficient than passing the metric name as a string. If metric is
+"precomputed", X is assumed to be a distance matrix and must be square.

Valid values for metric are:
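
Note on `min_samples` in this hunk: the docstring says the value is either an absolute count or a fraction of the number of samples, rounded to be at least 2. A small sketch of one way to read that rule follows. The helper name `resolve_min_samples` is hypothetical, and both the threshold (values up to 1 are treated as fractions) and the truncation toward zero are assumptions of this sketch rather than something the diff states.

```python
def resolve_min_samples(min_samples, n_samples):
    """Turn an absolute count or a fraction into an absolute count (sketch)."""
    if min_samples <= 1:
        # Assumption: values up to 1 are fractions of n_samples,
        # truncated to an integer and never allowed below 2.
        return max(2, int(min_samples * n_samples))
    # Values above 1 are taken as an absolute number of samples.
    return int(min_samples)


absolute = resolve_min_samples(5, n_samples=100)
fraction = resolve_min_samples(0.1, n_samples=100)
tiny_fraction = resolve_min_samples(0.001, n_samples=100)
```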

@@ -771,8 +776,7 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
clusters.
"""

-# all indices are inclusive (specially at the end)
-# add an inf to the end of reachability plot
+# Our implementation adds an inf to the end of reachability plot
# this helps to find potential clusters at the end of the
# reachability plot even if there's no upward region at the end of it.
reachability_plot = np.hstack((reachability_plot, np.inf))
@@ -783,6 +787,10 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples,
index = 0
mib = 0. # maximum in between, section 4.3.2

+# Our implementation corrects a mistake in the original
+# paper, i.e., in Definition 9 steep downward point,
+# r(p) * (1 - xi) <= r(p + 1) should be
+# r(p) * (1 - xi) >= r(p + 1)
with np.errstate(invalid='ignore'):
ratio = reachability_plot[:-1] / reachability_plot[1:]
steep_upward = ratio <= xi_complement
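
Note on the steep-point test in this hunk: the sketch below, assuming only NumPy, reproduces the visible lines (appending `inf`, forming the ratio of consecutive reachability values, and the steep-upward test). The steep-downward line is an assumption of this sketch: it is derived by rearranging the corrected inequality from the comment, `r(p) * (1 - xi) >= r(p + 1)`, and is not shown in the diff. The function name `steep_points` and the toy values are hypothetical.

```python
import numpy as np


def steep_points(reachability_plot, xi):
    """Flag steep upward and steep downward points (illustrative sketch)."""
    # Append inf so a cluster at the very end can still be closed.
    r = np.hstack((np.asarray(reachability_plot, dtype=float), np.inf))
    xi_complement = 1 - xi

    # inf / inf yields nan; silence the warning, comparisons with nan are False.
    with np.errstate(invalid="ignore"):
        ratio = r[:-1] / r[1:]

    # r(p) <= r(p + 1) * (1 - xi)
    steep_upward = ratio <= xi_complement
    # r(p) * (1 - xi) >= r(p + 1), rearranged; assumed, not shown in the diff.
    steep_downward = ratio >= 1 / xi_complement
    return steep_upward, steep_downward


# Hypothetical toy reachability values forming a single valley.
up, down = steep_points([1.0, 0.5, 0.5, 1.0], xi=0.1)
```

In this toy valley the first point is flagged as steep downward and the last two as steep upward; the final point qualifies only because of the appended `inf`.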