DOC Improve OPTICS doc #13866
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -7,6 +7,7 @@ | |||||
Authors: Shane Grigsby <refuge@rocktalus.com> | ||||||
Adrin Jalali <adrinjalali@gmail.com> | ||||||
Erich Schubert <erich@debian.org> | ||||||
Hanmin Qin <qinhanmin2005@sina.com> | ||||||
License: BSD 3 clause | ||||||
""" | ||||||
|
||||||
|
@@ -23,13 +24,15 @@ | |||||
class OPTICS(BaseEstimator, ClusterMixin): | ||||||
"""Estimate clustering structure from vector array | ||||||
|
||||||
OPTICS: Ordering Points To Identify the Clustering Structure Closely | ||||||
OPTICS (Ordering Points To Identify the Clustering Structure), closely | ||||||
related to DBSCAN, finds core sample of high density and expands clusters | ||||||
from them [1]_. Unlike DBSCAN, keeps cluster hierarchy for a variable | ||||||
neighborhood radius. Better suited for usage on large datasets than the | ||||||
current sklearn implementation of DBSCAN. | ||||||
|
||||||
Clusters are then extracted using a DBSCAN like method [1]_. | ||||||
Clusters are then extracted using a DBSCAN-like method | ||||||
(cluster_method = 'dbscan') or an automatic | ||||||
technique proposed in [1]_ (cluster_method = 'xi'). | ||||||
|
||||||
This implementation deviates from the original OPTICS by first performing | ||||||
k-nearest-neighborhood searches on all points to identify core sizes, then | ||||||
|
@@ -49,22 +52,21 @@ class OPTICS(BaseEstimator, ClusterMixin): | |||||
2). | ||||||
|
||||||
max_eps : float, optional (default=np.inf) | ||||||
The maximum distance between two samples for them to be considered | ||||||
as in the same neighborhood. Default value of ``np.inf`` will identify | ||||||
clusters across all scales; reducing ``max_eps`` will result in | ||||||
shorter run times. | ||||||
The maximum distance between two samples for one to be considered as | ||||||
in the neighborhood of the other. Default value of ``np.inf`` will | ||||||
identify clusters across all scales; reducing ``max_eps`` will result | ||||||
in shorter run times. | ||||||
|
||||||
metric : string or callable, optional (default='minkowski') | ||||||
metric to use for distance computation. Any metric from scikit-learn | ||||||
Metric to use for distance computation. Any metric from scikit-learn | ||||||
or scipy.spatial.distance can be used. | ||||||
|
||||||
If metric is a callable function, it is called on each | ||||||
pair of instances (rows) and the resulting value recorded. The callable | ||||||
should take two arrays as input and return one value indicating the | ||||||
distance between them. This works for Scipy's metrics, but is less | ||||||
efficient than passing the metric name as a string. | ||||||
|
||||||
Distance matrices are not supported. | ||||||
efficient than passing the metric name as a string. If metric is | ||||||
"precomputed", X is assumed to be a distance matrix and must be square. | ||||||
|
||||||
Valid values for metric are: | ||||||
|
||||||
|
@@ -94,9 +96,9 @@ class OPTICS(BaseEstimator, ClusterMixin): | |||||
reachability and ordering. Possible values are "xi" and "dbscan". | ||||||
|
||||||
eps : float, optional (default=None) | ||||||
The maximum distance between two samples for them to be considered | ||||||
as in the same neighborhood. By default it assumes the same value as | ||||||
``max_eps``. | ||||||
The maximum distance between two samples for one to be considered as | ||||||
in the neighborhood of the other. By default it assumes the same value | ||||||
as ``max_eps``. | ||||||
Used only when ``cluster_method='dbscan'``. | ||||||
|
||||||
xi : float, between 0 and 1, optional (default=0.05) | ||||||
|
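As a rough illustration of the `eps` note in this hunk, the sketch below fits OPTICS twice on made-up data, once with `cluster_method='dbscan'` (where `eps` is read) and once with `cluster_method='xi'` (where it is not). The blob data, the variable names, and the chosen `eps=0.5` are assumptions for illustration only and are not part of the PR.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two well-separated synthetic blobs (illustrative data, not from the PR).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(30, 2)),
])

# DBSCAN-like extraction: eps is only used in this mode.
dbscan_like = OPTICS(min_samples=5, cluster_method="dbscan", eps=0.5).fit(X)

# Xi extraction: eps is not used; xi sets the steepness threshold instead.
xi_based = OPTICS(min_samples=5, cluster_method="xi", xi=0.05).fit(X)

# Count clusters, ignoring the noise label -1.
n_dbscan_clusters = len(set(dbscan_like.labels_) - {-1})
n_xi_clusters = len(set(xi_based.labels_) - {-1})
print(n_dbscan_clusters, n_xi_clusters)
```

Both fits compute the same reachability ordering; only the extraction step that turns it into `labels_` differs.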
@@ -219,8 +221,10 @@ def fit(self, X, y=None): | |||||
|
||||||
Parameters | ||||||
---------- | ||||||
X : array, shape (n_samples, n_features) | ||||||
The data. | ||||||
X : array, shape (n_samples, n_features), or (n_samples, n_samples) \ | ||||||
if metric=’precomputed’. | ||||||
A feature array, or array of distances between samples if | ||||||
metric='precomputed'. | ||||||
|
||||||
y : ignored | ||||||
|
||||||
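A minimal sketch of the `metric='precomputed'` case documented in this hunk, assuming a square Euclidean distance matrix built with SciPy. The random data and the variable names are illustrative assumptions, not taken from the PR.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import OPTICS

rng = np.random.RandomState(0)
features = rng.normal(size=(40, 3))        # shape (n_samples, n_features)

# Square, symmetric distance matrix with a zero diagonal.
distances = squareform(pdist(features))    # shape (n_samples, n_samples)

# With metric="precomputed", X is interpreted as pairwise distances.
model = OPTICS(min_samples=5, metric="precomputed").fit(distances)
print(distances.shape, model.labels_.shape)
```

Passing a non-square array together with `metric='precomputed'` is not meaningful, which is why the docstring states that X must be square in that case.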
|
@@ -332,31 +336,32 @@ def compute_optics_graph(X, min_samples, max_eps, metric, p, metric_params, | |||||
|
||||||
Parameters | ||||||
---------- | ||||||
X : array, shape (n_samples, n_features) | ||||||
The data. | ||||||
X : array, shape (n_samples, n_features), or (n_samples, n_samples) \ | ||||||
if metric=’precomputed’. | ||||||
|
||||||
A feature array, or array of distances between samples if | ||||||
metric='precomputed' | ||||||
|
||||||
min_samples : int (default=5) | ||||||
The number of samples in a neighborhood for a point to be considered | ||||||
as a core point. Expressed as an absolute number or a fraction of the | ||||||
number of samples (rounded to be at least 2). | ||||||
|
||||||
max_eps : float, optional (default=np.inf) | ||||||
The maximum distance between two samples for them to be considered | ||||||
as in the same neighborhood. Default value of "np.inf" will identify | ||||||
clusters across all scales; reducing `max_eps` will result in | ||||||
shorter run times. | ||||||
The maximum distance between two samples for one to be considered as | ||||||
in the neighborhood of the other. Default value of ``np.inf`` will | ||||||
identify clusters across all scales; reducing ``max_eps`` will result | ||||||
in shorter run times. | ||||||
|
||||||
metric : string or callable, optional (default='minkowski') | ||||||
metric to use for distance computation. Any metric from scikit-learn | ||||||
Metric to use for distance computation. Any metric from scikit-learn | ||||||
or scipy.spatial.distance can be used. | ||||||
|
||||||
If metric is a callable function, it is called on each | ||||||
pair of instances (rows) and the resulting value recorded. The callable | ||||||
should take two arrays as input and return one value indicating the | ||||||
distance between them. This works for Scipy's metrics, but is less | ||||||
efficient than passing the metric name as a string. | ||||||
|
||||||
Distance matrices are not supported. | ||||||
efficient than passing the metric name as a string. If metric is | ||||||
"precomputed", X is assumed to be a distance matrix and must be square. | ||||||
|
||||||
Valid values for metric are: | ||||||
|
||||||
|
@@ -771,8 +776,7 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples, | |||||
clusters. | ||||||
""" | ||||||
|
||||||
# all indices are inclusive (specially at the end) | ||||||
# add an inf to the end of reachability plot | ||||||
# Our implementation adds an inf to the end of reachability plot | ||||||
# this helps to find potential clusters at the end of the | ||||||
# reachability plot even if there's no upward region at the end of it. | ||||||
reachability_plot = np.hstack((reachability_plot, np.inf)) | ||||||
|
@@ -783,6 +787,10 @@ def _xi_cluster(reachability_plot, predecessor_plot, ordering, xi, min_samples, | |||||
index = 0 | ||||||
mib = 0. # maximum in between, section 4.3.2 | ||||||
|
||||||
# Our implementation corrects a mistake in the original | ||||||
# paper, i.e., in Definition 9 steep downward point, | ||||||
# r(p) * (1 - x1) <= r(p + 1) should be | ||||||
# r(p) * (1 - x1) >= r(p + 1) | ||||||
with np.errstate(invalid='ignore'): | ||||||
ratio = reachability_plot[:-1] / reachability_plot[1:] | ||||||
steep_upward = ratio <= xi_complement | ||||||
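The two code comments in these hunks (padding the reachability plot with `inf`, and the corrected steep-downward condition) can be checked on a toy example. This is a sketch under assumptions: the reachability values are made up, `x1` in the comment is read as xi, and the `steep_downward` line is written here to match the corrected inequality rather than copied from the diff.

```python
import numpy as np

xi = 0.05
xi_complement = 1 - xi

# Toy reachability plot (made-up values): one sharp drop, then a rise.
reachability_plot = np.array([3.0, 1.0, 1.0, 1.1, 4.0])

# Pad with inf so the last point still has a successor to compare against.
padded = np.hstack((reachability_plot, np.inf))

# ratio[p] = r(p) / r(p + 1)
ratio = padded[:-1] / padded[1:]

# Steep upward point: r(p) <= r(p + 1) * (1 - xi)
steep_upward = ratio <= xi_complement

# Steep downward point, corrected form: r(p) * (1 - xi) >= r(p + 1)
steep_downward = ratio >= 1 / xi_complement

print(steep_upward.tolist())    # [False, False, True, True, True]
print(steep_downward.tolist())  # [True, False, False, False, False]
```

With the inequality as printed in the paper (`<=`), every point that does not drop steeply would qualify as steep downward, which is the opposite of the intended meaning. The `inf` padding makes the final point count as steep upward, so a cluster that runs to the end of the plot can still be closed.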
|
I prefer this indented... I don't see any reason to have it flush left.
See #5078, we do similar things in some latest classes, e.g., ColumnTransformer
If I had time to pull this into an ancient discussion of this at numpydoc, I would... Thanks