@@ -8,10 +8,10 @@ User guide
==========


.. toctree::
-   :numbered:
+   :numbered:

-   modules/eigenpro.rst
-   modules/robust.rst
+   modules/eigenpro.rst
+   modules/robust.rst


.. _k_medoids:

@@ -44,8 +44,8 @@ clusters. This makes it more suitable for smaller datasets in comparison to

.. topic:: Examples:

-  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
-    with various distance metrics.
+  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
+    with various distance metrics.
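+
+ A minimal usage sketch in the spirit of that example (the parameter values
+ here are illustrative only)::
+
+     from sklearn.datasets import load_digits
+     from sklearn_extra.cluster import KMedoids
+
+     X, _ = load_digits(return_X_y=True)
+
+     # K-Medoids with a non-Euclidean metric; the returned cluster centers
+     # are actual samples from X (the medoids).
+     model = KMedoids(n_clusters=10, metric="manhattan", random_state=0).fit(X)
+     labels = model.labels_
+     medoids = model.cluster_centers_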


**Algorithm description:**
@@ -64,7 +64,126 @@ This version works as follows:
   maximum number of iterations ``max_iter`` is reached.

.. topic:: References:

-  * Maranzana, F.E., 1963. On the location of supply points to minimize
-    transportation costs. IBM Systems Journal, 2(2), pp.129-135.
-  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
-    clustering. Expert systems with applications, 36(2), pp.3336-3341.
+
+  * Maranzana, F.E., 1963. On the location of supply points to minimize
+    transportation costs. IBM Systems Journal, 2(2), pp.129-135.
+  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
+    clustering. Expert systems with applications, 36(2), pp.3336-3341.
+
+ .. _commonnn:
+
+ Common-nearest-neighbors clustering
+ ===================================
+
+ :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+ provides an interface to density-based
+ common-nearest-neighbors clustering. Density-based clustering identifies
+ clusters as dense regions of high point density, separated by sparse
+ regions of lower density. Common-nearest-neighbors clustering
+ approximates local density as the number of shared (common) neighbors
+ between two points with respect to a neighbor search radius. A density
+ threshold (density criterion), defined by the cluster parameters
+ ``min_samples`` (number of common neighbors) and ``eps`` (search radius),
+ is used to distinguish high from low density. A high value of
+ ``min_samples`` and a low value of ``eps`` correspond to high density.
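+
+ A minimal sketch of how these two parameters are set (the values here are
+ arbitrary and would need tuning for real data)::
+
+     from sklearn.datasets import make_blobs
+     from sklearn_extra.cluster import CommonNNClustering
+
+     X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
+
+     # A larger min_samples and a smaller eps demand denser regions.
+     clustering = CommonNNClustering(eps=1.0, min_samples=5).fit(X)
+     labels = clustering.labels_  # one label per point, -1 marks noise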
+
+ As such the method is related to other density-based clustering algorithms
+ like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or Jarvis-Patrick. DBSCAN
+ approximates local density as the number of points in the neighborhood
+ of a single point. The Jarvis-Patrick algorithm uses the number of
+ common neighbors shared by two points among the :math:`k` nearest neighbors.
+ As these approaches each provide a different notion of how density is
+ estimated from point samples, they can be used complementarily (a small
+ comparison sketch follows the list below). Their relative suitability
+ for a clustering problem depends on the nature of the clustered data.
+ Common-nearest-neighbors clustering (as density-based clustering in
+ general) has the following advantages over other clustering techniques:
+
+ * The cluster result is deterministic. The same set of cluster
+   parameters always leads to the same classification for a data set.
+   A different ordering of the data set leads to a different ordering
+   of the cluster assignment, but does not change the assignment
+   qualitatively.
+ * Little prior knowledge about the data is required, e.g. the number
+   of resulting clusters does not need to be known beforehand (although
+   cluster parameters need to be tuned to obtain a desired result).
+ * Identified clusters are not restricted in their shape or size.
+ * Points can be considered noise (outliers) if they do not fulfil
+   the density criterion.
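+
+ The comparison sketch referred to above: :class:`DBSCAN <sklearn.cluster.DBSCAN>`
+ and common-nearest-neighbors clustering run side by side on the same toy
+ data (the data set and parameter values are arbitrary, and ``min_samples``
+ means something different in each estimator)::
+
+     from sklearn.cluster import DBSCAN
+     from sklearn.datasets import make_moons
+     from sklearn_extra.cluster import CommonNNClustering
+
+     X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)
+
+     # DBSCAN counts the neighbors of a single point; common-nearest-neighbors
+     # counts the neighbors shared by a pair of points.
+     db_labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_
+     cnn_labels = CommonNNClustering(eps=0.3, min_samples=5).fit(X).labels_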
+
+ The common-nearest-neighbors algorithm tests the density criterion for
+ pairs of neighbors (do they have at least ``min_samples`` points in the
+ intersection of their neighborhoods at a radius ``eps``?). Two points that
+ fulfil this criterion are directly part of the same dense data region,
+ i.e. they are *density reachable*. A *density connected* network of
+ density reachable points (a connected component if density reachability
+ is viewed as a graph structure) constitutes a separated dense region and
+ therefore a cluster. Note that, for example, in contrast to
+ :class:`DBSCAN <sklearn.cluster.DBSCAN>` there is no differentiation between
+ *core* points (dense points) and *edge* points (points that are not dense
+ themselves but neighbors of dense points). The assignment of points on
+ the cluster rims to a cluster is possible, but can be ambiguous. The
+ cluster result is returned as a 1D container of labels, i.e. a sequence
+ of zero-based integers of length :math:`n` for a data set of :math:`n`
+ points, denoting the assignment of points to a specific cluster. Noise is
+ labeled with ``-1``. Valid clusters have at least two members. The
+ clusters are not sorted by cluster member count. In some cases the
+ algorithm tends to identify small clusters that can be filtered out
+ manually.
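+
+ A sketch of such manual post-processing of the returned labels (the
+ five-member threshold is arbitrary)::
+
+     import numpy as np
+
+     from sklearn.datasets import make_blobs
+     from sklearn_extra.cluster import CommonNNClustering
+
+     X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
+     labels = CommonNNClustering(eps=1.0, min_samples=5).fit(X).labels_
+
+     n_noise = (labels == -1).sum()  # points labeled -1 are noise
+
+     # Relabel clusters with fewer than five members as noise by hand.
+     values, counts = np.unique(labels[labels != -1], return_counts=True)
+     labels[np.isin(labels, values[counts < 5])] = -1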
+
+ .. topic:: Examples:
+
+   * :ref:`examples/cluster/plot_commonnn.py <sphx_glr_auto_examples_plot_commonnn.py>`:
+     Basic usage of the
+     :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+   * :ref:`examples/cluster/plot_commonnn_data_sets.py <sphx_glr_auto_examples_plot_commonnn_data_sets.py>`:
+     Common-nearest-neighbors clustering of toy data sets
+
+ .. topic:: Implementation:
+
+   The present implementation of the common-nearest-neighbors algorithm in
+   :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+   shares some commonalities with the current scikit-learn implementation
+   of :class:`DBSCAN <sklearn.cluster.DBSCAN>`. It computes neighborhoods
+   from points in bulk with
+   :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before
+   the actual clustering. Consequently, storing the neighborhoods requires
+   memory on the order of :math:`O(n \cdot n_n)` for :math:`n` points in the
+   data set, where :math:`n_n` is the average number of neighbors (which is
+   proportional to ``eps``), that is at worst :math:`O(n^2)`. Depending on
+   the input structure (dense or sparse points or similarity matrix) the
+   additional memory demand varies. The clustering itself follows a
+   breadth-first-search scheme, checking the density criterion at every
+   node expansion. Its run time is roughly proportional to the number of
+   data points :math:`n`, the total number of neighbors :math:`N`, and the
+   value of ``min_samples``. For density-based clustering schemes with
+   lower memory demand, also consider:
+
+   * :class:`OPTICS <sklearn.cluster.OPTICS>` – density-based clustering
+     related to DBSCAN that uses an ``eps`` value range (a short sketch
+     follows this list).
+   * `cnnclustering <https://pypi.org/project/cnnclustering/>`_ – a
+     different implementation of common-nearest-neighbors clustering.
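+
+   The OPTICS sketch mentioned above, for comparison (parameter values
+   are arbitrary)::
+
+       from sklearn.cluster import OPTICS
+       from sklearn.datasets import make_blobs
+
+       X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
+
+       # OPTICS orders points by reachability over a range of eps values
+       # rather than storing every neighborhood up front.
+       labels = OPTICS(min_samples=5, max_eps=2.0).fit(X).labels_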
+
+ .. topic:: Notes:
+
+   * :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to
+     specify data point weights with ``sample_weight``. This feature is
+     currently experimental for :class:`CommonNNClustering`, as weights
+     are not well defined for checking the common-nearest-neighbor
+     density criterion. It should not be used in production yet.
+
+ .. topic:: References:
+
+   * B. Keller, X. Daura, W. F. van Gunsteren, "Comparing Geometric and
+     Kinetic Cluster Algorithms for Molecular Simulation Data",
+     J. Chem. Phys., 2010, 132, 074110.
+
+   * O. Lemke, B. G. Keller, "Density-based Cluster Algorithms for the
+     Identification of Core Sets", J. Chem. Phys., 2016, 145, 164104.
+
+   * O. Lemke, B. G. Keller, "Common nearest neighbor clustering - a
+     benchmark", Algorithms, 2018, 11, 19.