ENH Add density based CNN clustering to sklearn_extra.cluster (#64) · rth/scikit-learn-extra@48bff71 · GitHub
Commit 48bff71

ENH Add density based CNN clustering to sklearn_extra.cluster (scikit-learn-contrib#64)
1 parent 476f11f commit 48bff71

11 files changed: +1336 −13 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ __pycache__/
 *$py.class

 *.c
+*.cpp

 # C extensions
 *.so

doc/api.rst

Lines changed: 2 additions & 1 deletion
@@ -31,7 +31,8 @@ Clustering
    :template: class.rst

    cluster.KMedoids
-
+   cluster.CommonNNClustering
+
 Robust
 ====================

doc/user_guide.rst

Lines changed: 128 additions & 9 deletions
@@ -8,10 +8,10 @@ User guide
 ==========

 .. toctree::
-   :numbered:
+   :numbered:

-   modules/eigenpro.rst
-   modules/robust.rst
+   modules/eigenpro.rst
+   modules/robust.rst

 .. _k_medoids:

@@ -44,8 +44,8 @@ clusters. This makes it more suitable for smaller datasets in comparison to

 .. topic:: Examples:

-  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
-    with various distance metrics.
+  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
+    with various distance metrics.


 **Algorithm description:**
@@ -64,7 +64,126 @@ This version works as follows:
   maximum number of iterations ``max_iter`` is reached.

 .. topic:: References:
-  * Maranzana, F.E., 1963. On the location of supply points to minimize
-    transportation costs. IBM Systems Journal, 2(2), pp.129-135.
-  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
-    clustering. Expert systems with applications, 36(2), pp.3336-3341.
+
+  * Maranzana, F.E., 1963. On the location of supply points to minimize
+    transportation costs. IBM Systems Journal, 2(2), pp.129-135.
+  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
+    clustering. Expert systems with applications, 36(2), pp.3336-3341.
+
+.. _commonnn:
+
+Common-nearest-neighbors clustering
+===================================
+
+:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+provides an interface to density-based common-nearest-neighbors
+clustering. Density-based clustering identifies clusters as regions of
+high point density, separated by regions of lower density.
+Common-nearest-neighbors clustering approximates local density as the
+number of shared (common) neighbors between two points with respect to
+a neighbor search radius. A density threshold (density criterion) –
+defined by the cluster parameters ``min_samples`` (number of common
+neighbors) and ``eps`` (search radius) – is used to distinguish high
+from low density. A high value of ``min_samples`` and a low value of
+``eps`` correspond to high density.
+
+As such, the method is related to other density-based clustering
+algorithms like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or
+Jarvis-Patrick. DBSCAN approximates local density as the number of
+points in the neighborhood of a single point. The Jarvis-Patrick
+algorithm uses the number of common neighbors shared by two points
+among their :math:`k` nearest neighbors. Since these approaches each
+provide a different notion of how density is estimated from point
+samples, they can be used complementarily. Their relative suitability
+for a classification problem depends on the nature of the clustered
+data. Common-nearest-neighbors clustering (as density-based clustering
+in general) has the following advantages over other clustering
+techniques:
+
+  * The cluster result is deterministic. The same set of cluster
+    parameters always leads to the same classification for a data set.
+    A different ordering of the data set leads to a different ordering
+    of the cluster assignment, but does not change the assignment
+    qualitatively.
+  * Little prior knowledge about the data is required, e.g. the number
+    of resulting clusters does not need to be known beforehand
+    (although cluster parameters need to be tuned to obtain a desired
+    result).
+  * Identified clusters are not restricted in their shape or size.
+  * Points can be considered noise (outliers) if they do not fulfil
+    the density criterion.
+
+The common-nearest-neighbors algorithm tests the density criterion for
+pairs of neighbors (do they have at least ``min_samples`` points in the
+intersection of their neighborhoods at a radius ``eps``?). Two points
+that fulfil this criterion are directly part of the same dense data
+region, i.e. they are *density reachable*. A *density connected*
+network of density-reachable points (a connected component if density
+reachability is viewed as a graph structure) constitutes a separate
+dense region and therefore a cluster. Note that, in contrast to
+:class:`DBSCAN <sklearn.cluster.DBSCAN>` for example, there is no
+distinction between *core* points (dense points) and *edge* points
+(points that are not dense themselves but neighbors of dense points).
+The assignment of points on the cluster rims to a cluster is possible,
+but can be ambiguous. The cluster result is returned as a 1D container
+of labels, i.e. a sequence of integers (zero-based) of length :math:`n`
+for a data set of :math:`n` points, denoting the assignment of points
+to a specific cluster. Noise is labeled with ``-1``. Valid clusters
+have at least two members. The clusters are not sorted by cluster
+member count. In some cases the algorithm tends to identify small
+clusters that can be filtered out manually.
+
+.. topic:: Examples:
+
+  * :ref:`examples/cluster/plot_commonnn.py <sphx_glr_auto_examples_plot_commonnn.py>`
+    Basic usage of the
+    :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+  * :ref:`examples/cluster/plot_commonnn_data_sets.py <sphx_glr_auto_examples_plot_commonnn_data_sets.py>`
+    Common-nearest-neighbors clustering of toy data sets
+
+.. topic:: Implementation:
+
+  The present implementation of the common-nearest-neighbors algorithm in
+  :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
+  shares some commonalities with the current scikit-learn implementation
+  of :class:`DBSCAN <sklearn.cluster.DBSCAN>`. It computes neighborhoods
+  for all points in bulk with
+  :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before
+  the actual clustering. Consequently, storing the neighborhoods requires
+  memory on the order of :math:`O(n \cdot n_n)` for :math:`n` points in
+  the data set, where :math:`n_n` is the average number of neighbors
+  (which grows with ``eps``), i.e. at worst :math:`O(n^2)`. Depending on
+  the input structure (dense or sparse points, or a similarity matrix)
+  the additional memory demand varies. The clustering itself follows a
+  breadth-first-search scheme, checking the density criterion at every
+  node expansion. The time complexity is roughly linear in the number of
+  data points :math:`n`, the total number of neighbors :math:`N`, and
+  the value of ``min_samples``. For density-based clustering schemes
+  with lower memory demand, also consider:
+
+  * :class:`OPTICS <sklearn.cluster.OPTICS>` – density-based clustering
+    related to DBSCAN that uses a range of ``eps`` values.
+  * `cnnclustering <https://pypi.org/project/cnnclustering/>`_ – a
+    different implementation of common-nearest-neighbors clustering.
+
+.. topic:: Notes:
+
+  * :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to
+    specify data point weights with ``sample_weight``. This feature is
+    currently experimental for :class:`CommonNNClustering`, as weights
+    are not well defined for checking the common-nearest-neighbors
+    density criterion. It should not yet be used in production.
+
+.. topic:: References:
+
+  * B. Keller, X. Daura, W. F. van Gunsteren, "Comparing Geometric and
+    Kinetic Cluster Algorithms for Molecular Simulation Data", J. Chem.
+    Phys., 2010, 132, 074110.
+
+  * O. Lemke, B. G. Keller, "Density-based Cluster Algorithms for the
+    Identification of Core Sets", J. Chem. Phys., 2016, 145, 164104.
+
+  * O. Lemke, B. G. Keller, "Common Nearest Neighbor Clustering – a
+    Benchmark", Algorithms, 2018, 11, 19.
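
The density criterion that the new user-guide section describes can be sketched in a few lines of Python. The snippet below is illustrative only and not part of this changeset; whether the two query points themselves count toward the common-neighbor total, and the parameter values used, are assumptions made for the sketch.

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.4, random_state=0)
eps, min_samples = 0.5, 5

# Neighborhoods are computed in bulk, as the implementation note above suggests.
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = [
    set(ind) for ind in nn.radius_neighbors(X, return_distance=False)
]


def density_reachable(i, j):
    """Density criterion for a pair of points i, j (illustrative only)."""
    if j not in neighborhoods[i]:
        return False
    # Count common neighbors; excluding i and j themselves is a convention
    # chosen for this sketch and may differ from the library's exact rule.
    common = (neighborhoods[i] & neighborhoods[j]) - {i, j}
    return len(common) >= min_samples


print(density_reachable(0, 1))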

examples/plot_commonnn.py

Lines changed: 82 additions & 0 deletions
# -*- coding: utf-8 -*-
"""
=========================================
Common-nearest-neighbor clustering demo I
=========================================

Common-nearest neighbor clustering of data points following a density
criterion. Two points will be part of the same cluster if they share a
minimum number of common neighbors. Read more in the :ref:`User Guide
<commonnn>`.

"""
import matplotlib.pyplot as plt
import numpy as np

from sklearn_extra.cluster import CommonNNClustering
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


print(__doc__)

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute common-nearest-neighbor clustering
cobj = CommonNNClustering(eps=0.3, min_samples=8).fit(X)
labels = cobj.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print(
    "Adjusted Rand Index: %0.3f"
    % metrics.adjusted_rand_score(labels_true, labels)
)
print(
    "Adjusted Mutual Information: %0.3f"
    % metrics.adjusted_mutual_info_score(labels_true, labels)
)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [
    plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))
]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = labels == k

    xy = X[class_member_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=6,
    )

plt.title("Estimated number of clusters: %d" % n_clusters_)
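
The user guide notes that the algorithm can produce small clusters that one may want to filter out manually. A possible post-processing step, sketched here as a hypothetical helper that is not part of this commit (the ``min_size`` threshold is an arbitrary choice), could relabel such clusters as noise:

import numpy as np


def filter_small_clusters(labels, min_size=10):
    """Relabel clusters with fewer than min_size members as noise (-1)."""
    labels = np.asarray(labels).copy()
    unique, counts = np.unique(labels[labels != -1], return_counts=True)
    for cluster_label, count in zip(unique, counts):
        if count < min_size:
            labels[labels == cluster_label] = -1
    return labels


# For example, applied to the labels computed in the demo above:
# labels = filter_small_clusters(cobj.labels_, min_size=10)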

examples/plot_commonnn_data_sets.py

Lines changed: 110 additions & 0 deletions
"""
==========================================
Common-nearest-neighbor clustering demo II
==========================================

Common-nearest neighbor clustering of data points following a density
criterion. Two points will be part of the same cluster if they share a
minimum number of common neighbors. Read more in the :ref:`User Guide
<commonnn>`. Compare this example to the results for
`other <https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html>`_
cluster algorithms.

"""

import matplotlib.pyplot as plt
import numpy as np

from sklearn_extra.cluster import CommonNNClustering
from sklearn import datasets
from sklearn.preprocessing import StandardScaler


print(__doc__)


np.random.seed(42)
n = 2000

# circles
circles, _ = datasets.make_circles(
    n_samples=n, factor=0.5, noise=0.05, random_state=10
)

circles = StandardScaler().fit_transform(circles)

# blobs
blobs, _ = datasets.make_blobs(
    centers=[[-9, -8], [11, -10], [12, 12]], n_samples=n, random_state=10
)

blobs = StandardScaler().fit_transform(blobs)

# moons
moons, _ = datasets.make_moons(n_samples=n, noise=0.05, random_state=10)

moons = StandardScaler().fit_transform(moons)

# no_structure
no_structure = np.random.rand(n, 2)
no_structure = StandardScaler().fit_transform(no_structure)

# aniso
X, y = datasets.make_blobs(n_samples=n, random_state=170)

transformation = [[0.6, -0.6], [-0.4, 0.8]]
aniso = np.dot(X, transformation)
aniso = StandardScaler().fit_transform(aniso)

# varied
varied, _ = datasets.make_blobs(
    n_samples=n, cluster_std=[1.0, 2, 0.5], random_state=170
)

varied = StandardScaler().fit_transform(varied)

fits = [
    ("circles", circles, {"eps": 0.2, "min_samples": 5}),
    ("moons", moons, {"eps": 0.2, "min_samples": 5}),
    ("varied", varied, {"eps": 0.2, "min_samples": 15}),
    ("aniso", aniso, {"eps": 0.18, "min_samples": 12}),
    ("blobs", blobs, {"eps": 0.2, "min_samples": 5}),
    ("none", no_structure, {"eps": 0.2, "min_samples": 5}),
]

fig, ax = plt.subplots(2, 3)
ax = ax.flatten()
for index, (name, data, params) in enumerate(fits):
    cobj = CommonNNClustering(**params).fit(data)
    labels = cobj.labels_
    ax[index].plot(
        *data[np.where(labels == -1)[0]].T,
        linestyle="",
        color="None",
        marker="o",
        markersize=4,
        markerfacecolor="gray",
        markeredgecolor="k",
    )

    for cluster_number in range(0, int(np.max(labels)) + 1):
        ax[index].plot(
            *data[np.where(labels == cluster_number)[0]].T,
            linestyle="",
            marker="o",
            markersize=4,
            markeredgecolor="k",
        )

    ax[index].set(
        **{
            "xlabel": None,
            "ylabel": None,
            "xlim": (-2.5, 2.5),
            "ylim": (-2.5, 2.5),
            "xticks": (),
            "yticks": (),
            "aspect": "equal",
            "title": name,
        }
    )
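
The ``eps``/``min_samples`` pairs in ``fits`` above were evidently picked per data set, in line with the user-guide remark that cluster parameters need to be tuned. A rough, illustrative way to explore them (a sketch, not part of the commit; the helper name and grid values are assumptions) is to scan a small grid and inspect the resulting cluster count and noise fraction:

import numpy as np

from sklearn_extra.cluster import CommonNNClustering


def scan_parameters(data, eps_values, min_samples_values):
    """Print cluster count and noise fraction for each parameter pair."""
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = CommonNNClustering(
                eps=eps, min_samples=min_samples
            ).fit(data).labels_
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            noise_fraction = np.mean(labels == -1)
            print(
                "eps=%.2f min_samples=%2d -> %d clusters, %.1f%% noise"
                % (eps, min_samples, n_clusters, 100 * noise_fraction)
            )


# For example: scan_parameters(moons, [0.1, 0.2, 0.3], [5, 10, 20])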

setup.py

Lines changed: 7 additions & 1 deletion
@@ -62,7 +62,13 @@
             "sklearn_extra.utils._cyfht",
             ["sklearn_extra/utils/_cyfht.pyx"],
             include_dirs=[np.get_include()],
-        )
+        ),
+        Extension(
+            "sklearn_extra.cluster._commonnn_inner",
+            ["sklearn_extra/cluster/_commonnn_inner.pyx"],
+            include_dirs=[np.get_include()],
+            language="c++",
+        ),
     ]
 ),
 "cmdclass": dict(build_ext=build_ext),

sklearn_extra/cluster/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
 from ._k_medoids import KMedoids
+from ._commonnn import commonnn, CommonNNClustering

-__all__ = ["KMedoids"]
+__all__ = ["KMedoids", "CommonNNClustering", "commonnn"]
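
The package thus exposes both an estimator and a functional entry point. A minimal usage sketch follows; the diff does not show the signature of ``commonnn``, so the functional call is only assumed to mirror the estimator's ``eps``/``min_samples`` parameters (in the spirit of ``sklearn.cluster.dbscan`` vs. ``DBSCAN``) and is left commented out.

from sklearn.datasets import make_blobs
from sklearn_extra.cluster import CommonNNClustering, commonnn

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

# Estimator interface, as used in the examples above.
labels = CommonNNClustering(eps=0.3, min_samples=8).fit(X).labels_

# Functional interface (assumed signature, see note above):
# labels = commonnn(X, eps=0.3, min_samples=8)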
