@@ -55,20 +55,20 @@ Overview of clustering methods
      - number of clusters
      - Very large `n_samples`, medium `n_clusters` with
        :ref:`MiniBatch code <mini_batch_kmeans>`
-     - General-purpose, even cluster size, flat geometry, not too many clusters
-     - Distances between points
+     - General-purpose, even cluster size, flat geometry, not too many clusters
+     - Distances between points

    * - :ref:`Affinity propagation <affinity_propagation>`
-     - damping, sample preference
+     - damping, sample preference
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
      - Graph distance (e.g. nearest-neighbor graph)

    * - :ref:`Mean-shift <mean_shift>`
-     - bandwidth
+     - bandwidth
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
-     - Distances between points
+     - Distances between points

    * - :ref:`Spectral clustering <spectral_clustering>`
      - number of clusters
@@ -80,13 +80,13 @@ Overview of clustering methods
      - number of clusters
      - Large `n_samples` and `n_clusters`
      - Many clusters, possibly connectivity constraints
-     - Distances between points
+     - Distances between points

    * - :ref:`DBSCAN <dbscan>`
      - neighborhood size
      - Very large `n_samples`, medium `n_clusters`
      - Non-flat geometry, uneven cluster sizes
-     - Distances between nearest points
+     - Distances between nearest points

    * - :ref:`Gaussian mixtures <mixture>`
      - many
@@ -116,7 +116,7 @@ be specified. It scales well to large number of samples and has been used
 across a large range of application areas in many different fields. It is
 also equivalent to the expectation-maximization algorithm when setting the
 covariance matrix to be diagonal, equal and small. The K-means algorithm
-aims to choose centroids :math:`C` that minimise the within cluster sum of
+aims to choose centroids :math:`C` that minimise the within cluster sum of
 squares objective function with a dataset :math:`X` with :math:`n` samples:

 .. math:: J(X, C) = \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)
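As a rough illustration of the objective above (a sketch only, not the library's implementation; the toy arrays are invented), the value of J(X, C) for a fixed set of centroids can be computed directly with NumPy::

    import numpy as np

    # Hypothetical toy data: n samples in 2-D and a candidate set of centroids C.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
    C = np.array([[1.2, 1.9], [8.1, 8.0]])

    # Squared distance from every sample to every centroid, then the minimum
    # over centroids, summed over samples: the within-cluster sum of squares.
    sq_dists = ((X[:, np.newaxis, :] - C[np.newaxis, :, :]) ** 2).sum(axis=2)
    J = sq_dists.min(axis=1).sum()
    print(J)

This is the quantity K-means tries to minimise over both the assignment and the centroid positions (it is also what the fitted estimator reports as `inertia_`).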
@@ -156,8 +156,8 @@ centroids to be (generally) distant from each other, leading to provably better
 results than random initialisation.

 A parameter can be given to allow K-means to be run in parallel, called
-`n_jobs`. Giving this parameter a positive value uses that many processors
-(default=1). A value of -1 uses all processors, with -2 using one less, and so
+`n_jobs`. Giving this parameter a positive value uses that many processors
+(default=1). A value of -1 uses all processors, with -2 using one less, and so
 on. Parallelization generally speeds up computation at the cost of memory (in
 this case, multiple copies of centroids need to be stored, one for each job).

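A minimal sketch of the parallel option described above (whether `KMeans` accepts `n_jobs` depends on the scikit-learn version in use; the dataset and parameter values here are made up)::

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy dataset with four blobs; the sizes are arbitrary.
    X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

    # n_jobs=-1 uses all processors, -2 all but one, as described in the text.
    # Note: n_jobs was accepted by KMeans in older scikit-learn releases only.
    km = KMeans(n_clusters=4, n_jobs=-1, random_state=0).fit(X)
    print(km.inertia_)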
@@ -500,16 +500,17 @@ separated by areas of low density. Due to this rather generic view, clusters
 found by DBSCAN can be any shape, as opposed to k-means which assumes that
 clusters are convex shaped. The central component to the DBSCAN is the concept
 of *core samples*, which are samples that are in areas of high density. A
-cluster is therefore a set of core samples, each highly similar to each other
-and a set of non-core samples that are similar to a core sample (but are not
+cluster is therefore a set of core samples, each close to each other
+(measured by some distance measure)
+and a set of non-core samples that are close to a core sample (but are not
 themselves core samples). There are two parameters to the algorithm,
-`min_points` and `eps`, which define formally what we mean when we say *dense*.
-A higher `min_points` or lower `eps` indicate higher density necessary to form
+`min_samples` and `eps`, which define formally what we mean when we say *dense*.
+A higher `min_samples` or lower `eps` indicate higher density necessary to form
 a cluster.

 More formally, we define a core sample as being a sample in the dataset such
-that there exists `min_samples` other samples with a similarity higher than
-`eps` to it, which are defined as *neighbors* of the core sample. This tells
+that there exist `min_samples` other samples within a distance of
+`eps`, which are defined as *neighbors* of the core sample. This tells
 us that the core sample is in a dense area of the vector space. A cluster
 is a set of core samples, that can be built by recursively taking a core
 sample, finding all of its neighbors that are core samples, finding all of
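A minimal sketch of how `eps` and `min_samples` enter in practice (the data and parameter values below are arbitrary)::

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two small dense groups plus one isolated point.
    X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
                  [50.0, 50.0]])

    db = DBSCAN(eps=0.5, min_samples=3).fit(X)
    print(db.labels_)               # one cluster label per sample
    print(db.core_sample_indices_)  # indices of the core samples found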
@@ -520,24 +521,24 @@ are on the fringes of a cluster.

 Any core sample is part of a cluster, by definition. Further, any cluster has
 at least `min_samples` points in it, following the definition of a core
-sample. For any sample that is not a core sample, and does not have a
-similarity higher than `eps` to a core sample, it is considered an outlier by
+sample. For any sample that is not a core sample, and does have a
+distance higher than `eps` to any core sample, it is considered an outlier by
 the algorithm.

 The algorithm is non-deterministic, however the core samples themselves will
 always belong to the same clusters (although the labels themselves may be
 different). The non-determinism comes from deciding on which cluster a
-non-core sample belongs to. A non-core sample can be have a similarity higher
+non-core sample belongs to. A non-core sample can have a distance lower
 than `eps` to two core samples in different classes. Following from the
-triangular inequality, those two core samples would be less similar than
+triangular inequality, those two core samples would be more distant than
 `eps` from each other -- else they would be in the same class. The non-core
 sample is simply assigned to whichever cluster is generated first, where
 the order is determined randomly within the code. Other than the ordering of
 the dataset, the algorithm is deterministic, making the results relatively
 stable between iterations on the same data.

 In the figure below, the color indicates cluster membership, with large circles
-indicating core samples found by the algorithm. Smaller circles are non-core
+indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
 by black points below.

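In scikit-learn's implementation, the outliers described above are reported with the label -1 in `labels_`, while non-core members of a cluster share the cluster label but are absent from `core_sample_indices_`. A rough sketch with arbitrary data::

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],  # dense region -> core samples
                  [5.0, 5.0]])                          # far from everything -> outlier

    db = DBSCAN(eps=0.3, min_samples=3).fit(X)
    core_mask = np.zeros(len(X), dtype=bool)
    core_mask[db.core_sample_indices_] = True

    print(db.labels_)   # the isolated point receives the noise label -1
    print(core_mask)    # True for core samples, False otherwise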
@@ -819,7 +820,7 @@ Drawbacks

 * :ref:`example_cluster_plot_adjusted_for_chance_measures.py`: Analysis of
   the impact of the dataset size on the value of clustering measures
-  for random assignments. This example also includes the Adjusted Rand
+  for random assignments. This example also includes the Adjusted Rand
   Index.

@@ -864,7 +865,7 @@ following equation, from Vinh, Epps, and Bailey, (2009). In this equation,
    \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!
    (N-a_i-b_j+n_{ij})!}

-Using the expected value, the adjusted mutual information can then be
+Using the expected value, the adjusted mutual information can then be
 calculated using a similar form to that of the adjusted Rand index:

 .. math:: \text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\max(H(U), H(V)) - E[\text{MI}]}
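The adjusted value defined above is exposed as `metrics.adjusted_mutual_info_score`; a short sketch with made-up label assignments::

    from sklearn import metrics

    # Two labelings of the same six samples (arbitrary example values).
    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    # Adjusted for chance: random assignments score near 0, perfect matches score 1.
    print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))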
@@ -875,7 +876,7 @@ calculated using a similar form to that of the adjusted Rand index:
   knowledge reuse framework for combining multiple partitions". Journal of
   Machine Learning Research 3: 583–617. doi:10.1162/153244303321897735

- * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
+ * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
   for clusterings comparison". Proceedings of the 26th Annual International
   Conference on Machine Learning - ICML '09.
   doi:10.1145/1553374.1553511. ISBN 9781605585161.
@@ -1045,7 +1046,7 @@ mean of homogeneity and completeness**:

 .. [B2011] `Identification and Characterization of Events in Social Media
    <http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
-   Becker, PhD Thesis.
+   Becker, PhD Thesis.

 .. _silhouette_coefficient:

@@ -1073,7 +1074,7 @@ The Silhouette Coefficient *s* for a single sample is then given as:

 .. math:: s = \frac{b - a}{\max(a, b)}

-The Silhouette Coefficient for a set of samples is given as the mean of the
+The Silhouette Coefficient for a set of samples is given as the mean of the
 Silhouette Coefficient for each sample.

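For a single sample the formula is simple to evaluate by hand; with made-up values a = 0.2 (mean intra-cluster distance) and b = 0.5 (mean distance to the nearest other cluster)::

    a, b = 0.2, 0.5
    s = (b - a) / max(a, b)  # (0.5 - 0.2) / 0.5 = 0.6
    print(s)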
@@ -1091,7 +1092,7 @@ cluster analysis.
   >>> from sklearn.cluster import KMeans
   >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
   >>> labels = kmeans_model.labels_
-  >>> metrics.silhouette_score(X, labels, metric='euclidean')
+  >>> metrics.silhouette_score(X, labels, metric='euclidean')
   ... # doctest: +ELLIPSIS
   0.55...
