DOC Make MeanShift documentation clearer (#25305) · scikit-learn/scikit-learn@e2318ec · GitHub

Commit e2318ec

remilvus and glemaitre authored
DOC Make MeanShift documentation clearer (#25305)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
1 parent ed9629a commit e2318ec

2 files changed, +22 -8 lines changed

doc/modules/clustering.rst

+21 -7
@@ -392,22 +392,36 @@ for centroids to be the mean of the points within a given region. These
 candidates are then filtered in a post-processing stage to eliminate
 near-duplicates to form the final set of centroids.
 
-Given a candidate centroid :math:`x_i` for iteration :math:`t`, the candidate
+The position of centroid candidates is iteratively adjusted using a technique called hill
+climbing, which finds local maxima of the estimated probability density.
+Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate
 is updated according to the following equation:
 
 .. math::
 
-    x_i^{t+1} = m(x_i^t)
+    x^{t+1} = x^t + m(x^t)
 
-Where :math:`N(x_i)` is the neighborhood of samples within a given distance
-around :math:`x_i` and :math:`m` is the *mean shift* vector that is computed for each
+Where :math:`m` is the *mean shift* vector that is computed for each
 centroid that points towards a region of the maximum increase in the density of points.
-This is computed using the following equation, effectively updating a centroid
-to be the mean of the samples within its neighborhood:
+To compute :math:`m` we define :math:`N(x)` as the neighborhood of samples within
+a given distance around :math:`x`. Then :math:`m` is computed using the following
+equation, effectively updating a centroid to be the mean of the samples within
+its neighborhood:
 
 .. math::
 
-    m(x_i) = \frac{\sum_{x_j \in N(x_i)}K(x_j - x_i)x_j}{\sum_{x_j \in N(x_i)}K(x_j - x_i)}
+    m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x
+
+In general, the equation for :math:`m` depends on a kernel used for density estimation.
+The generic formula is:
+
+.. math::
+
+    m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x
+
+In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
+equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in
+the neighborhood of :math:`x`.
 
 The algorithm automatically sets the number of clusters, instead of relying on a
 parameter ``bandwidth``, which dictates the size of the region to search through.
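
To make the revised update rule concrete, here is a minimal pure-Python sketch of the flat-kernel case described in this hunk: the mean shift vector is the mean of the neighbors minus the current position, and the candidate is moved by that vector until it stops changing. This is an illustration only, not the scikit-learn implementation. The function names, the sample data, the `<=` neighborhood test, and the stopping tolerance are all assumptions made for this example.

```python
import math


def mean_shift_vector(x, samples, bandwidth):
    """Flat-kernel m(x): mean of the samples within `bandwidth` of x, minus x."""
    # Assumption: the neighborhood N(x) includes points exactly at the radius.
    neighbors = [s for s in samples if math.dist(s, x) <= bandwidth]
    if not neighbors:
        # Assumption for this sketch: no neighbors means no movement.
        return [0.0] * len(x)
    dim = len(x)
    mean = [sum(s[d] for s in neighbors) / len(neighbors) for d in range(dim)]
    return [mean[d] - x[d] for d in range(dim)]


def hill_climb(x, samples, bandwidth, max_iter=100, tol=1e-6):
    """Repeat x^{t+1} = x^t + m(x^t) until the shift is smaller than `tol`."""
    x = list(x)
    for _ in range(max_iter):
        m = mean_shift_vector(x, samples, bandwidth)
        x = [xi + mi for xi, mi in zip(x, m)]
        if math.hypot(*m) < tol:
            break
    return x


# Hypothetical data: two well-separated groups of points.
samples = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0)]
mode = hill_climb((0.3, 0.3), samples, bandwidth=1.0)
print(mode)
```

Starting from `(0.3, 0.3)`, only the three points near the origin fall inside the neighborhood, so the candidate settles at their mean. A candidate started near `(5.0, 5.0)` would settle on the other group instead, which is how the procedure can yield several centroids without a preset cluster count.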

sklearn/cluster/_mean_shift.py

+1 -1
@@ -287,7 +287,7 @@ class MeanShift(ClusterMixin, BaseEstimator):
     Parameters
     ----------
     bandwidth : float, default=None
-        Bandwidth used in the RBF kernel.
+        Bandwidth used in the flat kernel.
 
         If not given, the bandwidth is estimated using
         sklearn.cluster.estimate_bandwidth; see the documentation for that
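
The docstring correction matters because the two kernels weight neighbors differently. The sketch below contrasts them as functions of distance. The flat kernel follows the description in the documentation change (1 inside the bandwidth, 0 outside). The Gaussian form shown for the RBF kernel is one common parameterization and is an assumption of this example, as are the function names.

```python
import math


def flat_kernel(distance, bandwidth):
    """Weight 1 inside the bandwidth radius, 0 outside (hard cutoff)."""
    return 1.0 if distance <= bandwidth else 0.0


def rbf_kernel(distance, bandwidth):
    """Gaussian weight that decays smoothly with distance (assumed form)."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


bandwidth = 2.0
for d in (0.0, 1.0, 2.0, 3.0):
    print(d, flat_kernel(d, bandwidth), round(rbf_kernel(d, bandwidth), 3))
```

With a flat kernel, every sample inside the radius contributes equally and samples outside contribute nothing, so `bandwidth` acts as a hard neighborhood size rather than a smoothing scale.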
