From 6c04aef0572a4e640bf40312affdfe24ba47b3e9 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Thu, 5 Jan 2023 15:37:21 +0100 Subject: [PATCH 1/9] Make MeanShift explanation clearer --- doc/modules/clustering.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 48a59e590a8e7..79b7d59098531 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -392,22 +392,22 @@ for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. -Given a candidate centroid :math:`x_i` for iteration :math:`t`, the candidate +Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate is updated according to the following equation: .. math:: - x_i^{t+1} = m(x_i^t) + x^{t+1} = x^t + m(x^t) -Where :math:`N(x_i)` is the neighborhood of samples within a given distance -around :math:`x_i` and :math:`m` is the *mean shift* vector that is computed for each +Where :math:`N(x)` is the neighborhood of samples within a given distance +around :math:`x` and :math:`m` is the *mean shift* vector that is computed for each centroid that points towards a region of the maximum increase in the density of points. This is computed using the following equation, effectively updating a centroid to be the mean of the samples within its neighborhood: .. math:: - m(x_i) = \frac{\sum_{x_j \in N(x_i)}K(x_j - x_i)x_j}{\sum_{x_j \in N(x_i)}K(x_j - x_i)} + m(x_i) = \frac{1}{|N(x_i)|} \sum_{x_j \in N(x_i)}x_j - x The algorithm automatically sets the number of clusters, instead of relying on a parameter ``bandwidth``, which dictates the size of the region to search through. 
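Review note on PATCH 1/9: the revised update rule, x^{t+1} = x^t + m(x^t) with m(x) being the neighborhood mean minus x, can be checked numerically with a small sketch. This is plain Python, not the code in `sklearn/cluster/_mean_shift.py`; the names `flat_kernel_shift` and `iterate_to_mode`, the toy samples, and the stopping tolerance are all assumptions made for illustration.

```python
import math


def flat_kernel_shift(x, samples, bandwidth):
    """Return m(x): the mean of the samples in N(x), minus x.

    N(x) is taken to be every sample within `bandwidth` of x
    (an assumption that matches the flat-kernel description in the patch).
    """
    neighbors = [s for s in samples if math.dist(s, x) <= bandwidth]
    if not neighbors:
        # No neighbors: leave the candidate where it is.
        return tuple(0.0 for _ in x)
    dim = len(x)
    mean = [sum(s[d] for s in neighbors) / len(neighbors) for d in range(dim)]
    return tuple(mean[d] - x[d] for d in range(dim))


def iterate_to_mode(x, samples, bandwidth, max_iter=100, tol=1e-6):
    """Apply x <- x + m(x) until the shift is smaller than `tol`."""
    for _ in range(max_iter):
        m = flat_kernel_shift(x, samples, bandwidth)
        x = tuple(xi + mi for xi, mi in zip(x, m))
        if math.hypot(*m) < tol:
            break
    return x


# Three nearby points and one far-away point (hypothetical toy data).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
first_shift = flat_kernel_shift((1.0, 1.0), samples, bandwidth=2.0)
mode = iterate_to_mode((1.0, 1.0), samples, bandwidth=2.0)
```

Starting from (1, 1) with a bandwidth of 2, the far-away sample never enters the neighborhood, so the candidate settles on the mean of the three nearby points. With the pre-patch form x^{t+1} = m(x^t) and the new definition of m, the candidate would instead be replaced by the shift vector itself, which is why the `x^t +` term is needed once `- x` is part of m.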
From 646f5b6d1116c8f530fb4eb218e5c48705d9d248 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Thu, 5 Jan 2023 15:40:59 +0100 Subject: [PATCH 2/9] Fix kernel name --- sklearn/cluster/_mean_shift.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sklearn/cluster/_mean_shift.py b/sklearn/cluster/_mean_shift.py index f886fe392f4bf..407b7b3012fa0 100644 --- a/sklearn/cluster/_mean_shift.py +++ b/sklearn/cluster/_mean_shift.py @@ -288,7 +288,7 @@ class MeanShift(ClusterMixin, BaseEstimator): Parameters ---------- bandwidth : float, default=None - Bandwidth used in the RBF kernel. + Bandwidth used in the flat kernel. If not given, the bandwidth is estimated using sklearn.cluster.estimate_bandwidth; see the documentation for that From 79f693ce18333c3641567d4c0dc9f126b1f3f267 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Thu, 5 Jan 2023 16:47:03 +0100 Subject: [PATCH 3/9] Remove indices --- doc/modules/clustering.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 79b7d59098531..23f8b601d53bf 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -407,7 +407,7 @@ to be the mean of the samples within its neighborhood: .. math:: - m(x_i) = \frac{1}{|N(x_i)|} \sum_{x_j \in N(x_i)}x_j - x + m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x The algorithm automatically sets the number of clusters, instead of relying on a parameter ``bandwidth``, which dictates the size of the region to search through. 
From 6f72fba2dc806275adac10e4343b55abf2768818 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Sun, 22 Jan 2023 10:37:07 +0100 Subject: [PATCH 4/9] Restructure paragraph --- doc/modules/clustering.rst | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 23f8b601d53bf..145aee1ce2a2d 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -399,11 +399,12 @@ is updated according to the following equation: x^{t+1} = x^t + m(x^t) -Where :math:`N(x)` is the neighborhood of samples within a given distance -around :math:`x` and :math:`m` is the *mean shift* vector that is computed for each +Where :math:`m` is the *mean shift* vector that is computed for each centroid that points towards a region of the maximum increase in the density of points. -This is computed using the following equation, effectively updating a centroid -to be the mean of the samples within its neighborhood: +To compute :math:`m` we define :math:`N(x)` as the neighborhood of samples within +a given distance around :math:`x`. Then :math:`m` is computed using the following +equation, effectively updating a centroid to be the mean of the samples within +its neighborhood: .. math:: From cf80f736e1c3ec0387b8fa1056578377ff1c26a2 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Sun, 22 Jan 2023 10:37:44 +0100 Subject: [PATCH 5/9] Add multiplication sign --- doc/modules/clustering.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 145aee1ce2a2d..2292206c445bb 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -397,7 +397,7 @@ is updated according to the following equation: .. 
math:: - x^{t+1} = x^t + m(x^t) + x^{t+1} = x^t + m * (x^t) Where :math:`m` is the *mean shift* vector that is computed for each centroid that points towards a region of the maximum increase in the density of points. From eda2948412282f9170a8e5b53a9fb4c5d8ec1386 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Sun, 22 Jan 2023 10:40:56 +0100 Subject: [PATCH 6/9] Remove multiplication sign --- doc/modules/clustering.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 2292206c445bb..145aee1ce2a2d 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -397,7 +397,7 @@ is updated according to the following equation: .. math:: - x^{t+1} = x^t + m * (x^t) + x^{t+1} = x^t + m(x^t) Where :math:`m` is the *mean shift* vector that is computed for each centroid that points towards a region of the maximum increase in the density of points. From d94756541baedb6a20b6fc7d913253bc101bc57a Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Sun, 22 Jan 2023 10:51:58 +0100 Subject: [PATCH 7/9] Mention hill climbing --- doc/modules/clustering.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 145aee1ce2a2d..72a8afcbc6be0 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -392,6 +392,8 @@ for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. +The position of centroid candidates is iteratively adjusted using a technique called hill +climbing, which finds local maxima of the estimated probability density. 
Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate is updated according to the following equation: From 5c1944e4b3428275eb14dbe874009fcd3108ba73 Mon Sep 17 00:00:00 2001 From: Adam Kania <48769688+remilvus@users.noreply.github.com> Date: Sun, 22 Jan 2023 11:05:29 +0100 Subject: [PATCH 8/9] Add kernel explanation --- doc/modules/clustering.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 72a8afcbc6be0..b677fb33fc2f2 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -412,6 +412,17 @@ its neighborhood: m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x +In general, the equation for :math:`m` depends on a kernel used for density estimation. +The generic formula is: + +.. math:: + + m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x + +In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is +equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in +the neighborhood of :math:`x`. + The algorithm automatically sets the number of clusters, instead of relying on a parameter ``bandwidth``, which dictates the size of the region to search through. This parameter can be set manually, but can be estimated using the provided From 9a5ed894bdca0b55fc2c83431a252387c71c6ff2 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre Date: Mon, 23 Jan 2023 18:22:03 +0100 Subject: [PATCH 9/9] Update doc/modules/clustering.rst --- doc/modules/clustering.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index b677fb33fc2f2..775f8c8180d14 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -410,14 +410,14 @@ its neighborhood: .. 
math:: - m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x + m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x In general, the equation for :math:`m` depends on a kernel used for density estimation. The generic formula is: .. math:: - m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x + m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in