auto clusters selection of n_clusters with elbow method · Issue #30972 · scikit-learn/scikit-learn · GitHub

Closed
dineshreddypaidi opened this issue Mar 11, 2025 · 2 comments
Labels
Needs Triage Issue requires triage New Feature

Comments

@dineshreddypaidi

Describe the workflow you want to enable

In sklearn.cluster, for the KMeans algorithm, the feature suggestion is to add elbow-method cluster selection via n_clusters="auto".

With "auto" as the keyword in KMeans, it would calculate the best number of clusters based on the mean squared error and then train the model on the value returned by auto_cluster_selection().

Describe your proposed solution

Create a private method in KMeans that calculates the best number of clusters automatically when n_clusters="auto" is passed.
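The idea above can be sketched as a small standalone helper. This is a minimal sketch, assuming the common "largest second difference of inertia" formulation of the elbow heuristic; the function name select_k_elbow is illustrative, not existing scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def select_k_elbow(X, k_max=10, random_state=None):
    """Pick k where the inertia curve bends most sharply (elbow heuristic)."""
    # Fit KMeans for k = 1..k_max and record each model's inertia.
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in range(1, k_max + 1)
    ]
    # The second difference of the inertias approximates curvature; its
    # maximum marks the elbow. Index i of the second difference maps to k = i + 2.
    curvature = np.diff(inertias, n=2)
    if len(curvature) == 0:
        return 1  # fewer than 3 candidate values of k: no curvature to measure
    return int(np.argmax(curvature)) + 2

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = select_k_elbow(X, k_max=10, random_state=42)
```

The heuristic only sees relative drops in inertia, so the choice of k_max bounds the answer and degenerate inputs (k_max < 3) fall back to a single cluster.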

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@CloudPiyush
CloudPiyush commented Mar 13, 2025

Hi @dineshreddypaidi ,

I’ve worked on a potential solution for issue #30972 to add an n_clusters="auto" option to sklearn.cluster.KMeans using the elbow method for automatic cluster selection. This addresses the need to determine the optimal number of clusters based on inertia without manual tuning. Here’s my proposal:

Solution Description

  • Adds a private method _auto_cluster_selection() to calculate the best n_clusters using the elbow method (based on the second derivative of inertia).
  • Extends KMeans.fit() to use this method when n_clusters="auto".
  • Includes a max_auto_clusters parameter (default: 10) to limit the range of clusters tested.
  • Switches algorithm="lloyd" to address the deprecated algorithm="auto" warning (future-proof for scikit-learn 1.3+).
import numpy as np
from sklearn.cluster import KMeans as KMeansBase
from sklearn.utils import check_array

class KMeans(KMeansBase):
    def __init__(self, n_clusters=8, *, max_auto_clusters=10, init='k-means++',
                 n_init=10, max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm='lloyd'):
        super().__init__(
            n_clusters=n_clusters, init=init, n_init=n_init, max_iter=max_iter,
            tol=tol, verbose=verbose, random_state=random_state, copy_x=copy_x,
            algorithm=algorithm
        )
        self.max_auto_clusters = max_auto_clusters

    def _auto_cluster_selection(self, X):
        # Fit KMeans for k = 1..max_auto_clusters and record each model's inertia.
        inertias = []
        cluster_range = range(1, self.max_auto_clusters + 1)
        for k in cluster_range:
            model = KMeansBase(
                n_clusters=k, init=self.init, n_init=self.n_init,
                max_iter=self.max_iter, tol=self.tol, verbose=self.verbose,
                random_state=self.random_state, copy_x=self.copy_x,
                algorithm=self.algorithm
            )
            model.fit(X)
            inertias.append(model.inertia_)
        # The elbow is where the inertia curve bends most sharply, i.e. where
        # the second difference of the inertias is largest.
        diffs = np.diff(inertias)
        diffs2 = np.diff(diffs)
        if len(diffs2) == 0:
            # Fewer than 3 candidate values of k: no curvature to measure.
            return 1
        # diffs2[i] measures curvature at k = i + 2, which is cluster_range[i + 1].
        optimal_k = cluster_range[np.argmax(diffs2) + 1]
        return optimal_k

    def fit(self, X, y=None, sample_weight=None):
        X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
        if isinstance(self.n_clusters, str) and self.n_clusters == "auto":
            if self.max_auto_clusters < 1:
                raise ValueError("max_auto_clusters must be positive")
            self.n_clusters = self._auto_cluster_selection(X)
            if self.verbose:
                print(f"Auto-selected {self.n_clusters} clusters using elbow method")
        return super().fit(X, y, sample_weight)

# Example usage:
# from sklearn.datasets import make_blobs
# X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# model = KMeans(n_clusters="auto", max_auto_clusters=10, random_state=42)
# model.fit(X)
# print(f"Optimal clusters: {model.n_clusters}, Inertia: {model.inertia_}")

Before submitting, I’d love your feedback on:

  1. Is the elbow method the right default approach, or should we consider alternatives like silhouette score?
  2. Should max_auto_clusters (the upper limit for testing) be configurable, and if so, what’s a sensible default?
  3. Any additional edge cases or tests I should account for?
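For reference, the silhouette-score alternative mentioned in question 1 could look roughly like this. A minimal sketch; the helper name select_k_silhouette is hypothetical, and note silhouette_score requires at least 2 clusters, so this variant can never return k=1:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def select_k_silhouette(X, k_max=10, random_state=None):
    """Pick the k (>= 2) whose labeling maximizes the mean silhouette score."""
    scores = {}
    for k in range(2, k_max + 1):  # silhouette is undefined for a single cluster
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Higher mean silhouette means tighter, better-separated clusters.
    return max(scores, key=scores.get)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = select_k_silhouette(X, k_max=8, random_state=42)
```

Compared with the elbow heuristic, the silhouette criterion is an absolute quality score rather than a curvature test, at the cost of an O(n²) pairwise-distance computation per candidate k.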

Looking forward to your thoughts! I’m happy to refine this based on your input to align with scikit-learn’s standards.

Thanks,
Piyush Patil

@lesteve
Member
lesteve commented Mar 14, 2025

Thanks for the issue! Honestly this is unlikely to be a high priority for us in the foreseeable future, so I am going to close this one.

See for example #6948 where this kind of things were discussed.

@lesteve closed this as not planned Mar 14, 2025