auto clusters selection of n_clusters with elbow method · Issue #30972 · scikit-learn/scikit-learn · GitHub

Closed
dineshreddypaidi opened this issue Mar 11, 2025 · 2 comments
Labels
Needs Triage Issue requires triage New Feature

Comments

@dineshreddypaidi

Describe the workflow you want to enable

In sklearn.cluster, for the KMeans algorithm, the feature suggestion is to add elbow-method cluster selection via n_clusters="auto".

With "auto" as the keyword in KMeans, it would calculate the best number of clusters based on the mean squared error and then train the model on the value returned by auto_cluster_selection().

Describe your proposed solution

Create a private method in KMeans that calculates the best number of clusters automatically when n_clusters="auto" is passed.
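The idea above can be sketched as a small standalone helper. This is a minimal sketch, assuming the common "largest second difference of inertia" formulation of the elbow heuristic; the function name select_k_elbow is illustrative, not existing scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def select_k_elbow(X, k_max=10, random_state=None):
    """Pick k where the inertia curve bends most sharply (elbow heuristic)."""
    # Fit KMeans for k = 1..k_max and record each model's inertia.
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
        for k in range(1, k_max + 1)
    ]
    # The second difference of the inertias approximates curvature; its
    # maximum marks the elbow. Index i of the second difference maps to k = i + 2.
    curvature = np.diff(inertias, n=2)
    if len(curvature) == 0:
        return 1  # fewer than 3 candidate values of k: no curvature to measure
    return int(np.argmax(curvature)) + 2

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = select_k_elbow(X, k_max=10, random_state=42)
```

The heuristic only sees relative drops in inertia, so the choice of k_max bounds the answer and degenerate inputs (k_max < 3) fall back to a single cluster.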

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@CloudPiyush
CloudPiyush commented Mar 13, 2025

Hi @dineshreddypaidi ,

I’ve worked on a potential solution for issue #30972 to add an n_clusters="auto" option to sklearn.cluster.KMeans using the elbow method for automatic cluster selection. This addresses the need to determine the optimal number of clusters based on inertia without manual tuning. Here’s my proposal:

Solution Description

  • Adds a private method _auto_cluster_selection() to calculate the best n_clusters using the elbow method (based on the second derivative of inertia).
  • Extends KMeans.fit() to use this method when n_clusters="auto".
  • Includes a max_auto_clusters parameter (default: 10) to limit the range of clusters tested.
  • Switches algorithm="lloyd" to address the deprecated algorithm="auto" warning (future-proof for scikit-learn 1.3+).
import numpy as np
from sklearn.cluster import KMeans as KMeansBase
from sklearn.utils import check_array

class KMeans(KMeansBase):
    def __init__(self, n_clusters=8, *, max_auto_clusters=10, init='k-means++',
                 n_init=10, max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm='lloyd'):
        super().__init__(
            n_clusters=n_clusters, init=init, n_init=n_init, max_iter=max_iter,
            tol=tol, verbose=verbose, random_state=random_state, copy_x=copy_x,
            algorithm=algorithm
        )
        self.max_auto_clusters = max_auto_clusters

    def _auto_cluster_selection(self, X):
        # Fit KMeans for k = 1..max_auto_clusters and record each model's inertia.
        inertias = []
        cluster_range = range(1, self.max_auto_clusters + 1)
        for k in cluster_range:
            model = KMeansBase(
                n_clusters=k, init=self.init, n_init=self.n_init,
                max_iter=self.max_iter, tol=self.tol, verbose=self.verbose,
                random_state=self.random_state, copy_x=self.copy_x,
                algorithm=self.algorithm
            )
            model.fit(X)
            inertias.append(model.inertia_)
        # The elbow is where the inertia curve bends most sharply, i.e. where
        # the second difference of the inertias is largest.
        diffs = np.diff(inertias)
        diffs2 = np.diff(diffs)
        if len(diffs2) == 0:
            # Fewer than 3 candidate values of k: no curvature to measure.
            return 1
        # diffs2[i] measures curvature at k = i + 2, which is cluster_range[i + 1].
        optimal_k = cluster_range[np.argmax(diffs2) + 1]
        return optimal_k

    def fit(self, X, y=None, sample_weight=None):
        X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
        if isinstance(self.n_clusters, str) and self.n_clusters == "auto":
            if self.max_auto_clusters < 1:
                raise ValueError("max_auto_clusters must be positive")
            self.n_clusters = self._auto_cluster_selection(X)
            if self.verbose:
                print(f"Auto-selected {self.n_clusters} clusters using elbow method")
        return super().fit(X, y, sample_weight)

# Example usage:
# from sklearn.datasets import make_blobs
# X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# model = KMeans(n_clusters="auto", max_auto_clusters=10, random_state=42)
# model.fit(X)
# print(f"Optimal clusters: {model.n_clusters}, Inertia: {model.inertia_}")

Before submitting, I’d love your feedback on:

  1. Is the elbow method the right default approach, or should we consider alternatives like silhouette score?
  2. Should max_auto_clusters (the upper limit for testing) be configurable, and if so, what’s a sensible default?
  3. Any additional edge cases or tests I should account for?
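For reference, the silhouette-score alternative mentioned in question 1 could look roughly like this. A minimal sketch; the helper name select_k_silhouette is hypothetical, and note silhouette_score requires at least 2 clusters, so this variant can never return k=1:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def select_k_silhouette(X, k_max=10, random_state=None):
    """Pick the k (>= 2) whose labeling maximizes the mean silhouette score."""
    scores = {}
    for k in range(2, k_max + 1):  # silhouette is undefined for a single cluster
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Higher mean silhouette means tighter, better-separated clusters.
    return max(scores, key=scores.get)

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
k = select_k_silhouette(X, k_max=8, random_state=42)
```

Compared with the elbow heuristic, the silhouette criterion is an absolute quality score rather than a curvature test, at the cost of an O(n²) pairwise-distance computation per candidate k.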

Looking forward to your thoughts! I’m happy to refine this based on your input to align with scikit-learn’s standards.

Thanks,
Piyush Patil

@lesteve
Member
lesteve commented Mar 14, 2025

Thanks for the issue! Honestly this is unlikely to be a high priority for us in the foreseeable future, so I am going to close this one.

See for example #6948 where this kind of things were discussed.

@lesteve closed this as not planned Mar 14, 2025