auto clusters selection of n_clusters with elbow method #30972
Comments
Hi @dineshreddypaidi, I've worked on a potential solution for issue #30972 to add an n_clusters="auto" option that selects the number of clusters with the elbow method.

Solution description:
import numpy as np

from sklearn.cluster import KMeans as KMeansBase
from sklearn.utils import check_array


class KMeans(KMeansBase):
    def __init__(self, n_clusters=8, *, max_auto_clusters=10, init='k-means++',
                 n_init=10, max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm='lloyd'):
        super().__init__(
            n_clusters=n_clusters, init=init, n_init=n_init, max_iter=max_iter,
            tol=tol, verbose=verbose, random_state=random_state, copy_x=copy_x,
            algorithm=algorithm
        )
        self.max_auto_clusters = max_auto_clusters

    def _auto_cluster_selection(self, X):
        # Fit a plain KMeans for each candidate k and record its inertia
        # (within-cluster sum of squared distances).
        inertias = []
        cluster_range = range(1, self.max_auto_clusters + 1)
        for k in cluster_range:
            model = KMeansBase(
                n_clusters=k, init=self.init, n_init=self.n_init,
                max_iter=self.max_iter, tol=self.tol, verbose=self.verbose,
                random_state=self.random_state, copy_x=self.copy_x,
                algorithm=self.algorithm
            )
            model.fit(X)
            inertias.append(model.inertia_)
        # Elbow heuristic: the second differences of the inertia curve peak
        # where the decrease in inertia slows down the most.
        diffs = np.diff(inertias)
        diffs2 = np.diff(diffs)
        if len(diffs2) == 0:
            # Fewer than three candidate values of k: fall back to k = 1.
            return 1
        # diffs2[j] corresponds to k = j + 2, i.e. cluster_range[j + 1].
        optimal_k = cluster_range[np.argmax(diffs2) + 1]
        return optimal_k

    def fit(self, X, y=None, sample_weight=None):
        X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
        if isinstance(self.n_clusters, str) and self.n_clusters == "auto":
            if self.max_auto_clusters < 1:
                raise ValueError("max_auto_clusters must be positive")
            self.n_clusters = self._auto_cluster_selection(X)
            if self.verbose:
                print(f"Auto-selected {self.n_clusters} clusters using elbow method")
        return super().fit(X, y, sample_weight)
# Example usage:
# from sklearn.datasets import make_blobs
# X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# model = KMeans(n_clusters="auto", max_auto_clusters=10, random_state=42)
# model.fit(X)
# print(f"Optimal clusters: {model.n_clusters}, Inertia: {model.inertia_}")

Before submitting, I'd love your feedback on:
Looking forward to your thoughts! I'm happy to refine this based on your input to align with scikit-learn's standards. Thanks,
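As a quick sanity check of the heuristic above, here is a minimal sketch (the inertia values are hypothetical, chosen only for illustration) showing how np.argmax(diffs2) + 1 maps onto a value of k:

import numpy as np

# Hypothetical inertias for k = 1..6: a sharp drop until k = 3, then a plateau.
inertias = [1000, 600, 250, 240, 235, 232]
cluster_range = range(1, len(inertias) + 1)

diffs = np.diff(inertias)   # [-400, -350, -10, -5, -3]
diffs2 = np.diff(diffs)     # [  50,  340,   5,  2]

# diffs2[j] compares consecutive drops; its peak marks where improvement slows most.
optimal_k = cluster_range[np.argmax(diffs2) + 1]
print(optimal_k)  # 3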
Thanks for the issue! Honestly, this is unlikely to be a high priority for us in the foreseeable future, so I am going to close this one. See for example #6948, where this kind of thing was discussed.
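For anyone landing here: the same elbow-style selection can be done in user code with the released KMeans, without any API change. A minimal sketch, assuming the same second-difference heuristic as the proposal above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit KMeans for each candidate k and collect the inertias.
max_k = 10
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X).inertia_
    for k in range(1, max_k + 1)
]

# Pick the k where the inertia curve bends the most (largest second difference).
diffs2 = np.diff(inertias, n=2)
best_k = int(np.argmax(diffs2)) + 2  # diffs2[j] corresponds to k = j + 2

final_model = KMeans(n_clusters=best_k, random_state=42, n_init=10).fit(X)
print(best_k, final_model.inertia_)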
Describe the workflow you want to enable
In sklearn.cluster, for the KMeans algorithm, the suggestion is to add elbow-method cluster selection via n_clusters="auto": compute the best number of clusters based on inertia (the within-cluster sum of squared errors), then train the model with the number of clusters returned by auto_cluster_selection(), with "auto" accepted as a keyword value for n_clusters in KMeans.
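A minimal sketch of the workflow being requested (note that n_clusters="auto" is the proposed addition, not part of the current scikit-learn API):

from sklearn.cluster import KMeans  # assumes the proposed "auto" support
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# The user no longer specifies k; the estimator selects it via the elbow method.
model = KMeans(n_clusters="auto").fit(X)
print(model.n_clusters)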
Describe your proposed solution
Create a private method in KMeans that calculates the best number of clusters automatically when n_clusters="auto" is passed.
Describe alternatives you've considered, if relevant
No response
Additional context
No response