Description
When using BisectingKMeans with n_clusters greater than 2 and a callable init to specify cluster seeds, a ValueError is raised if the callable returns fewer than n_clusters seeds, even though the value of n_clusters passed to init is (correctly, for bisecting k-means) always 2. The returned array must have as many rows as the "outer" n_clusters to avoid the ValueError, and changing the additional cluster seeds gives a different clustering result.
In essence: is it intended behaviour for a bisecting method to require more than 2 initial cluster seeds per bisection, and if so, why do the extra seeds change the result? Why do some seed orderings change the result while others do not? (My basic testing is not exhaustive.)
In my case, I was running bisecting k-means clustering on image data. Hence, in my examples below I will use Scikit-Image data for similarity.
For example, this fails:
Calling `init` raises a ValueError
import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans

def bk_init(X, n_clusters, random_state):
    ## note this is always 2:
    print(n_clusters)
    return np.array(
        [[0], [len(X)]],
        dtype=np.uint,
    )

k = 6
img = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)
bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)

with a ValueError:
Traceback (most recent call last):
  File "<python-input-5>", line 2, in <module>
    bk.fit(img_d)
    ~~~~~~^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1336, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 430, in fit
    self._bisect(X, x_squared_norms, sample_weight, cluster_to_bisect)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 322, in _bisect
    centers_init = self._init_centroids(
        X,
    ...<4 lines>...
        sample_weight=sample_weight,
    )
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 1040, in _init_centroids
    self._validate_center_shape(X, centers)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 939, in _validate_center_shape
    raise ValueError(
    ...<2 lines>...
    )
ValueError: The shape of the initial centers (2, 1) does not match the number of clusters 6.
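Reading the traceback, the callable is asked for 2 seeds, yet the array it returns appears to be checked against the estimator's own n_clusters (6 here). The sketch below reproduces only that mismatch in isolation. `check_center_shape` is a hypothetical stand-in written for illustration, not scikit-learn's actual code, and NumPy is the only dependency.

```python
import numpy as np


def check_center_shape(centers, expected_n_clusters):
    """Hypothetical stand-in for the shape check seen in the traceback."""
    if centers.shape[0] != expected_n_clusters:
        raise ValueError(
            f"The shape of the initial centers {centers.shape} does not "
            f"match the number of clusters {expected_n_clusters}."
        )


def bk_init(X, n_clusters, random_state):
    # Returns exactly as many seeds as it was asked for.
    return np.array([[0], [len(X)]], dtype=np.uint)[:n_clusters]


X = np.zeros((10, 1))
# The init callable is asked for 2 seeds, as in each bisection.
centers = bk_init(X, n_clusters=2, random_state=None)

# Checking against the per-bisection value passes silently.
check_center_shape(centers, expected_n_clusters=2)

# Checking against the "outer" value reproduces the reported message.
try:
    check_center_shape(centers, expected_n_clusters=6)
except ValueError as exc:
    message = str(exc)
    print(message)
```

If the check used the per-bisection value, a two-row return would be accepted; whether that is the intended contract is the question this report raises.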
A work-around is to encapsulate the "outer" n_clusters in the init function:
Capturing outer `n_clusters`
import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans

k = 6

def bk_init(X, n_clusters, random_state):
    ## this is still 2:
    print(n_clusters)
    i = np.zeros(
        (k, 1),  # NOTE not ``n_clusters``!
        dtype=np.uint,
    )
    i[0], i[1] = (0, len(X))
    return i

img = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)
bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)

Also see _BisectingTree.split, which appears to use only indices 0 and 1 (though I may be misinterpreting it). Something else is clearly at play, however, because changing the ordering of the remaining seeds changes the result in some, but not all, cases (though this doesn't cover every permutation):
Different results from different seeds
print(bk.cluster_centers_)  # previous result

[[156.        ]
 [159.        ]
 [174.86169245]
 [ 12.        ]
 [ 10.        ]
 [  0.5       ]]

def bk_init(X, n_clusters, random_state):
    return np.array(
        ## different results to the below and to each other:
        # [[0], [64], [729], [5555], [1], [len(X)]],
        # [[0], [1], [len(X)], [64], [729], [5555]],
        ## identical results (yes, even the first two):
        # [[0], [729], [5555], [1], [len(X)], [64]],
        # [[0], [5555], [1], [len(X)], [64], [729]],
        # [[0], [len(X)], [64], [729], [5555], [1]],
        # [[0], [len(X)], [729], [5555], [1], [64]],
        # [[0], [len(X)], [5555], [1], [64], [729]],
        # [[0], [len(X)], [1], [64], [729], [5555]],
        [[0], [len(X)], [64], [729], [5555], [1]],
        dtype=np.uint,
    )

bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)
print(bk.cluster_centers_)

[[127.49822064]
 [126.        ]
 [116.        ]
 [112.        ]
 [ 11.96384212]
 [  0.5       ]]
(I'd like to run a test using itertools.permutations.)
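A minimal sketch of such a permutation test, under stated assumptions: `make_init`, `group_orderings`, and `fit_bisecting` are hypothetical helper names introduced here, and two fits are treated as identical when their sorted, rounded cluster centers match, which ignores row order. scikit-learn is imported lazily so the grouping logic can be exercised on its own.

```python
from itertools import permutations

import numpy as np


def make_init(seeds):
    """Build an init callable that always returns the given seeds, in order."""
    centers = np.array([[s] for s in seeds], dtype=np.uint)

    def init(X, n_clusters, random_state):
        # n_clusters is ignored on purpose: as in the work-around, the
        # number of rows is fixed by the seeds captured from outside.
        return centers.copy()

    return init


def group_orderings(seed_values, fit_centers):
    """Map each distinct result to the seed orderings that produced it."""
    groups = {}
    for order in permutations(seed_values):
        centers = np.asarray(fit_centers(make_init(order)), dtype=float)
        # Assumption: results are "the same" if the sorted centers agree
        # to 6 decimals, regardless of row order.
        key = tuple(np.round(np.sort(centers.ravel()), 6).tolist())
        groups.setdefault(key, []).append(order)
    return groups


def fit_bisecting(X, k):
    """Return a callable that fits BisectingKMeans with a given init."""
    from sklearn.cluster import BisectingKMeans  # lazy import

    def fit_centers(init):
        return BisectingKMeans(n_clusters=k, init=init).fit(X).cluster_centers_

    return fit_centers
```

With the data from the examples above, `group_orderings([0, 1, 64, 729, 5555, len(img_d)], fit_bisecting(img_d, k=6))` would run 720 fits, so subsampling the image first may be worthwhile. The number of keys in the returned dict is the number of distinct results, and each value lists the orderings that led to it.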
Python and package versions
uv version: uv 0.9.26 (Homebrew 2026-01-15)
uv run python --version: Python 3.13.11
pyproject.toml, dependencies:
  matplotlib>=3.10.8
  multiprocess>=0.70.18
  numpy>=2.3.5
  scikit-image>=0.25.2
  scikit-learn>=1.8.0
  scipy>=1.16.3
uv run python -c 'import sklearn; sklearn.show_versions()':
System:
  python: 3.13.11 (main, Dec 5 2025, 16:06:33) [Clang 17.0.0 (clang-1700.6.3.2)]
  executable: /<masked>/.venv/bin/python3
  machine: macOS-15.7.3-arm64-arm-64bit-Mach-O
Python dependencies:
  sklearn: 1.8.0
  pip: None
  setuptools: None
  numpy: 2.3.5
  scipy: 1.16.3
  Cython: None
  pandas: None
  matplotlib: 3.10.8
  joblib: 1.5.3
  threadpoolctl: 3.6.0
Built with OpenMP: True
threadpoolctl info:
  user_api: openmp
  internal_api: openmp
  num_threads: 10
  prefix: libomp
  filepath: /<masked>/.venv/lib/python3.13/site-packages/sklearn/.dylibs/libomp.dylib
  version: None