Using BisectingKMeans with init raises a possibly dubious ValueError

When BisectingKMeans is used with n_clusters greater than 2 and a callable init that supplies the cluster seeds, a ValueError is raised if the callable returns fewer than n_clusters seeds, even though the value of n_clusters passed to init is (correctly, for bisecting k-means) always 2. The returned array must have as many rows as the "outer" n_clusters to avoid the ValueError, and changing the additional cluster seeds gives a different clustering result.

In essence: is it intended behaviour for a bisecting method to require more than 2 initial cluster seeds per bisection, and if so, why does it change the result? Why do some seed orderings not change the result, and some do? (My basic testing is not exhaustive.)

In my case, I was running bisecting k-means clustering on image data, so the examples below use scikit-image sample data to stay close to that use case.

For example, this fails:

Calling `init` raises a `ValueError`

import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans


def bk_init(X, n_clusters, random_state):
    ## note this is always 2:
    print(n_clusters)

    return np.array(
        [[0], [len(X)]],
        dtype=np.uint,
    )


k = 6

img   = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)

bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)

with a ValueError:

`init` -> `ValueError` traceback
Traceback (most recent call last):
  File "<python-input-5>", line 2, in <module>
    bk.fit(img_d)
    ~~~~~~^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1336, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 430, in fit
    self._bisect(X, x_squared_norms, sample_weight, cluster_to_bisect)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 322, in _bisect
    centers_init = self._init_centroids(
        X,
    ...<4 lines>...
        sample_weight=sample_weight,
    )
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 1040, in _init_centroids
    self._validate_center_shape(X, centers)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 939, in _validate_center_shape
    raise ValueError(
    ...<2 lines>...
    )
ValueError: The shape of the initial centers (2, 1) does not match the number of clusters 6.
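To make the suspected mismatch concrete, here is a toy model of what the traceback seems to suggest: the callable is asked for 2 centers, but the shape check compares against the estimator-level n_clusters. This is not scikit-learn's code; `toy_init_centroids` and `two_seed_init` are hypothetical names, and the real implementation may differ.

```python
def toy_init_centroids(init, X, outer_n_clusters, n_centroids=2):
    """Toy stand-in for the call chain in the traceback (NOT sklearn code)."""
    # The callable is asked for the per-bisection count (2 in this report) ...
    centers = init(X, n_centroids, None)
    # ... but the shape check uses the estimator-level ("outer") n_clusters.
    if len(centers) != outer_n_clusters:
        raise ValueError(
            f"The shape of the initial centers ({len(centers)}, {len(centers[0])}) "
            f"does not match the number of clusters {outer_n_clusters}."
        )
    return centers


def two_seed_init(X, n_clusters, random_state):
    # Same two seeds as in the failing example, as plain lists.
    return [[0], [len(X)]]


try:
    toy_init_centroids(two_seed_init, X=[[1], [2], [3]], outer_n_clusters=6)
except ValueError as exc:
    print(exc)
```

If this model is roughly right, any callable that honours the n_clusters argument it receives will fail whenever the outer n_clusters is not 2.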

A work-around is to encapsulate the "outer" n_clusters in the init function:

Capturing outer `n_clusters`

import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans


k = 6


def bk_init(X, n_clusters, random_state):
    ## this is still 2:
    print(n_clusters)

    i = np.zeros(
        (k, 1),  # NOTE not ``n_clusters``!
        dtype=np.uint,
    )
    i[0], i[1] = (0, len(X))

    return i


img   = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)

bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)
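The same work-around can be packaged as a small factory, so that the outer cluster count is passed explicitly instead of being read from a global. This is only a sketch: `pad_seeds` and `make_padded_init` are hypothetical helpers, not scikit-learn API, and padding with zeros is an arbitrary choice. Since the extra seeds can change the clustering, this avoids the ValueError without answering whether the behaviour is intended.

```python
def pad_seeds(seeds, outer_n_clusters, fill=0.0):
    """Pad a short list of seed rows up to `outer_n_clusters` rows."""
    rows = [list(row) for row in seeds]
    if not rows:
        raise ValueError("at least one seed row is required")
    if len(rows) > outer_n_clusters:
        raise ValueError("more seeds than outer_n_clusters")
    n_features = len(rows[0])
    # Assumption: the filler value is arbitrary; it may still affect the result.
    padding = [[fill] * n_features for _ in range(outer_n_clusters - len(rows))]
    return rows + padding


def make_padded_init(two_seed_init, outer_n_clusters, fill=0.0):
    """Wrap an init callable that returns 2 seeds so it returns `outer_n_clusters` rows."""

    def init(X, n_clusters, random_state):
        import numpy as np  # imported lazily so pad_seeds stays dependency-free

        seeds = two_seed_init(X, n_clusters, random_state)
        return np.asarray(pad_seeds(seeds, outer_n_clusters, fill), dtype=float)

    return init
```

With the names from the example above, this would be used as `BisectingKMeans(n_clusters=k, init=make_padded_init(bk_init, k))`, where `bk_init` is the two-seed version from the failing example.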

Also see _BisectingTree.split, which appears to use only the centers at indices 0 and 1 (though I may be misreading it). Clearly something else is at play, though, because changing the order of the remaining seeds changes the result in some, but not all, cases (I have not tried every permutation):

Different results from different seeds

print(bk.cluster_centers_) # previous result

[[156.        ]
 [159.        ]
 [174.86169245]
 [ 12.        ]
 [ 10.        ]
 [  0.5       ]]

def bk_init(X, n_clusters, random_state):
    return np.array(
        ## different results to the below and to each other:
        # [[0], [64], [729], [5555], [1], [len(X)]],
        # [[0], [1], [len(X)], [64], [729], [5555]],
        ## identical results (yes, even the first two):
        # [[0], [729], [5555], [1], [len(X)], [64]],
        # [[0], [5555], [1], [len(X)], [64], [729]],
        # [[0], [len(X)], [64], [729], [5555], [1]],
        # [[0], [len(X)], [729], [5555], [1], [64]],
        # [[0], [len(X)], [5555], [1], [64], [729]],
        # [[0], [len(X)], [1], [64], [729], [5555]],
        [[0], [len(X)], [64], [729], [5555], [1]],
        dtype=np.uint,
    )


bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)
print(bk.cluster_centers_)

[[127.49822064]
 [126.        ]
 [116.        ]
 [112.        ]
 [ 11.96384212]
 [  0.5       ]]

(I'd like to run a test using itertools.permutations.)
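A possible shape for that test, as a sketch only: enumerate every ordering of the seeds with itertools.permutations, fit once per ordering, and group the orderings by the centers they produce. The names `group_orderings_by_result` and `fit_centers_sklearn` are hypothetical. Sorting and rounding the centers before comparing is an assumption about what should count as "the same result".

```python
from collections import defaultdict
from itertools import permutations


def group_orderings_by_result(seeds, fit_centers, decimals=6):
    """Map each distinct set of fitted centers to the seed orderings that produced it."""
    groups = defaultdict(list)
    for ordering in permutations(seeds):
        centers = fit_centers(ordering)
        # Assumption: sort and round so that label order and float noise are ignored.
        key = tuple(sorted(round(float(c), decimals) for c in centers))
        groups[key].append(ordering)
    return dict(groups)


def fit_centers_sklearn(X, ordering):
    """Fit BisectingKMeans on 1-feature data X, seeded with `ordering`."""
    import numpy as np
    from sklearn.cluster import BisectingKMeans

    def init(X_sub, n_clusters, random_state):
        # Return one row per outer cluster, as the work-around requires.
        return np.asarray(ordering, dtype=float).reshape(-1, 1)

    bk = BisectingKMeans(n_clusters=len(ordering), init=init, random_state=0)
    bk.fit(X)
    return bk.cluster_centers_.ravel()
```

With the data from the examples above, `group_orderings_by_result([0, 64, 729, 5555, 1, len(img_d)], lambda o: fit_centers_sklearn(img_d, o))` would run 720 fits, so subsampling the image first may be worthwhile. Note that this passes a fixed `len(img_d)` as a seed value, whereas the examples above compute `len(X)` inside the callable.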

Python and package versions

uv version: uv 0.9.26 (Homebrew 2026-01-15)

uv run python --version: Python 3.13.11

pyproject.toml, dependencies:

matplotlib>=3.10.8
multiprocess>=0.70.18
numpy>=2.3.5
scikit-image>=0.25.2
scikit-learn>=1.8.0
scipy>=1.16.3

uv run python -c 'import sklearn; sklearn.show_versions()':
System:
    python: 3.13.11 (main, Dec 5 2025, 16:06:33) [Clang 17.0.0 (clang-1700.6.3.2)]
executable: /<masked>/.venv/bin/python3
   machine: macOS-15.7.3-arm64-arm-64bit-Mach-O

Python dependencies:
      sklearn: 1.8.0
          pip: None
   setuptools: None
        numpy: 2.3.5
        scipy: 1.16.3
       Cython: None
       pandas: None
   matplotlib: 3.10.8
       joblib: 1.5.3
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 10
         prefix: libomp
       filepath: /<masked>/.venv/lib/python3.13/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

Using `BisectingKMeans` with `init` raises a possibly dubious `ValueError` #33146
