Using `BisectingKMeans` with `init` raises a possibly dubious `ValueError` · Issue #33146 · scikit-learn/scikit-learn · GitHub
@oryktos

When using BisectingKMeans with n_clusters greater than 2 and a callable init to specify cluster seeds, a ValueError is raised if the callable returns fewer than n_clusters seeds, even though the value of n_clusters passed to init is (correctly, for bisecting k-means) always 2. The returned array must have as many rows as the "outer" n_clusters to avoid the ValueError, and changing the extra seeds gives a different clustering result.

In essence: is it intended behaviour for a bisecting method to require more than 2 initial cluster seeds per bisection, and if so, why does it change the result? Why do some seed orderings not change the result, and some do? (My basic testing is not exhaustive.)


In my case, I was running bisecting k-means clustering on image data, so the examples below use scikit-image sample data for similarity.

For example, this fails:

Calling `init` raises a `ValueError`
import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans


def bk_init(X, n_clusters, random_state):
    ## note this is always 2:
    print(n_clusters)

    return np.array(
        [[0], [len(X)]],
        dtype=np.uint,
    )


k = 6

img   = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)

bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)

with a ValueError:

`init` -> `ValueError` traceback
Traceback (most recent call last):
  File "<python-input-5>", line 2, in <module>
    bk.fit(img_d)
    ~~~~~~^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/base.py", line 1336, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 430, in fit
    self._bisect(X, x_squared_norms, sample_weight, cluster_to_bisect)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_bisect_k_means.py", line 322, in _bisect
    centers_init = self._init_centroids(
        X,
    ...<4 lines>...
        sample_weight=sample_weight,
    )
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 1040, in _init_centroids
    self._validate_center_shape(X, centers)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/<masked>/.venv/lib/python3.13/site-packages/sklearn/cluster/_kmeans.py", line 939, in _validate_center_shape
    raise ValueError(
    ...<2 lines>...
    )
ValueError: The shape of the initial centers (2, 1) does not match the number of clusters 6.
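The mismatch described above can be modelled in a few lines. This is a toy sketch, not scikit-learn's code: `toy_validate` and `toy_bisect_init` are hypothetical names, and the sketch assumes (as the error message suggests) that the returned seeds are checked against the estimator's own n_clusters rather than the 2 that was passed to init.

```python
def toy_validate(centers, outer_n_clusters):
    """Toy stand-in for the shape check named in the traceback (not scikit-learn code)."""
    if len(centers) != outer_n_clusters:
        raise ValueError(
            f"The shape of the initial centers ({len(centers)}, {len(centers[0])}) "
            f"does not match the number of clusters {outer_n_clusters}."
        )
    return centers


def toy_bisect_init(init, X, outer_n_clusters):
    """Call `init` the way the issue observes it: always asking for 2 seeds."""
    centers = init(X, 2, None)
    # Assumption: validation uses the outer value, not the 2 requested above.
    return toy_validate(centers, outer_n_clusters)


def two_seed_init(X, n_clusters, random_state):
    # Same seed values as the failing example: 0 and len(X).
    return [[0], [len(X)]]


X = [[v] for v in range(10)]
try:
    toy_bisect_init(two_seed_init, X, outer_n_clusters=6)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

With outer_n_clusters=2 the same callable passes the toy check, which is the behaviour one might expect from each bisection step.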

A work-around is to encapsulate the "outer" n_clusters in the init function:

Capturing outer `n_clusters`
import numpy as np
import skimage as ski
from sklearn.cluster import BisectingKMeans


k = 6


def bk_init(X, n_clusters, random_state):
    ## this is still 2:
    print(n_clusters)

    i = np.zeros(
        (k, 1),         # NOTE not ``n_clusters``!
        dtype=np.uint,
    )

    i[0], i[1] = (0, len(X))

    return i


img   = ski.data.camera()
img_v = img.ravel()
img_d = img_v.reshape(-1, 1)

bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)
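A slightly tidier variant of the same work-around avoids the module-level k by building the callable in a factory. This is only a sketch: `make_padded_init` is a hypothetical helper name, and it keeps the same seed values (0 and len(X)) and zero padding as the example above. It does not explain why the padding rows influence the result.

```python
import numpy as np


def make_padded_init(outer_n_clusters):
    """Return an init callable that pads its two seeds to `outer_n_clusters` rows."""

    def init(X, n_clusters, random_state):
        # n_clusters is reported to always be 2 here; rows 2 and up are padding.
        seeds = np.zeros((outer_n_clusters, X.shape[1]), dtype=np.uint)
        seeds[0], seeds[1] = 0, len(X)
        return seeds

    return init


# Small stand-in for the image column so the callable can be inspected directly.
demo_X = np.arange(12).reshape(-1, 1)
demo_seeds = make_padded_init(6)(demo_X, 2, None)
print(demo_seeds.shape)
```

It would be passed as BisectingKMeans(n_clusters=6, init=make_padded_init(6)), so the outer value is stated once at the call site.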

Also see _BisectingTree.split, which appears to use only indices 0 and 1 (though I may be misreading it). Something else is clearly at play, however: reordering the remaining seeds changes the result in some, but not all, cases (I have not tried every permutation):

Different results from different seeds
print(bk.cluster_centers_)  # previous result
[[156.        ]
 [159.        ]
 [174.86169245]
 [ 12.        ]
 [ 10.        ]
 [  0.5       ]]


def bk_init(X, n_clusters, random_state):
    return np.array(
        ## different results to the below and to each other:
        # [[0], [64],    [729], [5555],   [1], [len(X)]],
        # [[0],  [1], [len(X)],   [64], [729],   [5555]],
        ## identical results (yes, even the first two):
        # [[0],    [729], [5555],      [1], [len(X)],   [64]],
        # [[0],   [5555],    [1], [len(X)],     [64],  [729]],
        # [[0], [len(X)],   [64],    [729],   [5555],    [1]],
        # [[0], [len(X)],  [729],   [5555],      [1],   [64]],
        # [[0], [len(X)], [5555],      [1],     [64],  [729]],
        # [[0], [len(X)],    [1],     [64],    [729], [5555]],
        [[0], [len(X)], [64], [729], [5555], [1]],
        dtype=np.uint,
    )


bk = BisectingKMeans(n_clusters=k, init=bk_init)
bk.fit(img_d)
print(bk.cluster_centers_)
[[127.49822064]
 [126.        ]
 [116.        ]
 [112.        ]
 [ 11.96384212]
 [  0.5       ]]

(I'd like to run a test using itertools.permutations.)
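One possible shape for such a test is sketched below. The helper `group_orderings_by_result` is hypothetical, and the `evaluate` callback is a placeholder: the toy version here pretends only the first two seeds matter, so the grouping logic can be checked without fitting anything.

```python
from collections import defaultdict
from itertools import permutations


def group_orderings_by_result(seeds, evaluate, fixed_prefix=1):
    """Group permutations of `seeds` by the hashable result `evaluate` returns.

    The first `fixed_prefix` seeds stay in place; the rest are permuted.
    """
    head, tail = tuple(seeds[:fixed_prefix]), tuple(seeds[fixed_prefix:])
    groups = defaultdict(list)
    for perm in permutations(tail):
        ordering = head + perm
        groups[evaluate(ordering)].append(ordering)
    return dict(groups)


# Toy evaluate: the "result" depends only on the first two seeds.
toy_groups = group_orderings_by_result([0, 10, 64, 729], lambda o: o[:2])
for result, orderings in toy_groups.items():
    print(result, len(orderings))
```

For the real check, `evaluate` could fit a BisectingKMeans whose init returns the ordering as a column array, and return something hashable such as a tuple of the rounded cluster_centers_. With six seeds and the first one fixed, that is 5! = 120 fits.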


Python and package versions
  • uv version: uv 0.9.26 (Homebrew 2026-01-15)
  • uv run python --version: Python 3.13.11
  • pyproject.toml, dependencies:
    • matplotlib>=3.10.8
    • multiprocess>=0.70.18
    • numpy>=2.3.5
    • scikit-image>=0.25.2
    • scikit-learn>=1.8.0
    • scipy>=1.16.3
  • uv run python -c 'import sklearn; sklearn.show_versions()':
    System:
        python: 3.13.11 (main, Dec  5 2025, 16:06:33) [Clang 17.0.0 (clang-1700.6.3.2)]
    executable: /<masked>/.venv/bin/python3
       machine: macOS-15.7.3-arm64-arm-64bit-Mach-O
    
    Python dependencies:
          sklearn: 1.8.0
              pip: None
       setuptools: None
            numpy: 2.3.5
            scipy: 1.16.3
           Cython: None
           pandas: None
       matplotlib: 3.10.8
           joblib: 1.5.3
    threadpoolctl: 3.6.0
    
    Built with OpenMP: True
    
    threadpoolctl info:
           user_api: openmp
       internal_api: openmp
        num_threads: 10
             prefix: libomp
           filepath: /<masked>/.venv/lib/python3.13/site-packages/sklearn/.dylibs/libomp.dylib
            version: None
    
