sklearn.mixture.gmm is not reproducible in version 1.3.2 vs 1.2.1 · Issue #27991 · scikit-learn/scikit-learn · GitHub

Closed
yotama2023 opened this issue Dec 20, 2023 · 2 comments
Labels
Bug Needs Triage Issue requires triage

Comments

@yotama2023
yotama2023 commented Dec 20, 2023

Describe the bug

Code using sklearn.mixture.gmm with a fixed random seed does not return the same result under scikit-learn 1.3.2 as under 1.2.1. The reason is that gmm.fit() uses, in some cases, the k-means++ algorithm. That algorithm was changed to accept a new sample_weight parameter that allows different weights for samples during clustering. Even though gmm.fit() calls k-means++ without using this new parameter, the random state inside the function _kmeans_plusplus now draws differently, even when initialized with the same seed.

See sklearn/cluster/_kmeans.py line 229 in version 1.3.2:

    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

versus line 210 in version 1.2.1:

    center_id = random_state.randint(n_samples)

Even when initialized with the same seed, the center_id in the two cases is not identical. Our experiments show that moving from random_state.randint() to random_state.choice() is not by itself the cause of the change: calling random_state.randint(n) and random_state.choice(n) with the same n returns the same number. It is the addition of the second argument p that changes the randomization process, even though the provided value is a uniform distribution.
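
To illustrate the point outside of scikit-learn (a minimal sketch with NumPy, not part of the original report): choice() without p yields the same value as randint() for the same seed, while an explicit uniform p consumes the random stream differently:

import numpy as np

n = 10
uniform_p = np.ones(n) / n

a = np.random.RandomState(42).randint(n)              # old k-means++ code path (<= 1.2.1)
b = np.random.RandomState(42).choice(n)               # choice without p delegates to randint
c = np.random.RandomState(42).choice(n, p=uniform_p)  # new code path (>= 1.3) with explicit weights

print(a, b, c)  # a == b, but c generally differs even though p is uniform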

Steps/Code to Reproduce

import sklearn
from numpy import array
from sklearn.mixture import GaussianMixture as gmm

sklearn.show_versions()
n_bins = 75
feature = array([[2.68317954], [0.07873421], [0.54561186], [0.56156012], [0.82741596], [1.34700796], [1.89033108], [0.56811307], [2.0302233], [0.24878048], [0.80742726], [1.6253749], [1.41693293], [1.09662143], [0.9809438], [1.19137182], [0.24412056], [0.12037048], [1.43140126], [1.17059844], [1.03371682], [0.30759353], [0.62804104], [1.20727346], [1.63631177], [0.254643], [0.32066954], [1.85571007], [1.80921926], [2.35790248], [0.06692233], [0.67287309], [1.94742094], [0.77336118], [1.39175475], [0.55658056], [1.30857636], [0.53104737], [1.6431949], [0.84389686], [1.18674111], [0.83311649], [0.70426956], [1.47487037], [1.60499282], [0.8834691], [0.40238353], [1.22823431], [1.44901594], [1.2031659], [1.07525068], [1.14052601], [1.65095783], [1.22532865], [1.45724683], [2.24075406], [1.12868571], [1.89691056], [1.72251306], [0.65992405], [0.55818081], [0.40910605], [0.95057302], [1.69607038], [0.1173572], [1.31089865], [0.96314239], [0.88541844], [0.32929388], [2.18347304], [0.15985512], [1.89792447], [0.37531604], [0.63067073], [0.24038388]])
gmm_obj = gmm(n_components=n_bins, init_params='k-means++', max_iter=150, random_state=42)
gmm_obj.fit(feature)
print(gmm_obj.predict(feature))

Expected Results

This is the output when using scikit-learn 1.2.1.
I expect the same output when using version 1.3.2 as well.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.1
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[ 4 56 49 67 44 37 57  9 19 62 38 55 48 28 45 64 12 27  8 47 20 22 17 41
 24 54 59 35 21 18 16 51 30  3 34 73 11 36 58 15 25 63 23 31  5 29 61 69
 60 66 43  0 50 13 42  7 52 74 39 33 73 32 53 14 68 71 10 72 40 26  6  2
  1 70 65]

Process finished with exit code 0

Actual Results

Using 1.3.2 with the same seed, I would expect the same numbers.
But that is not the case.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[10 53 48 67 62 32 63 55 12 60 34 47  5 42 38 64  3 11 49 46 30 44 21 65
 59 54 57 29  0  4 26 28 24 23 43 73 17 39  7 20 36 45  1 25 33  9 18 35
 58  8  2 52 56 69 37 27 22 15 19 50 73 61 16 41 68 71 51 72 13 14 31 74
 40 70 66]

Process finished with exit code 0

Versions

This issue concerns the difference between two versions. The output of sklearn.show_versions() is included above, in the Expected and Actual Results.

@yotama2023 yotama2023 added Bug Needs Triage Issue requires triage labels Dec 20, 2023
@glemaitre
Member

We acknowledge the change in the changelog: https://scikit-learn.org/dev/whats_new/v1.3.html#id4

The sample_weight parameter now will be used in centroids initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change will break backward compatibility, since numbers generated from same random seeds will be different. #25752 by Gleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.

This is thus expected behaviour and should not have an impact, statistically speaking.
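
For illustration (a quick sketch, not from the original comment): with uniform weights the weighted draw has the same distribution as the plain randint() draw; only the realized values for a given seed change:

import numpy as np

n = 5
p = np.ones(n) / n
a = np.random.RandomState(0).randint(n, size=100_000)  # pre-1.3 style draw
b = np.random.RandomState(0).choice(n, 100_000, p=p)   # 1.3+ style draw with uniform weights

print(np.bincount(a) / 1e5)  # ~[0.2, 0.2, 0.2, 0.2, 0.2]
print(np.bincount(b) / 1e5)  # roughly the same frequencies, although individual values differ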

@yotama2023
Author
yotama2023 commented Jan 17, 2024

@glemaitre over the past month I have seen how important backward compatibility is for our project, and we are likely not the only ones. It would be good if a respected project like sklearn supported it. May I suggest the following code change.

Instead of line 229 in sklearn/cluster/_kmeans.py (version 1.3.2):

center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

One can write the following four lines:

if np.all(sample_weight == sample_weight[0]):
    center_id = random_state.choice(n_samples)
else:
    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

This will ensure the same results as in previous versions.
Many thanks
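
For illustration (a hypothetical sketch, not an actual patch to sklearn/cluster/_kmeans.py; pick_first_center is an invented helper name), the proposed branch would restore the pre-1.3 draw whenever the weights are all equal:

import numpy as np

def pick_first_center(n_samples, sample_weight, random_state):
    # Proposed logic: skip the weighted draw when all weights are equal,
    # so the random stream matches scikit-learn <= 1.2.x.
    if np.all(sample_weight == sample_weight[0]):
        return random_state.choice(n_samples)
    return random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

n = 75
uniform = np.ones(n)
old = np.random.RandomState(42).randint(n)                      # 1.2.1 behaviour
new = pick_first_center(n, uniform, np.random.RandomState(42))  # proposed branch
assert old == new  # identical draw when weights are uniform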
