Description
Describe the bug
Code that uses sklearn.mixture.GaussianMixture (imported as gmm in the snippet below) with a fixed random seed does not return the same result under scikit-learn 1.3.2 as under 1.2.1. The reason is that gmm.fit() uses, in some cases, the k-means++ algorithm. That algorithm was extended to accept a new sample_weight parameter, which allows samples to be weighted differently during clustering. Although gmm.fit() calls k-means++ without passing this new parameter, the random state object inside the function _kmeans_plusplus now draws differently, even when initialized with the same seed.
See sklearn/cluster/_kmeans.py line 229 in version 1.3.2:
center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())
versus line 210 in version 1.2.1:
center_id = random_state.randint(n_samples)
Even when initialized with the same seed, center_id differs between the two versions. Our experiments show that the move from random_state.randint() to random_state.choice() is not in itself the cause of the change: calling random_state.randint(X) and random_state.choice(X) with the same X returns the same number. It is the addition of the second argument, p, that changes the randomization process, even though the value provided is a uniform distribution.
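The observation above can be checked in isolation, without scikit-learn. The following is a minimal sketch, assuming the integer seed ends up wrapped in a legacy numpy.random.RandomState (which is what sklearn.utils.check_random_state does for an int seed). The helper draw and the constants SEED, N_SAMPLES and N_DRAWS are illustrative names, not part of scikit-learn or NumPy.

```python
import numpy as np

SEED = 42        # same seed as in the reproduction
N_SAMPLES = 75   # same number of samples as in the reproduction
N_DRAWS = 20     # several draws, so an accidental match is very unlikely


def draw(method):
    """Draw indices from a freshly seeded RandomState using one of three calls."""
    rs = np.random.RandomState(SEED)
    if method == "randint":
        # Call used in 1.2.1
        return rs.randint(N_SAMPLES, size=N_DRAWS)
    if method == "choice":
        # choice() without the p argument
        return rs.choice(N_SAMPLES, size=N_DRAWS)
    if method == "choice_p":
        # Call used in 1.3.2, with uniform weights (assumed default)
        sample_weight = np.ones(N_SAMPLES)
        p = sample_weight / sample_weight.sum()
        return rs.choice(N_SAMPLES, size=N_DRAWS, p=p)
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    print("randint == choice:     ",
          np.array_equal(draw("randint"), draw("choice")))
    print("randint == choice(p=): ",
          np.array_equal(draw("randint"), draw("choice_p")))
```

If the first comparison reports equality and the second does not, that is consistent with the p argument alone changing the random stream, even though the distribution stays uniform.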
Steps/Code to Reproduce
import sklearn
from numpy import array
from sklearn.mixture import GaussianMixture as gmm
sklearn.show_versions()
n_bins = 75
feature = array([[2.68317954], [0.07873421], [0.54561186], [0.56156012], [0.82741596], [1.34700796], [1.89033108], [0.56811307], [2.0302233], [0.24878048], [0.80742726], [1.6253749], [1.41693293], [1.09662143], [0.9809438], [1.19137182], [0.24412056], [0.12037048], [1.43140126], [1.17059844], [1.03371682], [0.30759353], [0.62804104], [1.20727346], [1.63631177], [0.254643], [0.32066954], [1.85571007], [1.80921926], [2.35790248], [0.06692233], [0.67287309], [1.94742094], [0.77336118], [1.39175475], [0.55658056], [1.30857636], [0.53104737], [1.6431949], [0.84389686], [1.18674111], [0.83311649], [0.70426956], [1.47487037], [1.60499282], [0.8834691], [0.40238353], [1.22823431], [1.44901594], [1.2031659], [1.07525068], [1.14052601], [1.65095783], [1.22532865], [1.45724683], [2.24075406], [1.12868571], [1.89691056], [1.72251306], [0.65992405], [0.55818081], [0.40910605], [0.95057302], [1.69607038], [0.1173572], [1.31089865], [0.96314239], [0.88541844], [0.32929388], [2.18347304], [0.15985512], [1.89792447], [0.37531604], [0.63067073], [0.24038388]])
gmm_obj = gmm(n_components=n_bins, init_params='k-means++', max_iter=150, random_state=42)
gmm_obj.fit(feature)
print(gmm_obj.predict(feature))
Expected Results
The output below was produced with scikit-learn 1.2.1.
I expect version 1.3.2 to produce the same numbers.
System:
python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct 2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
machine: macOS-13.5.1-arm64-arm-64bit
Python dependencies:
sklearn: 1.2.1
pip: 23.3.1
setuptools: 65.5.1
numpy: 1.26.2
scipy: 1.11.4
Cython: None
pandas: None
matplotlib: None
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
version: 0.3.21.dev
threading_layer: pthreads
architecture: armv8
[ 4 56 49 67 44 37 57 9 19 62 38 55 48 28 45 64 12 27 8 47 20 22 17 41
24 54 59 35 21 18 16 51 30 3 34 73 11 36 58 15 25 63 23 31 5 29 61 69
60 66 43 0 50 13 42 7 52 74 39 33 73 32 53 14 68 71 10 72 40 26 6 2
1 70 65]
Actual Results
With 1.3.2 I would expect the same numbers, since the same seed is used, but that is not the case.
System:
python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct 2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
machine: macOS-13.5.1-arm64-arm-64bit
Python dependencies:
sklearn: 1.3.2
pip: 23.3.1
setuptools: 65.5.1
numpy: 1.26.2
scipy: 1.11.4
Cython: None
pandas: None
matplotlib: None
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
version: 0.3.21.dev
threading_layer: pthreads
architecture: armv8
[10 53 48 67 62 32 63 55 12 60 34 47 5 42 38 64 3 11 49 46 30 44 21 65
59 54 57 29 0 4 26 28 24 23 43 73 17 39 7 20 36 45 1 25 33 9 18 35
58 8 2 52 56 69 37 27 22 15 19 50 73 61 16 41 68 71 51 72 13 14 31 74
40 70 66]
Versions
This issue concerns a difference between two versions. The output of sklearn.show_versions() for each version is included above, in the expected and actual results.