sklearn.mixture.gmm is not reproducible in version 1.3.2 vs 1.2.1 · Issue #27991 · scikit-learn/scikit-learn · GitHub

Closed
yotama2023 opened this issue Dec 20, 2023 · 2 comments
Labels
Bug Needs Triage Issue requires triage

Comments

@yotama2023
yotama2023 commented Dec 20, 2023

Describe the bug

Code using sklearn.mixture.gmm with a fixed random seed does not return the same result under scikit-learn 1.3.2 as under 1.2.1. The reason is that gmm.fit() uses, in some cases, the k-means++ algorithm. That algorithm was changed to accept a new sample_weight parameter that allows different weights for samples during clustering. Even though gmm.fit() calls k-means++ without using this new parameter, the random state inside the function _kmeans_plusplus now draws differently, even when initialized with the same seed.

See sklearn/cluster/_kmeans.py line 229 in version 1.3.2:

    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

versus line 210 in version 1.2.1:

    center_id = random_state.randint(n_samples)

Even when initialized with the same seed, the center_id in the two cases is not identical. Our experiments show that moving from random_state.randint() to random_state.choice() is not by itself the cause of the change: calling random_state.randint(n) and random_state.choice(n) with the same n returns the same number. It is the addition of the second argument p that changes the randomization process, even though the provided value is a uniform distribution.
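
To illustrate the point outside of scikit-learn (a minimal sketch with NumPy, not part of the original report): choice() without p yields the same value as randint() for the same seed, while an explicit uniform p consumes the random stream differently:

import numpy as np

n = 10
uniform_p = np.ones(n) / n

a = np.random.RandomState(42).randint(n)              # old k-means++ code path (<= 1.2.1)
b = np.random.RandomState(42).choice(n)               # choice without p delegates to randint
c = np.random.RandomState(42).choice(n, p=uniform_p)  # new code path (>= 1.3) with explicit weights

print(a, b, c)  # a == b, but c generally differs even though p is uniform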

Steps/Code to Reproduce

import sklearn
from numpy import array
from sklearn.mixture import GaussianMixture as gmm

sklearn.show_versions()
n_bins = 75
feature = array([[2.68317954], [0.07873421], [0.54561186], [0.56156012], [0.82741596], [1.34700796], [1.89033108], [0.56811307], [2.0302233], [0.24878048], [0.80742726], [1.6253749], [1.41693293], [1.09662143], [0.9809438], [1.19137182], [0.24412056], [0.12037048], [1.43140126], [1.17059844], [1.03371682], [0.30759353], [0.62804104], [1.20727346], [1.63631177], [0.254643], [0.32066954], [1.85571007], [1.80921926], [2.35790248], [0.06692233], [0.67287309], [1.94742094], [0.77336118], [1.39175475], [0.55658056], [1.30857636], [0.53104737], [1.6431949], [0.84389686], [1.18674111], [0.83311649], [0.70426956], [1.47487037], [1.60499282], [0.8834691], [0.40238353], [1.22823431], [1.44901594], [1.2031659], [1.07525068], [1.14052601], [1.65095783], [1.22532865], [1.45724683], [2.24075406], [1.12868571], [1.89691056], [1.72251306], [0.65992405], [0.55818081], [0.40910605], [0.95057302], [1.69607038], [0.1173572], [1.31089865], [0.96314239], [0.88541844], [0.32929388], [2.18347304], [0.15985512], [1.89792447], [0.37531604], [0.63067073], [0.24038388]])
gmm_obj = gmm(n_components=n_bins, init_params='k-means++', max_iter=150, random_state=42)
gmm_obj.fit(feature)
print(gmm_obj.predict(feature))

Expected Results

This is the output when using scikit-learn 1.2.1.
I expect the same output when using version 1.3.2 as well.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.1
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[ 4 56 49 67 44 37 57  9 19 62 38 55 48 28 45 64 12 27  8 47 20 22 17 41
 24 54 59 35 21 18 16 51 30  3 34 73 11 36 58 15 25 63 23 31  5 29 61 69
 60 66 43  0 50 13 42  7 52 74 39 33 73 32 53 14 68 71 10 72 40 26  6  2
  1 70 65]

Process finished with exit code 0

Actual Results

Using 1.3.2 with the same seed, I would expect the same numbers.
But that is not the case.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[10 53 48 67 62 32 63 55 12 60 34 47  5 42 38 64  3 11 49 46 30 44 21 65
 59 54 57 29  0  4 26 28 24 23 43 73 17 39  7 20 36 45  1 25 33  9 18 35
 58  8  2 52 56 69 37 27 22 15 19 50 73 61 16 41 68 71 51 72 13 14 31 74
 40 70 66]

Process finished with exit code 0

Versions

This issue concerns the difference between two versions. The output of sklearn.show_versions() is included above, in the Expected and Actual Results.

@yotama2023 yotama2023 added Bug Needs Triage Issue requires triage labels Dec 20, 2023
@glemaitre
Member

We acknowledge the change in the changelog: https://scikit-learn.org/dev/whats_new/v1.3.html#id4

The sample_weight parameter now will be used in centroids initialization for cluster.KMeans, cluster.BisectingKMeans and cluster.MiniBatchKMeans. This change will break backward compatibility, since numbers generated from same random seeds will be different. #25752 by Gleb Levitski, Jérémie du Boisberranger, Guillaume Lemaitre.

This is thus expected behaviour and should not have an impact, statistically speaking.
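
For illustration (a quick sketch, not from the original comment): with uniform weights the weighted draw has the same distribution as the plain randint() draw; only the realized values for a given seed change:

import numpy as np

n = 5
p = np.ones(n) / n
a = np.random.RandomState(0).randint(n, size=100_000)  # pre-1.3 style draw
b = np.random.RandomState(0).choice(n, 100_000, p=p)   # 1.3+ style draw with uniform weights

print(np.bincount(a) / 1e5)  # ~[0.2, 0.2, 0.2, 0.2, 0.2]
print(np.bincount(b) / 1e5)  # roughly the same frequencies, although individual values differ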

@yotama2023
Author
yotama2023 commented Jan 17, 2024

@glemaitre over the past month I have seen how important backward compatibility is for our project, and we are likely not the only ones. It would be good if a respected project like sklearn supported it. May I suggest the following code change.

Instead of line 229 in sklearn/cluster/_kmeans.py (version 1.3.2):

center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

One can write the following four lines:

if np.all(sample_weight == sample_weight[0]):
    center_id = random_state.choice(n_samples)
else:
    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

This will ensure the same results as in previous versions.
Many thanks
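
For illustration (a hypothetical sketch, not an actual patch to sklearn/cluster/_kmeans.py; pick_first_center is an invented helper name), the proposed branch would restore the pre-1.3 draw whenever the weights are all equal:

import numpy as np

def pick_first_center(n_samples, sample_weight, random_state):
    # Proposed logic: skip the weighted draw when all weights are equal,
    # so the random stream matches scikit-learn <= 1.2.x.
    if np.all(sample_weight == sample_weight[0]):
        return random_state.choice(n_samples)
    return random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

n = 75
uniform = np.ones(n)
old = np.random.RandomState(42).randint(n)                      # 1.2.1 behaviour
new = pick_first_center(n, uniform, np.random.RandomState(42))  # proposed branch
assert old == new  # identical draw when weights are uniform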
