sklearn.mixture.gmm is not reproducible in version 1.3.2 vs 1.2.1 · Issue #27991 · scikit-learn/scikit-learn
Closed
@yotama2023

Describe the bug

Code using sklearn.mixture.GaussianMixture (imported as gmm below) with a fixed random seed does not return the same result under scikit-learn 1.3.2 as it does under 1.2.1. The reason is that gmm.fit() uses, in some cases, the k-means++ algorithm. This algorithm was extended to accept a new sample_weight parameter that allows different weights for samples during clustering. Although gmm.fit() calls k-means++ without passing this new parameter, the random state object inside the function _kmeans_plusplus now draws differently, even when initialized with the same seed.

See sklearn/cluster/_kmeans.py line 229 in version 1.3.2:

    center_id = random_state.choice(n_samples, p=sample_weight / sample_weight.sum())

versus line 210 in version 1.2.1:

    center_id = random_state.randint(n_samples)

Even when initialized with the same seed, the center_id is not identical in the two cases. Our experiments show that moving from random_state.randint() to random_state.choice() is not, by itself, the cause of the change: calling random_state.randint(X) and random_state.choice(X) with the same argument X and the same seed returns the same number. It is the addition of the second argument p that changes the randomization process, even though the value provided is a uniform distribution.
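The comparison described above can be sketched with NumPy alone, without going through scikit-learn. This is a minimal illustration, not the code in _kmeans.py: the helper name first_center_ids and the uniform_p array (standing in for sample_weight / sample_weight.sum() when no weights are passed) are assumptions made for the example. It uses np.random.RandomState because that is the kind of object an integer random_state is turned into.

```python
import numpy as np


def first_center_ids(seed, n_samples, n_draws=20):
    """Draw indices three ways from identically seeded RandomState objects.

    Hypothetical helper for illustration only; it is not part of scikit-learn.
    """
    # Uniform probabilities, standing in for sample_weight / sample_weight.sum()
    # when every sample has the same weight.
    uniform_p = np.full(n_samples, 1.0 / n_samples)

    # One independent generator per method, all started from the same seed.
    rs_randint = np.random.RandomState(seed)
    rs_choice = np.random.RandomState(seed)
    rs_choice_p = np.random.RandomState(seed)

    return {
        # Style of the 1.2.1 line: random_state.randint(n_samples)
        "randint": [int(rs_randint.randint(n_samples)) for _ in range(n_draws)],
        # choice() without p
        "choice": [int(rs_choice.choice(n_samples)) for _ in range(n_draws)],
        # Style of the 1.3.2 line: random_state.choice(n_samples, p=...)
        "choice_p": [
            int(rs_choice_p.choice(n_samples, p=uniform_p)) for _ in range(n_draws)
        ],
    }


draws = first_center_ids(seed=42, n_samples=75)
for name, ids in draws.items():
    print(f"{name:>9}: {ids[:5]}")
print("randint == choice:          ", draws["randint"] == draws["choice"])
print("randint == choice(p=uniform):", draws["randint"] == draws["choice_p"])
```

Comparing a sequence of draws rather than a single value avoids being misled by one accidental match. Each method is reproducible on its own for a given seed; the point is only that the p argument selects a different sampling path, so the same seed no longer maps to the same indices.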

Steps/Code to Reproduce

import sklearn
from numpy import array
from sklearn.mixture import GaussianMixture as gmm

sklearn.show_versions()
n_bins = 75
feature = array([[2.68317954], [0.07873421], [0.54561186], [0.56156012], [0.82741596], [1.34700796], [1.89033108], [0.56811307], [2.0302233], [0.24878048], [0.80742726], [1.6253749], [1.41693293], [1.09662143], [0.9809438], [1.19137182], [0.24412056], [0.12037048], [1.43140126], [1.17059844], [1.03371682], [0.30759353], [0.62804104], [1.20727346], [1.63631177], [0.254643], [0.32066954], [1.85571007], [1.80921926], [2.35790248], [0.06692233], [0.67287309], [1.94742094], [0.77336118], [1.39175475], [0.55658056], [1.30857636], [0.53104737], [1.6431949], [0.84389686], [1.18674111], [0.83311649], [0.70426956], [1.47487037], [1.60499282], [0.8834691], [0.40238353], [1.22823431], [1.44901594], [1.2031659], [1.07525068], [1.14052601], [1.65095783], [1.22532865], [1.45724683], [2.24075406], [1.12868571], [1.89691056], [1.72251306], [0.65992405], [0.55818081], [0.40910605], [0.95057302], [1.69607038], [0.1173572], [1.31089865], [0.96314239], [0.88541844], [0.32929388], [2.18347304], [0.15985512], [1.89792447], [0.37531604], [0.63067073], [0.24038388]])
gmm_obj = gmm(n_components=n_bins, init_params='k-means++', max_iter=150, random_state=42)
gmm_obj.fit(feature)
print(gmm_obj.predict(feature))

Expected Results

This is the output when you use scikit-learn 1.2.1.
I expect the same output numbers to be obtained when using version 1.3.2 as well.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.1
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[ 4 56 49 67 44 37 57  9 19 62 38 55 48 28 45 64 12 27  8 47 20 22 17 41
 24 54 59 35 21 18 16 51 30  3 34 73 11 36 58 15 25 63 23 31  5 29 61 69
 60 66 43  0 50 13 42  7 52 74 39 33 73 32 53 14 68 71 10 72 40 26  6  2
  1 70 65]

Process finished with exit code 0

Actual Results

With 1.3.2 I would expect the same numbers, since the same seed is used, but that is not the case.

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/yotamabramson/PycharmProjects/testProj/venv/bin/python
   machine: macOS-13.5.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.2
          pip: 23.3.1
   setuptools: 65.5.1
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/yotamabramson/PycharmProjects/testProj/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
[10 53 48 67 62 32 63 55 12 60 34 47  5 42 38 64  3 11 49 46 30 44 21 65
 59 54 57 29  0  4 26 28 24 23 43 73 17 39  7 20 36 45  1 25 33  9 18 35
 58  8  2 52 56 69 37 27 22 15 19 50 73 61 16 41 68 71 51 72 13 14 31 74
 40 70 66]

Process finished with exit code 0
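To characterize how the two printed outputs differ, the label arrays can be compared up to a renaming of component indices. The sketch below copies the two arrays from the outputs above; the helper name partition is an assumption for this example and is not a scikit-learn function.

```python
from collections import defaultdict

# Predicted labels copied from the 1.2.1 output above.
labels_1_2_1 = [
    4, 56, 49, 67, 44, 37, 57, 9, 19, 62, 38, 55, 48, 28, 45, 64, 12, 27, 8, 47,
    20, 22, 17, 41, 24, 54, 59, 35, 21, 18, 16, 51, 30, 3, 34, 73, 11, 36, 58,
    15, 25, 63, 23, 31, 5, 29, 61, 69, 60, 66, 43, 0, 50, 13, 42, 7, 52, 74, 39,
    33, 73, 32, 53, 14, 68, 71, 10, 72, 40, 26, 6, 2, 1, 70, 65,
]
# Predicted labels copied from the 1.3.2 output above.
labels_1_3_2 = [
    10, 53, 48, 67, 62, 32, 63, 55, 12, 60, 34, 47, 5, 42, 38, 64, 3, 11, 49, 46,
    30, 44, 21, 65, 59, 54, 57, 29, 0, 4, 26, 28, 24, 23, 43, 73, 17, 39, 7, 20,
    36, 45, 1, 25, 33, 9, 18, 35, 58, 8, 2, 52, 56, 69, 37, 27, 22, 15, 19, 50,
    73, 61, 16, 41, 68, 71, 51, 72, 13, 14, 31, 74, 40, 70, 66,
]


def partition(labels):
    """Group sample indices by label, ignoring the label values themselves."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return {frozenset(members) for members in groups.values()}


same_labels = labels_1_2_1 == labels_1_3_2
same_partition = partition(labels_1_2_1) == partition(labels_1_3_2)
shared = sorted(sorted(g) for g in partition(labels_1_2_1) if len(g) > 1)

print("identical label arrays:       ", same_labels)
print("identical grouping of samples:", same_partition)
print("groups with more than one sample (1.2.1):", shared)
```

On these two arrays the label values differ, but the grouping of samples is the same: samples 35 and 60 share a component in both runs and every other sample sits in its own component. This only describes the difference between the printed outputs; it does not change the fact that the numbers returned for a given seed are not stable across the two versions.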

Versions

This issue concerns the difference between two versions. The output of sklearn.show_versions() for each version is included above, in the expected and actual results.
