@glemaitre Over the past month I have seen how important backward compatibility is for our project, and we are probably not the only ones. It would be good if a well-regarded project like sklearn supported it. May I suggest the following code change?
Instead of line 229 in sklearn/cluster/_kmeans.py (version 1.3.2):
Describe the bug
Code that uses sklearn.mixture.GaussianMixture (gmm below) with a fixed random seed does not return the same result under scikit-learn 1.3.2 as under 1.2.1. The reason is that gmm.fit() uses, in some cases, the k-means++ algorithm. That algorithm was extended with a new sample_weight parameter, which lets samples carry different weights during clustering. Although gmm.fit() calls k-means++ without using this new parameter, the random object inside the function _kmeans_plusplus now draws differently, even when initialized with the same seed.
See sklearn/cluster/_kmeans.py line 229 in version 1.3.2:
versus line 210 in version 1.2.1:
Even when initialized with the same seed, the center_id in the two cases is not identical. Our experiments show that the move from random_state.randint() to random_state.choice() is not itself the cause of the change: calling random_state.randint(X) and random_state.choice(X) for the same X returns the same number. It is the addition of the second argument, p, that changes the randomization process, even though the value provided is a uniform distribution.
Steps/Code to Reproduce
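A minimal sketch of the RNG behavior described above, using NumPy alone rather than scikit-learn. The helper name `first_center_ids`, the seeds, and the sample count are illustrative assumptions, not taken from `_kmeans.py`; the three draws only mirror the calls named in the description.

```python
import numpy as np


def first_center_ids(seed, n_samples):
    """Draw one index three ways, each from a fresh RandomState with the same seed.

    Hypothetical helper for illustration only; it is not scikit-learn code.
    """
    # Plain integer draw, the call the description attributes to 1.2.1.
    via_randint = int(np.random.RandomState(seed).randint(n_samples))

    # choice() without the p argument.
    via_choice = int(np.random.RandomState(seed).choice(n_samples))

    # choice() with an explicit uniform p, the call the description
    # attributes to 1.3.2 when no sample weights are supplied.
    uniform_p = np.ones(n_samples) / n_samples
    via_choice_p = int(
        np.random.RandomState(seed).choice(n_samples, p=uniform_p)
    )

    return via_randint, via_choice, via_choice_p


if __name__ == "__main__":
    # The seeds and n_samples=1000 are arbitrary values for the demonstration.
    for seed in range(5):
        print(seed, first_center_ids(seed, n_samples=1000))
```

If the observation above holds, the first two values agree for every seed, while the third generally differs. A likely explanation is that, with p given, the legacy RandomState.choice() draws a uniform float and looks it up in the cumulative distribution instead of drawing an integer directly, so the same seed is consumed in a different way.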
Expected Results
This is the output when you use scikit-learn 1.2.1.
I expect the same output numbers to be obtained when using version 1.3.2 as well.
Actual Results
With 1.3.2 I would expect the same numbers, since the same seed was used, but that is not the case.
Versions
This issue concerns the difference between two versions. The result of sklearn.show_versions() is included above, in the expected and actual outputs.