Description
Describe the bug
I am running KMeans with a fixed random_state. It produces the same clusters on every run but does not always assign the same label values to those clusters. I realize that the labels have no inherent significance, but I am writing a notebook, and this makes it impossible to refer to the clusters by label number, so it would be very helpful for the labels to be consistent as well. I have noticed that setting n_jobs=1 prevents this from occurring, but n_jobs is deprecated, so that is not a long-term solution.
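As a stopgap on the notebook side, the labels could be renumbered after fitting so that they depend only on the partition and the order of the data, not on which integer KMeans happened to assign. The sketch below is only an illustration: relabel_by_first_appearance is a hypothetical helper, not part of scikit-learn, and it assumes the rows of x are in the same order on every run.

```python
def relabel_by_first_appearance(labels):
    """Renumber cluster labels as 0, 1, 2, ... in order of first appearance.

    Hypothetical helper, not a scikit-learn API. Accepts any iterable of
    integer labels (a list or a NumPy array such as km.labels_) and
    returns a plain Python list.
    """
    mapping = {}    # original label -> canonical label
    canonical = []
    for label in labels:
        label = int(label)
        if label not in mapping:
            # The next unused canonical label is the number of labels seen so far.
            mapping[label] = len(mapping)
        canonical.append(mapping[label])
    return canonical


# Two labelings of the same partition that differ only by a label swap
run_a = relabel_by_first_appearance([0, 1, 1, 2, 0, 0])
run_b = relabel_by_first_appearance([0, 2, 2, 1, 0, 0])
print(run_a == run_b)
```

In the reproduction script this would be applied as relabel_by_first_appearance(km.predict(x)). It hides the symptom in the notebook but does not address the underlying non-determinism.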
Steps/Code to Reproduce
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2)
x = np.random.normal(size=(1800, 2))
x[:700, 0] += 3
x[:700, 1] += 3
x[700:1200, 0] -= 0.5
x[700:1200, 1] -= 0.5
x[1200:, 0] += 3
x[1200:, 1] -= 3
np.random.shuffle(x)
first = None
while True:  # it typically only takes a few iterations for a difference to occur
    km = KMeans(n_clusters=3, random_state=10)
    km.fit(x)
    pred = km.predict(x)
    if first is None:
        first = pred
    elif not np.array_equal(first, pred):
        print(first)
        print(pred)
        fig, ax = plt.subplots(1, 2)
        for label in range(3):
            clusters = x[first == label]
            cluster = x[pred == label]
            ax[0].scatter(clusters[:, 0], clusters[:, 1], label=label)
            ax[1].scatter(cluster[:, 0], cluster[:, 1], label=label)
        break
ax[0].legend()
ax[1].legend()
plt.show()
Expected Results
The labels are the same on every run.
Actual Results
[0 1 1 ... 2 0 0]
[0 2 2 ... 1 0 0]
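The two printed arrays look like the same partition with labels 1 and 2 swapped. A check along the following lines could confirm that two label arrays differ only by a permutation of label values; same_partition is a hypothetical helper written for this report, not a scikit-learn function.

```python
def same_partition(a, b):
    """Return True if labelings a and b describe the same clusters,
    i.e. they differ only by a one-to-one renaming of label values.

    Hypothetical helper; accepts lists or NumPy arrays of integer labels.
    """
    a = [int(v) for v in a]
    b = [int(v) for v in b]
    if len(a) != len(b):
        return False
    forward, backward = {}, {}  # label in a -> label in b, and the reverse
    for label_a, label_b in zip(a, b):
        # Each label in a must always map to the same label in b ...
        if forward.setdefault(label_a, label_b) != label_b:
            return False
        # ... and each label in b must come from a single label in a.
        if backward.setdefault(label_b, label_a) != label_a:
            return False
    return True


print(same_partition([0, 1, 1, 2, 0, 0], [0, 2, 2, 1, 0, 0]))
```

In the reproduction script, same_partition(first, pred) would be called inside the elif branch, on the full arrays rather than on the truncated printout.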
Versions
System:
python: 3.8.3 (default, May 19 2020, 13:54:14) [Clang 10.0.0 ]
executable: /Users/devin/anaconda3/envs/sw38/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
pip: 20.2.2
setuptools: 52.0.0.post20210125
sklearn: 0.24.2
numpy: 1.18.1
scipy: 1.4.1
Cython: 0.29.21
pandas: 1.0.3
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True