Describe the bug
When fitting PCA on a large array that still fits in memory (approximately 300 000 x 750), aiming for roughly a threefold dimensionality reduction, the behaviour is unstable: in most cases the fit succeeds, but occasionally it hangs "forever". The random seed is fixed, but unfortunately that does not help here.
The svd_solver option is left at its default. Explicitly specifying svd_solver='arpack' does not appear to trigger the issue.
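For context, my understanding (from reading the scikit-learn 1.0 source, so the exact thresholds below are an assumption rather than documented behaviour) is that for this input shape the default svd_solver='auto' resolves to the randomized solver. A minimal sketch of that decision rule:

# Sketch of how svd_solver='auto' appears to pick a solver for this shape;
# thresholds taken from my reading of the source, not a documented API.
n_samples, n_features = 328039, 768
n_components = 260

if max(n_samples, n_features) <= 500:
    chosen = 'full'
elif 1 <= n_components < 0.8 * min(n_samples, n_features):
    chosen = 'randomized'  # 260 < 0.8 * 768 = 614.4, so this branch applies
else:
    chosen = 'full'

print(chosen)  # prints 'randomized', so the hang seems tied to that solver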
Steps/Code to Reproduce
import random

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
random.seed(42)

if __name__ == '__main__':
    # ~328k x 768 float32 array (roughly 1 GB), fits comfortably in RAM
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)
    for i in range(20):
        print('Iteration', i)
        pca = PCA(n_components=260, random_state=42)
        pca = pca.fit(embeddings)
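For comparison, the variant below pins the solver explicitly (reusing the embeddings array from the snippet above); as noted in the description, this does not appear to hang:

# Same loop with the solver pinned explicitly; per the observation above,
# this variant does not appear to hang. `embeddings` is reused from the
# reproduction snippet.
for i in range(20):
    print('Iteration (arpack)', i)
    pca = PCA(n_components=260, svd_solver='arpack', random_state=42)
    pca = pca.fit(embeddings)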
Expected Results
Stable behaviour: every fit completes successfully in finite time.
Actual Results
The process sometimes hangs, most often within the first 10 iterations. I have not found any consistent pattern in when it fails.
Versions
System:
python: 3.9.7 | packaged by conda-forge
machine: Windows-10-10.0.19043-SP0
RAM: 32 GB
Python dependencies:
pip: 21.3.1
setuptools: 60.8.1
sklearn: 1.0.1
numpy: 1.20.1
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: None
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True