PCA hangs occasionally when applied to data that fits in memory #22434
Open
@IlyaKachan

Description

Describe the bug

When fitting PCA on a fairly large array that nevertheless fits in memory (approximately 300,000 × 750), aiming for roughly a 3× dimensionality reduction, the results are unstable: in most cases the fit succeeds, but occasionally it hangs "forever".

The random seed is fixed, but unfortunately that does not help in this case.
The svd_solver option is left at the default. Explicitly specifying svd_solver='arpack' appears to avoid the issue (see the workaround sketch below).
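For completeness, the workaround I'm using in the meantime is the reproducer below with only the solver changed; whether arpack is acceptably fast at this scale is a separate question:

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.RandomState(42).normal(size=(328039, 768)).astype(np.float32)

# 'arpack' requires 0 < n_components < min(X.shape); here 260 < 768, so it applies
pca = PCA(n_components=260, svd_solver='arpack', random_state=42)
pca = pca.fit(embeddings)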

Steps/Code to Reproduce

import random

import numpy as np

from sklearn.decomposition import PCA

# Global seeds; the PCA below uses its own RandomState, so these are belt-and-braces
np.random.seed(42)
random.seed(42)


if __name__ == '__main__':
    # ~328k x 768 float32 array, about 1 GB, comfortably within 32 GB of RAM
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)

    # Repeat the fit with a fixed random_state; the hang is nondeterministic
    for i in range(20):
        print('Iteration', i)
        pca = PCA(n_components=260, random_state=42)
        pca = pca.fit(embeddings)
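As far as I can tell from the 1.0 docs, the default svd_solver='auto' resolves to 'randomized' for this shape, so the hang presumably happens inside the randomized SVD path. My reading of the documented selection rule (a sketch from the docstring, not taken from the source code):

# Which solver 'auto' picks for this input, per the PCA docstring
n_samples, n_features = 328039, 768
n_components = 260

if max(n_samples, n_features) <= 500:
    solver = 'full'        # small inputs get the exact LAPACK SVD
elif 1 <= n_components < 0.8 * min(n_samples, n_features):
    solver = 'randomized'  # 260 < 0.8 * 768 = 614.4, so this branch is taken
else:
    solver = 'full'

print(solver)  # -> randomized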

Expected Results

Stable behavior: every fit completes successfully in finite time.

Actual Results

Sometimes the process hangs, most often within the first 10 iterations; I have not found any consistent pattern in when it fails.
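In case it helps triage, this is what I plan to run the next time it stalls: faulthandler to dump all Python thread stacks if an iteration takes too long, and threadpoolctl (already installed, see versions below) to test whether limiting native BLAS/OpenMP threads makes the hang disappear. Treat the threading angle as an assumption on my part, not a confirmed diagnosis:

import faulthandler
import sys

import numpy as np
from sklearn.decomposition import PCA
from threadpoolctl import threadpool_limits

embeddings = np.random.RandomState(42).normal(size=(328039, 768)).astype(np.float32)

for i in range(20):
    print('Iteration', i)
    # Dump tracebacks of all threads and exit if this fit takes over 10 minutes
    faulthandler.dump_traceback_later(timeout=600, exit=True, file=sys.stderr)
    # Limit native BLAS/OpenMP threads to 1 to test the threading hypothesis
    with threadpool_limits(limits=1):
        PCA(n_components=260, random_state=42).fit(embeddings)
    faulthandler.cancel_dump_traceback_later()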

Versions

System:
   python: 3.9.7 | packaged by conda-forge 
   machine: Windows-10-10.0.19043-SP0
   RAM: 32 GB

Python dependencies:
          pip: 21.3.1
   setuptools: 60.8.1
      sklearn: 1.0.1
        numpy: 1.20.1
        scipy: 1.7.1
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
