PCA hangs occasionally when applied to data that fits in memory · Issue #22434 · scikit-learn/scikit-learn · GitHub

Open
IlyaKachan opened this issue Feb 10, 2022 · 10 comments
@IlyaKachan

Describe the bug

When fitting PCA on a large array that fits in memory (approximately 300 000 x 750), aiming for roughly a 3x dimensionality reduction, the results are unstable: in most cases the fit succeeds, but occasionally it hangs "forever".

The random seed is fixed, but unfortunately that does not help in this case.
The svd_solver option is left at the default. Explicitly specifying svd_solver='arpack' does not appear to trigger the issue.
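For reference, the arpack path mentioned above can be selected explicitly. A sketch (not part of the original report, with dimensions reduced so it runs quickly):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).normal(size=(5000, 100)).astype(np.float32)
# arpack requires 0 < n_components < min(n_samples, n_features)
pca = PCA(n_components=30, svd_solver="arpack", random_state=42).fit(X)
print(pca.components_.shape)  # (30, 100)
```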

Steps/Code to Reproduce

import numpy as np
import random

np.random.seed(42)
random.seed(42)

from sklearn.decomposition import PCA


if __name__ == '__main__':
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)

    for i in range(20):
        print('Iteration', i)
        pca = PCA(n_components=260, random_state=42)
        pca = pca.fit(embeddings)

Expected Results

Stable behavior: the fit consistently completes in finite time.

Actual Results

Sometimes the process hangs, most often within the first 10 iterations. I haven't found any consistent pattern in when it fails.

Versions

System:
   python: 3.9.7 | packaged by conda-forge 
   machine: Windows-10-10.0.19043-SP0
   RAM: 32Gb

Python dependencies:
          pip: 21.3.1
   setuptools: 60.8.1
      sklearn: 1.0.1
        numpy: 1.20.1
        scipy: 1.7.1
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@IlyaKachan IlyaKachan added Bug Needs Triage Issue requires triage labels Feb 10, 2022
@glemaitre
Member

I cannot reproduce with the following system and packages version:

System:
    python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21)  [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
   machine: macOS-12.1-arm64-arm-64bit

Python dependencies:
          pip: 21.3
   setuptools: 58.2.0
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.8.0.dev0+1902.b795164
       Cython: 0.29.24
       pandas: 1.3.3
   matplotlib: 3.4.3
       joblib: 1.2.0.dev0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libopenblas_vortexp-r0.3.18.dylib
        version: 0.3.18
threading_layer: openmp
   architecture: VORTEX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libomp.dylib
        version: None
    num_threads: 8

@thomasjpfan thomasjpfan added Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Mar 25, 2022
@Micky774
Contributor

Huh, I'm actually able to reproduce this error semi-reliably. When it occurs still varies -- in one instance it started hanging on iteration 3, in another on iteration 7 -- but I think there's a genuine issue here. On average one iteration takes ~25 sec, but after letting a stuck fit run for a few minutes I noticed no significant change in my system's resource usage (memory or CPU).

I have no clue what could be causing this but I'll try to take a look.

Here are my versions:

System:
    python: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
executable: R:\ProgramFiles\anaconda3\envs\scikit-dev\python.exe
   machine: Windows-10-10.0.19043-SP0

Python dependencies:
      sklearn: 1.1.dev0
          pip: 21.2.4
   setuptools: 58.0.4
        numpy: 1.21.5
        scipy: 1.7.3
       Cython: 0.29.26
       pandas: 1.3.5
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: R:\ProgramFiles\anaconda3\envs\scikit-dev\Lib\site-packages\numpy\.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Zen
    num_threads: 12

       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Windows\System32\vcomp140.dll
        version: None
    num_threads: 12

@alexrockhill
alexrockhill commented Jun 6, 2022

Here is a workaround

import numpy as np
from multiprocessing import Pool, TimeoutError
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(328039, 768)).astype(np.float32)

def fit_pca(X):
    return PCA(n_components=260, random_state=42).fit(X)

for i in range(20):
    print(f'Iteration {i}')
    with Pool(processes=1) as pool:
        res = pool.apply_async(fit_pca, (X,))
        try:
            pca = res.get(timeout=10)
        except TimeoutError:
            print('Timed out')
            pca = None
    print(pca)

Interestingly, I've found that when PCA is fit many times in the same loop with few computations and disk I/O operations between fits, I run into the issue, but when I do a similar number of PCA fits with many more computations and I/O in between, it consistently does not hang. I'm not sure whether this means the issue is in some C/Cython code, but it seemed interesting and might help with troubleshooting.

@bfbarry
bfbarry commented Feb 28, 2023

I discovered this issue myself and stumbled upon this issue thread. Seeing mentions of memory usage in this thread, I tried this approach:

for i in range(loop):
    from sklearn.decomposition import PCA
    data = ...  # build the input array for this iteration
    output = PCA(...).fit(data)
    del PCA   # drop the class reference at the end of the iteration
    del data  # and release the data

Delete the data and PCA at the end of an iteration, and the code runs normally!

Does this mean there's a memory leak?

@jubbens
jubbens commented Jun 6, 2023

I am convinced that this is a threading problem, for two reasons. When the deadlock happens, it looks like all cores are used intermittently, with utilization fluctuating up and down constantly. Also, for me the problem occurs on the first call to .fit() if other multi-threaded code runs before the fit.

This is hard to diagnose because PCA has no explicit n_jobs parameter, so the parallelization must be happening somewhere out of view.
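The "out of view" parallelism comes from the native BLAS/OpenMP libraries loaded into the process. A sketch (not from the thread) to list them, using `threadpool_info()` from the threadpoolctl package already shown in the version reports above:

```python
from threadpoolctl import threadpool_info

import numpy as np  # importing numpy loads its BLAS into the process

# Each entry describes one native thread pool (BLAS or OpenMP runtime),
# matching the "threadpoolctl info" sections pasted in this thread.
for pool in threadpool_info():
    print(pool["user_api"], pool.get("internal_api"), pool["num_threads"])
```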

System:
    python: 3.9.16 (main, Jan 11 2023, 16:05:54)  [GCC 11.2.0]
executable: /home/jordan/miniconda3/envs/test/bin/python
   machine: Linux-5.19.0-43-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.2.1
          pip: 23.0.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /home/jordan/miniconda3/envs/test/lib/libmkl_rt.so.1
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 10
threading_layer: intel

       filepath: /home/jordan/miniconda3/envs/test/lib/libiomp5.so
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 20

       filepath: /home/jordan/miniconda3/envs/test/lib/libgomp.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 20

@jubbens
jubbens commented Jun 6, 2023

I can confirm that this is an issue with the threading backend: the sample provided runs normally if the number of threads is forced to one using environment variables.

The workaround is using threadpoolctl to force single threaded mode:

import numpy as np
import random
from threadpoolctl import threadpool_limits

np.random.seed(42)
random.seed(42)

from sklearn.decomposition import PCA


if __name__ == '__main__':
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)

    for i in range(20):
        print('Iteration', i)

        with threadpool_limits(limits=1):
            pca = PCA(n_components=260, random_state=42)
            pca = pca.fit(embeddings)

but I'm not sure whether this workaround holds if there is also multithreaded code elsewhere in the script.
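An alternative sketch (an assumption, not tested against this exact deadlock): cap the native thread pools via environment variables before numpy/scipy are imported, so the limit applies process-wide rather than only inside a context manager.

```python
import os
# These must be set before numpy/scipy load their native libraries.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP runtimes
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np
from sklearn.decomposition import PCA

# Dimensions reduced for brevity; the original report used (328039, 768).
X = np.random.RandomState(42).normal(size=(2000, 100)).astype(np.float32)
pca = PCA(n_components=30, random_state=42).fit(X)
print(pca.explained_variance_ratio_.shape)  # (30,)
```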

magamba pushed a commit to magamba/fastssl that referenced this issue Sep 28, 2023
…void sklearn issue #22434 (scikit-learn/scikit-learn#22434), whereupon PCA.fit() hangs due to conflicts with Pytorch multithreaded code.
@ogrisel
Member
ogrisel commented Oct 6, 2023

It sounds like a bug in the multithreading code of OpenBLAS or MKL. The PCA estimator (with its svd_solver="randomized" setting) does not use multithreading directly; it only does so via OpenBLAS or MKL.

Could someone who can reproduce the deadlock use faulthandler.dump_traceback_later to dump the state of the threads when the deadlock happens?

import numpy as np
import random
import faulthandler

np.random.seed(42)
random.seed(42)

from sklearn.decomposition import PCA


if __name__ == '__main__':
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)

    for i in range(20):
        print('Iteration', i)
        faulthandler.dump_traceback_later(10, exit=True)
        pca = PCA(svd_solver="randomized", n_components=260, random_state=42)
        pca = pca.fit(embeddings)
        faulthandler.cancel_dump_traceback_later()

@ogrisel
Member
ogrisel commented Oct 6, 2023

Note that for datasets of this size, the new solver being developed in #27491 should be much faster and use less memory.

magamba pushed a commit to magamba/fastssl that referenced this issue Mar 25, 2024
* Added dedicated batch-size for Jacobian computation, to avoid OOM errors.
* Added flag controlling the number of training samples used for computing the average Jacobian norm.
* Low-memory Jacobian computation code.
* Switching to Pytorch to estimate feature covariance eigenvalues, to avoid sklearn issue #22434 (scikit-learn/scikit-learn#22434), whereupon PCA.fit() hangs due to conflicts with Pytorch multithreaded code.
* CUDA svd computation does not support Half precision format, so the eigenspectrum of activations is estimated outside of the autocast context.
* Load pretrained features and compute Jacobian of the resulting model function, without linear probes.
* Added Rankme implementation.
* Computing alpha on the training set rather than the test set.
* Linear probes trained with lr decay schedule.
* ViT backbone, supporting ViT-S/16, as well as custom variants ViT-Small/16 and ViT-Tiny/16.
@bfbarry
bfbarry commented Sep 9, 2024

@ogrisel Is this issue resolved given the advent of #27491?

@ogrisel
Member
ogrisel commented Sep 10, 2024

The problem is still there if the user code is changed to call PCA with svd_solver="randomized" explicitly.

It would be interesting to find a reproducer that only involves numpy and scipy and report it upstream.
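One possible shape for such a reproducer (a sketch under the assumption that the hang lives in the BLAS/LAPACK threading paths): the randomized solver reduces to repeated GEMMs plus LU/QR factorizations, so looping over those primitives exercises the same native code without scikit-learn. Dimensions are reduced here; a real reproducer would use the original (328039, 768) shape.

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(42)
A = rng.normal(size=(4000, 300)).astype(np.float32)  # smaller stand-in
k = 100

for i in range(5):
    Q = rng.normal(size=(300, k)).astype(np.float32)
    for _ in range(4):  # LU-normalized power iterations, as in randomized SVD
        Q, _ = linalg.lu(A @ Q, permute_l=True)    # Q: (4000, k)
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)  # Q: (300, k)
    Q, _ = linalg.qr(A @ Q, mode="economic")       # orthonormal range basis
    B = Q.T @ A                                    # (k, 300) projected matrix
    U, s, Vt = linalg.svd(B, full_matrices=False)
    print("iteration", i, "top singular value:", float(s[0]))
```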
