PCA hangs occasionally when applied to data that fits in memory #22434
Comments
I cannot reproduce with the following system and package versions:
Huh, I'm actually able to reproduce this error semi-reliably. When it occurs still varies -- in one instance it started hanging on iteration 3, in another on iteration 7 -- but I think there's a genuine issue here. On average one iteration takes ~25 sec, but after letting the stuck fit run for a few minutes I noticed no significant change in my system's resource usage (memory or CPU). I have no clue what could be causing this, but I'll try to take a look. Here are my versions:
Here is a workaround
Interestingly, I've found that when PCA is fit many times in the same loop with few computations and disk I/O operations in between fits, I run into issues, but when I do a similar number of PCA fits with many more computations and I/O in between, it consistently does not hang. Not sure if this means the issue is in some C/Cython code or what -- I haven't looked -- but thought it was interesting and might help troubleshooting.
I discovered this issue myself and stumbled upon this issue thread. Seeing mentions of memory usage in this thread, I tried this approach:

```python
for i in range(loop):
    from sklearn.decomposition import PCA
    data = ...
    output = PCA(...).fit(data)
    del PCA
    del data
```

Delete the data and PCA at the end of each iteration, and the code runs normally! Does this mean there's a memory leak?
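One quick way to test the memory-leak hypothesis is to print the process's resident memory after each fit. A minimal sketch, assuming psutil is installed and using a smaller stand-in array:

```python
import numpy as np
import psutil
from sklearn.decomposition import PCA

proc = psutil.Process()  # current process
data = np.random.RandomState(42).normal(size=(100_000, 768)).astype(np.float32)

for i in range(10):
    PCA(n_components=260, random_state=42).fit(data)
    # Resident set size after each fit: a steady climb would suggest a leak,
    # while a flat line would point back at a threading/deadlock problem.
    print(f"iteration {i}: rss = {proc.memory_info().rss / 1e6:.1f} MB")
```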
I am convinced that this is a threading problem, for two reasons. When the deadlock happens, it looks like all cores are being used off and on, but they are fluctuating up and down constantly. Also, the problem occurs on the first call to [...]. This is hard to diagnose because there is no explicit [...].
I can confirm that this is an issue with the threading backend: the sample provided runs normally if the backend's number of threads is forced to one using environment variables. A workaround is to use threadpoolctl to force single-threaded mode:

```python
import numpy as np
import random
from threadpoolctl import threadpool_limits
np.random.seed(42)
random.seed(42)
from sklearn.decomposition import PCA

if __name__ == '__main__':
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)
    for i in range(20):
        print('Iteration', i)
        with threadpool_limits(limits=1):
            pca = PCA(n_components=260, random_state=42)
            pca = pca.fit(embeddings)
```

but I'm not sure if this solution will work if there is also multithreaded code somewhere else in the script.
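If the worry is that limiting every thread pool would interfere with other multithreaded parts of the script, a narrower variant (a sketch only, assuming the conflict really is in the BLAS layer) is to restrict the limit to the BLAS pools:

```python
from threadpoolctl import threadpool_limits
from sklearn.decomposition import PCA

# `embeddings` as defined in the snippet above.
# user_api="blas" limits only OpenBLAS/MKL thread pools; OpenMP pools and
# any other threading in the script keep their configured sizes.
with threadpool_limits(limits=1, user_api="blas"):
    pca = PCA(n_components=260, random_state=42).fit(embeddings)
```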
It sounds like a bug in the multithreading code of OpenBLAS or MKL. The PCA estimator (with its default parameters) does not manage threads explicitly itself; the multithreading happens inside the underlying BLAS calls.

Could someone that can reproduce the deadlock use faulthandler to dump the tracebacks of the threads at the time of the hang? For instance:

```python
import numpy as np
import random
import faulthandler
np.random.seed(42)
random.seed(42)
from sklearn.decomposition import PCA

if __name__ == '__main__':
    embeddings = np.random.RandomState(42).normal(size=(328039, 768))
    embeddings = embeddings.astype(np.float32)
    for i in range(20):
        print('Iteration', i)
        faulthandler.dump_traceback_later(10, exit=True)
        pca = PCA(svd_solver="randomized", n_components=260, random_state=42)
        pca = pca.fit(embeddings)
        faulthandler.cancel_dump_traceback_later()
```
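An earlier comment mentions that forcing the thread backends to a single thread via environment variables also avoids the hang. A minimal sketch of that variant (the set of variables is an assumption covering the common BLAS/OpenMP backends; they must be set before numpy is first imported):

```python
import os
# Must be set before numpy (and its BLAS) is imported for the first time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.RandomState(42).normal(size=(328039, 768)).astype(np.float32)
for i in range(20):
    print("Iteration", i)
    PCA(n_components=260, random_state=42).fit(embeddings)
```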
Note that for data sets of this size, the new solver being developed in #27491 should work much faster and use less memory.
Referenced in a commit elsewhere:

* Added dedicated batch size for Jacobian computation, to avoid OOM errors.
* Added a flag controlling the number of training samples used for computing the average Jacobian norm.
* Low-memory Jacobian computation code.
* Switching to PyTorch to estimate feature covariance eigenvalues, to avoid sklearn issue #22434 (scikit-learn/scikit-learn#22434), whereupon PCA.fit() hangs due to conflicts with PyTorch multithreaded code.
* CUDA SVD computation does not support half-precision format, so the eigenspectrum of activations is estimated outside of the autocast context.
* Load pretrained features and compute the Jacobian of the resulting model function, without linear probes.
* Added RankMe implementation.
* Computing alpha on the training set rather than the test set.
* Linear probes trained with an lr decay schedule.
* ViT backbone, supporting ViT-S/16, as well as custom variants ViT-Small/16 and ViT-Tiny/16.
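For reference, the kind of PyTorch-based replacement the commit above describes might look roughly like this. This is a sketch only; the shapes, centering step, and use of torch.linalg.eigvalsh are assumptions, not taken from that repository:

```python
import torch

# Stand-in for real network activations: (n_samples, n_features).
features = torch.randn(328039, 768)
features = features - features.mean(dim=0)            # center the features

# Eigenvalues of the feature covariance matrix, computed with torch
# instead of sklearn's PCA, so no OpenBLAS/MKL thread pool is involved.
cov = features.T @ features / (features.shape[0] - 1)
eigvals = torch.linalg.eigvalsh(cov)                   # ascending order
print(eigvals.flip(0)[:10])                            # ten largest eigenvalues
```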
The problem is still there if the user code is changed to call [...]. It would be interesting to find a reproducer that only involves numpy and scipy and report it upstream.
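A possible starting point for such a reproducer, mimicking the projection/QR power iterations that a randomized SVD performs internally (the exact sequence of operations is an assumption, not a copy of scikit-learn's implementation):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(42)
X = rng.normal(size=(328039, 768)).astype(np.float32)
k = 270  # n_components plus some oversampling

for i in range(20):
    print("Iteration", i)
    Q = X @ rng.normal(size=(768, k)).astype(np.float32)   # (n_samples, k)
    for _ in range(4):                                      # power iterations
        Q, _ = np.linalg.qr(X.T @ Q)                        # (n_features, k)
        Q, _ = np.linalg.qr(X @ Q)                          # (n_samples, k)
    B = Q.T @ X                                             # (k, n_features)
    U, s, Vt = linalg.svd(B, full_matrices=False)           # small dense SVD
```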
Describe the bug
When trying to fit PCA on a large array that fits in memory (approximately 300 000 x 750), aiming for roughly a 3x dimensionality reduction, the results are unstable: in most cases the fit succeeds, but occasionally it hangs "forever". The random seed is fixed, but unfortunately it does not help in this case.

The svd_solver option is left at the default. Explicitly specifying svd_solver='arpack' does not seem to cause this issue.

Steps/Code to Reproduce
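The original reproducer is not shown here; the following is a reconstruction based on the snippets quoted in the comments above (array of shape 328039 x 768, n_components=260):

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.RandomState(42).normal(size=(328039, 768)).astype(np.float32)

for i in range(20):
    print("Iteration", i)
    pca = PCA(n_components=260, random_state=42)
    pca.fit(embeddings)
```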
Expected Results
Stable results: consistently successful fits completing in finite time.
Actual Results
Sometimes the process hangs, in most cases within the first 10 iterations. I haven't found any consistent pattern in when it fails.
Versions
System:
    python: 3.9.7 | packaged by conda-forge
    machine: Windows-10-10.0.19043-SP0
    RAM: 32Gb

Python dependencies:
    pip: 21.3.1
    setuptools: 60.8.1
    sklearn: 1.0.1
    numpy: 1.20.1
    scipy: 1.7.1
    Cython: None
    pandas: None
    matplotlib: None
    joblib: 1.1.0
    threadpoolctl: 3.1.0

Built with OpenMP: True