RandomForestClassifier parallel issues with CPU usage decreasing over run #12482
Comments
As I said, please try with current scikit-learn, i.e. 0.20.0. Also, please provide a self-contained example: your code is missing the definitions of X and y.
I tried the following on today's scikit-learn master and everything looks fine:

```python
>>> import numpy as np
>>> X = np.random.randn(int(1e6), 100)
>>> y = X[:, 0] > 0
>>> from sklearn.ensemble import RandomForestClassifier
... from sklearn.feature_selection import RFE
... from sklearn.pipeline import Pipeline
... from sklearn.preprocessing import StandardScaler
...
... pipe = Pipeline([
...     ('slr', StandardScaler()),
...     ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), n_features_to_select=10))
... ])
... pipe.fit(X, y)
```

CPU usage is 100% green on 4 CPUs in htop. Memory usage is constant. I ran it only for a couple of minutes before interrupting. With a smaller dataset (1e3 samples, 100 features), CPU usage cannot reach 100%, probably because individual trees are too fast to fit, so the scheduling overhead and the sequential segments of RFE are no longer negligible.
I get high CPU usage for your example as well. Regarding "never finishes":

Extrapolating from a run on a single core with 10 trees, this should take about 30 minutes to finish.

Setting n_jobs=-1 looks slightly slower here because the trees are relatively small: it would take about 55 minutes with n_jobs=-1 instead of 30 minutes with n_jobs=1. I don't see a bug, though.
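The overhead described above can be observed directly by timing the same fit with both settings. This is a minimal sketch, assuming a deliberately small problem where per-tree work is tiny (the sizes and tree count here are illustrative, not the reporter's data); on such data n_jobs=-1 may well come out slower than n_jobs=1 because of parallel dispatch overhead:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem: each tree fits in well under a millisecond,
# so process/thread dispatch overhead is not amortized.
rng = np.random.RandomState(0)
X = rng.randn(300, 10)
y = (X[:, 0] > 0).astype(int)

timings = {}
for n_jobs in (1, -1):
    clf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs,
                                 random_state=0)
    t0 = time.perf_counter()
    clf.fit(X, y)
    timings[n_jobs] = time.perf_counter() - t0

print(timings)  # exact values (and which setting wins) vary by machine
```

Which setting is faster depends on the machine and the joblib backend; the point is only that -1 is not automatically the fastest choice for small trees.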
The behavior I'm seeing is exactly what #6023 described, with my example. If you look at the CPU usage, it starts at 100% for all cores as it should, but then steadily decreases to low numbers while system CPU usage (shown in top on Linux) increases to 10-15% per core, which is not normal. All cores should stay at pretty much 100% the entire time, even if it takes longer than with n_jobs=1, no?
@hermidalc no because the bottleneck is communication between the cores. |
After suffering from 100% CPU usage for many days, I came to this discussion board. I just changed n_jobs=-1 to n_jobs=4, and it worked.
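The workaround above amounts to capping the worker count at a fixed number rather than using -1 (all cores). A minimal sketch, with the cap of 4 and the `os.cpu_count()` fallback being my own assumptions for portability:

```python
import os

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Cap at 4 workers (as the commenter did), but never exceed the
# number of cores actually available on the machine.
n_jobs = min(4, os.cpu_count() or 1)

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, n_jobs=n_jobs,
                             random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```

A fixed cap avoids oversubscription when the estimator is nested inside another parallel construct (e.g. RFE or a grid search) that may spawn workers of its own.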
I am closing this issue because the answer to the original issue can be found at #12482 (comment) and #12482 (comment) |
Description
Related or identical to issue #6023, but as of 0.19.2 it seems not to be fixed even though that issue is closed. I encountered it not with GridSearchCV but with RFE wrapping a random forest. I get the exact same strange behavior: parallel CPU usage starts at 100% as it should, then steadily decreases to low numbers while system CPU usage (shown in top on Linux) increases to 10-15% per core, which is not normal. The fit also never finishes (or takes far too long if it ever does).
Steps/Code to Reproduce
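The reproduction code did not survive in this copy of the report. Below is a hedged, heavily scaled-down sketch of the setup as described (RFE wrapping RandomForestClassifier with n_jobs=-1 inside a Pipeline); the data shape, tree count, and random seed are my assumptions chosen so it runs in seconds, not the reporter's actual sizes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: feature 0 fully determines the label.
rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(
        RandomForestClassifier(n_estimators=50,
                               class_weight='balanced',
                               n_jobs=-1,
                               random_state=0),
        n_features_to_select=5)),
])
pipe.fit(X, y)

n_selected = int(pipe.named_steps['fs'].support_.sum())
print(n_selected)  # 5
```

At the reported scale (1e6 samples, 1000 trees, RFE eliminating one feature per step), the same structure runs for a very long time, which is where the CPU-usage decay was observed.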
Expected Results
Parallel CPU usage to be effectively 100% on number of cores = n_jobs for each iteration of RFE and for the pipeline fit to complete in a normal time.
Actual Results
Parallel CPU usage starts at 100% as it should, then steadily decreases to low numbers while system CPU usage (shown in top on Linux) increases to 10-15% per core, which is not normal. The pipeline fit never finishes.
Versions
Linux-4.18.16-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.2