RandomForestClassifier parallel issues with CPU usage decreasing over run · Issue #12482 · scikit-learn/scikit-learn · GitHub

Closed
hermidalc opened this issue Oct 29, 2018 · 10 comments

@hermidalc
Contributor
hermidalc commented Oct 29, 2018

Description

Related or identical to issue #6023, which is closed but does not appear to be fixed as of 0.19.2. I encountered it not with GridSearchCV but with RFE wrapping RF. I see the same strange behavior: parallel CPU usage starts at 100% as it should, then steadily decreases to low numbers, while system CPU usage (as shown in top on Linux) increases to 10-15% per core, which is not normal. The fit also never finishes (or takes far too long if it ever does).

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=3200, n_informative=100, n_redundant=3100, n_classes=2, n_clusters_per_class=30)

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), step=0.01, n_features_to_select=10))
])
pipe.fit(X, y)

Expected Results

Parallel CPU usage should stay at effectively 100% on n_jobs cores for each iteration of RFE, and the pipeline fit should complete in a reasonable time.

Actual Results

Parallel CPU usage starts at 100% as it should, then steadily decreases to low numbers, while system CPU usage (as shown in top on Linux) increases to 10-15% per core, which is not normal. The pipeline fit never finishes.

Versions

Linux-4.18.16-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.2

@amueller
Member
amueller commented Oct 29, 2018

As I said, please try with current scikit-learn, i.e. 0.20.0. Also, please provide a self-contained example. Your code is missing the definition of X and y.

@ogrisel
Member
ogrisel commented Nov 7, 2018

I tried with the following on today's scikit-learn master and everything looks fine:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.randn(int(1e6), 100)
y = X[:, 0] > 0

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), n_features_to_select=10))
])
pipe.fit(X, y)

CPU usage is 100% green on 4 CPUs in htop. Memory usage is constant. I ran it only for a couple of minutes before interrupting.

With a smaller dataset (1e3 samples, 100 features), CPU usage cannot reach 100%, probably because individual trees are too fast to fit, so the scheduling overhead and the sequential segments of RFE are no longer negligible.
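A rough way to picture the effect of non-negligible sequential segments is Amdahl's law. The sketch below is only an idealized model, not something measured in this thread: the function names `amdahl_speedup` and `expected_utilization` are hypothetical, the serial fractions are illustrative, and scheduling overhead is lumped into the serial fraction.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Ideal speedup under Amdahl's law.

    serial_fraction is the share of the single-core runtime that cannot be
    parallelized (here: sequential RFE bookkeeping plus dispatch overhead).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


def expected_utilization(serial_fraction: float, n_workers: int) -> float:
    """Average fraction of each core kept busy in this idealized model."""
    return amdahl_speedup(serial_fraction, n_workers) / n_workers


if __name__ == "__main__":
    # Illustrative serial fractions only; none of these were measured.
    for fraction in (0.0, 0.1, 0.5):
        util = expected_utilization(fraction, n_workers=4)
        print(f"serial fraction {fraction:.1f}: ~{util:.0%} average utilization on 4 cores")
```

In this model, average per-core utilization falls as the serial share grows. That is consistent with very fast tree fits leaving the cores partly idle.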

@hermidalc
Contributor Author
hermidalc commented Nov 10, 2018

@amueller @ogrisel your test dataset and code above also work properly in my environment. I've updated the OP to include a test dataset that reproduces the bad parallel behavior.

@amueller
Member
amueller commented Nov 12, 2018

I get high CPU usage for your example as well. wrt never finish:
I expect each tree takes a short time to build, but with RFE in there, you're building 100 * 1000 trees. So it's not surprising that it takes a while, in particular without specifying max_depth.
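The "100 * 1000 trees" figure can be checked with a little arithmetic. The sketch below is a stand-alone approximation of how RFE turns a fractional `step` into a number of elimination rounds, based on my reading of scikit-learn's behavior rather than a call into the library. The helper name `count_rfe_rounds` is hypothetical.

```python
def count_rfe_rounds(n_features: int, n_features_to_select: int, step: float) -> int:
    """Count elimination rounds (one estimator fit each) for RFE-style pruning.

    Assumption: a step in (0, 1) is converted to int(max(1, step * n_features))
    features removed per round, and the last round removes only what is left.
    """
    if 0.0 < step < 1.0:
        removed_per_round = int(max(1, step * n_features))
    else:
        removed_per_round = int(step)
    if removed_per_round < 1:
        raise ValueError("step must be positive")

    remaining = n_features
    rounds = 0
    while remaining > n_features_to_select:
        remaining -= min(removed_per_round, remaining - n_features_to_select)
        rounds += 1
    return rounds


if __name__ == "__main__":
    # Values from the example in the original post.
    rounds = count_rfe_rounds(n_features=3200, n_features_to_select=10, step=0.01)
    n_estimators = 1000
    print(rounds, "rounds,", rounds * n_estimators, "trees in total")
```

With the values from the original post (3200 features, `step=0.01`, 10 features to select), this gives 100 rounds, so roughly 100,000 trees at `n_estimators=1000`. As far as I know, RFE also refits the estimator once on the selected features at the end; that fit is not counted here.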

@amueller
Member

Extrapolating from a run using a single core with 10 trees, this should take about 30 minutes to finish.
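This kind of extrapolation can be sketched as simple scaling. The helper below is hypothetical and assumes that fit time grows linearly with the number of trees and that every RFE round costs the same, which overstates later rounds since they see fewer features. The 0.18 s timing is a made-up placeholder chosen only so the result lands near 30 minutes; it is not a measurement from this thread.

```python
import math


def extrapolate_seconds(seconds_small_fit: float, small_trees: int,
                        full_trees: int, n_fits: int) -> float:
    """Scale a timed small forest fit up to the full job.

    Assumes linear cost in the number of trees and a constant cost per fit.
    """
    if small_trees < 1 or full_trees < 1 or n_fits < 0:
        raise ValueError("tree counts must be >= 1 and n_fits >= 0")
    return seconds_small_fit * (full_trees / small_trees) * n_fits


if __name__ == "__main__":
    # Placeholder timing (hypothetical): 0.18 s for one 10-tree fit on one core.
    total = extrapolate_seconds(0.18, small_trees=10, full_trees=1000, n_fits=100)
    print(f"estimated total: {total / 60:.0f} minutes")
    assert math.isclose(total, 1800.0)
```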

@amueller
Member

Looks like setting n_jobs=-1 is slightly slower here because the trees are relatively small: it would take 55 minutes with n_jobs=-1 instead of 30 minutes with n_jobs=1. I don't see a bug, though.

@hermidalc
Contributor Author

> I get high CPU usage for your example as well. wrt never finish:
> I expect each tree takes a short time to build, but with RFE in there, you're building 100 * 1000 trees. So it's not surprising that it takes a while, in particular without specifying max_depth.

The behavior I'm seeing with my example is exactly what #6023 describes. If you look at the CPU usage, it starts at 100% for all cores as it should, but then steadily decreases to low numbers while system CPU usage (as shown in top on Linux) increases to 10-15% per core, which is not normal. All cores should pretty much stay at 100% the entire time, even if it takes longer than with n_jobs=1, no?

@amueller
Member

@hermidalc no, because the bottleneck is communication between the cores.

@rajeshjnu2006
rajeshjnu2006 commented Dec 12, 2019

After suffering from 100% CPU usage for many days, I came to this discussion board. I just changed n_jobs=-1 to n_jobs=4, and it worked.

@thomasjpfan
Member

I am closing this issue because the answer to the original issue can be found at #12482 (comment) and #12482 (comment)
