RandomForestClassifier parallel issues with CPU usage decreasing over run #12482

hermidalc · 2018-10-29T15:49:59Z

Description

Related or identical to issue #6023, which is closed but does not appear to be fixed as of 0.19.2. I encountered it not with GridSearchCV but with RFE wrapping a random forest. I get the same strange behavior: parallel CPU usage starts at 100%, as it should, and then steadily decreases to low numbers, while system CPU usage (as shown by top on Linux) increases to 10-15% per core, which is not normal. The fit also never finishes (or takes far too long if it ever does).

Steps/Code to Reproduce

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=3200, n_informative=100, n_redundant=3100, n_classes=2, n_clusters_per_class=30)

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), step=0.01, n_features_to_select=10))
])
pipe.fit(X, y)

Expected Results

Parallel CPU usage should stay at effectively 100% on n_jobs cores for each iteration of RFE, and the pipeline fit should complete in a reasonable time.

Actual Results

Parallel CPU usage starts at 100%, as it should, and then steadily decreases to low numbers, while system CPU usage (as shown by top on Linux) increases to 10-15% per core, which is not normal. The pipeline fit never finishes.

Versions

Linux-4.18.16-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.2

amueller · 2018-10-29T16:46:29Z

As I said, please try with current scikit-learn, i.e. 0.20.0. Also, please provide a self-contained example. Your code is missing the definition of X and y.

ogrisel · 2018-11-07T10:53:15Z

I tried with the following on today's scikit-learn master and everything looks fine:

>>> import numpy as np
>>> X = np.random.randn(int(1e6), 100)
>>> y = X[:, 0] > 0
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.feature_selection import RFE
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> pipe = Pipeline([
...     ('slr', StandardScaler()),
...     ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto', class_weight='balanced', n_jobs=-1), n_features_to_select=10))
... ])
>>> pipe.fit(X, y)

CPU usage is 100% green on 4 CPUs in htop. Memory usage is constant. I ran it only for a couple of minutes before interrupting.

With a smaller dataset (1e3 samples, 100 features), CPU usage cannot reach 100%, probably because individual trees are too fast to fit, so the scheduling overhead and the sequential segments of RFE are no longer negligible.
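To make the point about sequential segments concrete, here is a minimal sketch using Amdahl's law. The law is brought in here purely as an illustration and is not something measured in this thread; the function name and the parallel fractions below are hypothetical, not profiled values from RFE or RandomForestClassifier.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when only part of the work runs in parallel.

    parallel_fraction is the share of single-core runtime that can be
    spread over workers; the rest (scheduling, sequential RFE bookkeeping)
    is assumed to run on one core.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Illustrative fractions only (assumptions, not measurements).
for fraction in (0.99, 0.9, 0.5):
    speedup = amdahl_speedup(fraction, n_workers=4)
    # Mean utilization is the speedup divided by the number of workers.
    print(f"parallel fraction {fraction:.2f}: "
          f"speedup {speedup:.2f}x, mean utilization {speedup / 4:.0%}")
```

Under these assumptions, the smaller the share of time spent actually fitting trees, the further mean CPU utilization falls below 100%, even when nothing is wrong.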

hermidalc · 2018-11-10T20:11:21Z

@amueller @ogrisel your test dataset and code above also work properly in my environment. I've updated the OP to include a test dataset that results in the bad parallel behavior.

amueller · 2018-11-12T20:17:29Z

I get high CPU usage for your example as well. wrt never finish:
I expect each tree takes a short time to build, but with RFE in there, you're building 100 * 1000 trees. So it's not surprising that it takes a while, in particular without specifying max_depth.
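The "100 * 1000 trees" estimate can be checked with a small sketch. It assumes RFE turns a fractional step into int(max(1, step * n_features)) features removed per round, refits the estimator once per round, and does one final fit on the selected features; count_rfe_fits is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def count_rfe_fits(n_features: int, n_features_to_select: int, step) -> int:
    """Count estimator fits for an RFE run (sketch of the assumed loop)."""
    if 0.0 < step < 1.0:
        # Fractional step: a share of the ORIGINAL number of features.
        removed_per_round = int(max(1, step * n_features))
    else:
        removed_per_round = int(step)
    if removed_per_round <= 0:
        raise ValueError("step must be > 0")

    remaining = n_features
    n_fits = 0
    while remaining > n_features_to_select:
        n_fits += 1  # one fit to rank the remaining features
        # Never remove more features than needed to reach the target.
        remaining -= min(removed_per_round, remaining - n_features_to_select)
    return n_fits + 1  # plus the final fit on the selected features


# Values taken from the example in this issue.
fits = count_rfe_fits(n_features=3200, n_features_to_select=10, step=0.01)
print(f"{fits} forest fits, {fits * 1000} trees in total")
```

With the issue's settings this gives 32 features removed per round and 101 forest fits, which is in line with the rough figure of 100 * 1000 trees.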

amueller · 2018-11-12T20:22:35Z

Extrapolating from a run using a single core with 10 trees, this should take about 30 minutes to finish.

amueller · 2018-11-12T20:25:20Z

Looks like setting n_jobs=-1 is slightly slower here because the trees are relatively small: it would take 55 minutes with n_jobs=-1 instead of 30 minutes with n_jobs=1. I don't see a bug, though.
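A comparison like this one can be reproduced on a small scale with a sketch along the following lines. The dataset shape, the number of trees, and the helper name time_forest_fit are assumptions chosen so that it runs in seconds; they are not the settings used for the timings quoted above, and the outcome will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def time_forest_fit(X, y, n_jobs, n_estimators=50):
    """Return wall-clock seconds for one RandomForestClassifier fit."""
    forest = RandomForestClassifier(
        n_estimators=n_estimators, n_jobs=n_jobs, random_state=0
    )
    start = time.perf_counter()
    forest.fit(X, y)
    return time.perf_counter() - start


# Small synthetic problem (assumed sizes): each tree is cheap to build,
# so parallel overhead is a visible share of the total time.
X, y = make_classification(
    n_samples=500, n_features=200, n_informative=20, random_state=0
)

for n_jobs in (1, -1):
    elapsed = time_forest_fit(X, y, n_jobs=n_jobs)
    print(f"n_jobs={n_jobs}: {elapsed:.2f} s")
```

If the n_jobs=-1 run is not faster on a problem of this size, that matches the explanation that small trees leave little work to amortize the parallel overhead.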

hermidalc · 2018-11-13T02:11:50Z

> I get high CPU usage for your example as well. wrt never finish:
> I expect each tree takes a short time to build, but with RFE in there, you're building 100 * 1000 trees. So it's not surprising that it takes a while, in particular without specifying max_depth.

The behavior I'm seeing with my example is exactly what #6023 describes. If you look at the CPU usage, it starts at 100% for all cores like it should, but then steadily decreases to low numbers while system CPU usage (as shown by top on Linux) increases to 10-15% per core, which is not normal. All cores should pretty much stay at 100% the entire time, even if it takes longer than with n_jobs=1, no?

amueller · 2018-11-13T17:47:05Z

@hermidalc no, because the bottleneck is communication between the cores.

rajeshjnu2006 · 2019-12-12T00:40:14Z

After suffering from 100% CPU usage for many days, I came to this discussion. I just changed n_jobs=-1 to n_jobs=4 and it worked.
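For readers unsure what the two settings mean, here is a minimal sketch of how a negative n_jobs is conventionally resolved: -1 means all CPUs, and values below -1 mean n_cpus + 1 + n_jobs. The helper resolve_n_jobs is hypothetical and written only for this illustration; it is not a scikit-learn or joblib function.

```python
import os


def resolve_n_jobs(n_jobs: int, n_cpus=None) -> int:
    """Translate an n_jobs setting into a concrete worker count (sketch)."""
    if n_cpus is None:
        # os.cpu_count() can return None on some platforms.
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on; never below 1.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# On a hypothetical 16-CPU machine, -1 asks for 16 workers and 4 asks for 4.
for setting in (-1, -2, 4):
    print(f"n_jobs={setting} -> {resolve_n_jobs(setting, n_cpus=16)} workers")
```

On a machine with many cores, an explicit n_jobs=4 therefore starts far fewer workers than n_jobs=-1.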

thomasjpfan · 2022-04-22T19:09:59Z

I am closing this issue because the answer to the original issue can be found at #12482 (comment) and #12482 (comment)

cmarmo added Performance module:ensemble labels Feb 6, 2022

thomasjpfan closed this as completed Apr 22, 2022
