The multi-threading issues on RandomForestClassifier · Issue #6023 · scikit-learn/scikit-learn · GitHub

Closed
xchmiao opened this issue Dec 14, 2015 · 13 comments

@xchmiao
xchmiao commented Dec 14, 2015

Hi,

I'm using RandomForestClassifier to train a model on Ubuntu 14.04 with Python 2.7.11 through the Anaconda package. Below is the core code:

rf = RandomForestClassifier(n_jobs=-1, random_state=seed)
parameters = {'n_estimators': [2000],
              'criterion': ['entropy'],
              'max_depth': [10],
              'min_samples_leaf': [3],
              #'oob_score': [False, True],
              'max_features': ['auto']}

print "Start parameter grid search..."
start = time()
clf = GridSearchCV(rf, parameters, n_jobs=4, scoring='roc_auc',
                   cv=StratifiedKFold(y_train, n_folds=4, shuffle=True, random_state=128),
                   verbose=2, refit=True)

clf.fit(x_train, y_train)

I turned on the CPU monitor to watch the CPU status on a quad-core system. In the beginning, all CPUs are at ~99% usage, but after about an hour CPU usage drops to ~0.3%, which does not seem normal.

Below is the output from the terminal:


Fitting 4 folds for each of 1 candidates, totalling 4 fits
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3 - 68.4min
[Parallel(n_jobs=4)]: Done 1 jobs | elapsed: 68.4min
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3 - 68.8min


Below is the status of CPU usage:


top - 01:40:46 up 4:40, 0 users, load average: 0.00, 0.00, 0.95
Tasks: 66 total, 1 running, 65 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 6553600 total, 4579708 used, 1973892 free, 0 buffers
KiB Swap: 6553600 total, 5511600 used, 1042000 free. 36652 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
994 root 20 0 23116 60 60 S 0.0 0.0 0:00.00 ptyserved
997 root 20 0 39372 0 0 S 0.0 0.0 0:00.01 nginx
1000 root 20 0 39876 924 536 S 0.0 0.0 0:01.75 nginx
1002 root 20 0 12736 0 0 S 0.0 0.0 0:00.00 getty
1004 root 20 0 12736 0 0 S 0.0 0.0 0:00.00 getty
1435 root 20 0 18144 24 24 S 0.0 0.0 0:00.00 bash
1455 root 20 0 59568 0 0 S 0.0 0.0 0:00.00 su
1456 root 20 0 18140 0 0 S 0.0 0.0 0:00.00 bash
1467 root 20 0 21916 800 504 R 0.0 0.0 0:12.72 top
1888 root 20 0 61316 4 4 S 0.0 0.0 0:00.00 sshd
1950 postfix 20 0 27408 272 184 S 0.0 0.0 0:00.05 qmgr
2062 root 20 0 59568 92 92 S 0.0 0.0 0:00.00 su
2063 root 20 0 18144 60 60 S 0.0 0.0 0:00.00 bash
2074 root 20 0 5861480 3260 888 S 0.0 0.0 1:26.11 python***
2087 root 20 0 8268200 2.020g 216 S 0.0 32.3 66:39.13 python***
2090 root 20 0 8268200 2.034g 1676 S 0.0 32.5 66:38.78 python***
2184 postfix 20 0 27356 524 240 S 0.0 0.0 0:00.00 pickup
2210 root 20 0 5861480 2836 104 S 0.0 0.0 0:00.00 python ****
2225 root 20 0 5861480 4040 820 S 0.0 0.1 0:00.00 python ****


The training data is only about 207 MB with 300 features, so the CPU usage drop still doesn't seem normal.

Do you know what is going on?

Thank you very much!

@olologin
Contributor

Could you provide your dataset?

@xchmiao
Author
xchmiao commented Dec 14, 2015

Hi @olologin,

My data is from a Kaggle contest (Homesite). You can download it from https://www.kaggle.com/c/homesite-quote-conversion/data

Btw, my machine has 4 CPUs and 6400 MB of RAM, and the OS is Ubuntu 14.04.

Thank you very much for your help.

@olologin
Contributor

Hmm, you didn't provide your complete code, but I'm guessing it looks like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
import pandas as pd
import numpy as np

df = pd.read_csv("train.csv")
y_train = np.asarray(df.pop("QuoteConversion_Flag"))
x_train = df.as_matrix()

del df

seed = 128
rf = RandomForestClassifier(n_jobs=-1, random_state=seed)
parameters = {'n_estimators': [2000],
              'criterion': ['entropy'],
              'max_depth': [10],
              'min_samples_leaf': [3],
              #'oob_score': [False, True],
              'max_features': ['auto']}

print("Start parameter grid search...")

cv_ = StratifiedKFold(y_train, 4, shuffle=True, random_state=seed)
clf = GridSearchCV(rf, parameters, n_jobs=4, scoring='roc_auc', cv=cv_, verbose=100, refit=True)

clf.fit(x_train, y_train)

I can reproduce it on Python 3 and the latest sklearn, but I bet it's not related to sklearn; it is more of an I/O problem. This dataset is too big in raw format, and I think it runs slowly (and can't fully load your CPUs) because of memory swapping in your OS, or because of joblib pickling/unpickling. My bet is on swapping: watch your RAM, and if it becomes full after a couple of minutes, that's it.
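
For example, here is a minimal sketch (assuming the third-party psutil package is installed) that you could run in a second terminal to see whether the machine starts swapping once the grid search workers spawn:

import time
import psutil  # third-party package, assumed installed

# Print RAM and swap usage every 5 seconds while the grid search runs.
for _ in range(60):
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print("RAM used: %.1f%%  swap used: %.1f%%" % (mem.percent, swap.percent))
    time.sleep(5)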

@xchmiao
Author
xchmiao commented Dec 15, 2015

Hi @olologin,

Thank you very much for your reply and assistance. That was a good guess at the rest of the code 👍, haha.
I just have one more question regarding the RAM. Right now I have 6400 MB of RAM; wouldn't that be enough for the training?

What capacity of RAM do you suggest? I could change the config of my cluster. Thank you very much!

@davidthaler
Contributor

@xchmiao ,
I see you are using parallelism at two levels: n_jobs=-1 on the RF and n_jobs=4 on the GridSearchCV. I think that sometimes causes problems. Does this problem still occur if you set n_jobs=1 in GridSearchCV?

@olologin
Contributor

@xchmiao

What capacity of RAM do you suggest?

Honestly, I don't know. In IPython I can see:

x_train   ndarray 260753x298: 77704394 elems, type object, 621635152 bytes (592.8374786376953 Mb)
y_train   ndarray 260753: 260753 elems, type int64, 2086024 bytes (1.9893875122070312 Mb)

But still, 6 GB of RAM is not enough :)

I see this in the detailed log:

usr/local/lib/python3.4/dist-packages/sklearn/cross_validation.py:42: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/usr/local/lib/python3.4/dist-packages/sklearn/grid_search.py:43: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Start parameter grid search...
Fitting 4 folds for each of 1 candidates, totalling 4 fits
Pickling array (shape=(260753, 298), dtype=object).
Memmaping (shape=(260753,), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140105433248352-0.pkl
Memmaping (shape=(195564,), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140103680497104-0.pkl
Pickling array (shape=(65189,), dtype=int64).
Pickling array (shape=(260753, 298), dtype=object).
[CV] criterion=entropy, max_features=auto, n_estimators=2000, max_depth=10, min_samples_leaf=3 
Memmaping (shape=(260753,), dtype=int64) to old file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140105433248352-0.pkl
Memmaping (shape=(195564,), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140103680498304-0.pkl
Pickling array (shape=(65189,), dtype=int64).
Pickling array (shape=(260753, 298), dtype=object).
Memmaping (shape=(260753,), dtype=int64) to old file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140105433248352-0.pkl
Memmaping (shape=(195565,), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140103680497744-0.pkl
[CV] criterion=entropy, max_features=auto, n_estimators=2000, max_depth=10, min_samples_leaf=3 
Pickling array (shape=(65188,), dtype=int64).
Pickling array (shape=(260753, 298), dtype=object).
Memmaping (shape=(260753,), dtype=int64) to old file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140105433248352-0.pkl
Memmaping (shape=(195566,), dtype=int64) to new file /dev/shm/joblib_memmaping_pool_5777_140105470744112/5777-140103680508592-140103680498624-0.pkl
Pickling array (shape=(65187,), dtype=int64).
[CV] criterion=entropy, max_features=auto, n_estimators=2000, max_depth=10, min_samples_leaf=3 
[CV] criterion=entropy, max_features=auto, n_estimators=2000, max_depth=10, min_samples_leaf=3 
multiprocessing.pool.RemoteTraceback: 
...

Pickling array (shape=(260753, 298), dtype=object).
This line tells us that joblib pickles the whole x_train matrix (it doesn't even memmap it) every time it spawns a new job, and as far as I can tell that is because x_train has dtype numpy.object.
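
As a quick sketch of the workaround (assuming the frame has already been fully label-encoded and is still in memory), converting the matrix to an explicit float dtype lets joblib memmap it and share it between the workers instead of re-pickling it for each one:

import numpy as np

# Assumes every column of df is already numeric after label encoding.
# as_matrix() on a mixed-dtype frame yields an object array, which joblib
# must pickle for each worker; a plain float array can be memmapped instead.
x_train = df.values.astype(np.float32)
print(x_train.dtype)  # float32, not object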

@xchmiao Did you apply some encoding to get rid of the text columns and the object dtype as well (i.e. is your x_train matrix of numeric type or object type)?

@xchmiao
Author
xchmiao commented Dec 15, 2015

@olologin

Yes, I applied the following code to convert all the object-type features to numeric type.

from sklearn import preprocessing

for col in df.columns:
    if df[col].dtype == 'object':
        print col
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df[col].values))
        df[col] = lbl.transform(list(df[col].values))
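
To double-check your question about the dtype, here is a quick sketch of a check I can add after that loop (using the same df as above):

# Confirm that no object columns remain and that the final matrix is numeric;
# a single leftover object column makes as_matrix() return an object array,
# which joblib cannot memmap.
print df.dtypes[df.dtypes == 'object']
x_train = df.as_matrix()
print x_train.dtype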

But would this affect how much RAM is required?

Also, when I check the RAM status once the CPU usage drops to ~0.3%, it is not full (see the output of the "top" command below):

KiB Mem: 6553600 total, 4579708 used, 1973892 free, 0 buffers
KiB Swap: 6553600 total, 5511600 used, 1042000 free. 36652 cached Mem

So I'm really confused...

@xchmiao
Author
xchmiao commented Dec 15, 2015

@davidthaler

Hi David,

Thank you very much for your comment. I tried your suggestion, but it doesn't solve the problem. I think it's more or less related to the insufficient RAM issue mentioned by Ganiev (@olologin). If I train with only 1000 observations, there is no problem.

But I would still appreciate it if anyone could give me some guidance on how much RAM is needed.

Thanks again.

@jmschrei
Member

I know that there are issues with nested multithreading, but I am not sure exactly what the issues are. Try doing cross-validation with a single job, and use all of your parallelism at the RF fitting level.

@amueller
Member

You should only use parallelism at one level, in this case probably the RandomForest.
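
In code, the suggested setup would look something like the following sketch (reusing seed, parameters, cv_, x_train, and y_train from the snippets above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

# Parallelism only inside the forest; a single worker process for the grid
# search itself (seed, parameters, cv_, x_train, y_train as defined above).
rf = RandomForestClassifier(n_jobs=-1, random_state=seed)
clf = GridSearchCV(rf, parameters, n_jobs=1, scoring='roc_auc',
                   cv=cv_, verbose=2, refit=True)
clf.fit(x_train, y_train)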

@hermidalc
Contributor

As of 0.19.2 this issue doesn't appear to be fixed. I encountered it not with GridSearchCV but with an RF wrapped in RFE. I get the exact same strange behavior: parallel CPU usage starts at 100% as it should, then steadily decreases to low numbers, while system CPU usage (shown in top on Linux) increases to 10-15% per core, which is not normal. This simple pipeline will reproduce the issue if you give it a large enough dataset to observe the behavior:

pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto',
                                      class_weight='balanced', n_jobs=-1),
               n_features_to_select=10))
])
pipe.fit(X, y)

@amueller
Member

@hermidalc please open a new issue with complete code to reproduce it and your system specs. Also, please upgrade to version 0.20.

@hermidalc
Contributor

Hi @amueller - will do, although I cannot upgrade to 0.20 yet since I use a dependency, scikit-survival, that needs 0.19.x. I filed a ticket on that project to add support for 0.20, but it might take a while.
