The multi-threading issues on RandomForestClassifier #6023
Comments
Could you provide your dataset?
Hi @olologin, My data is from a Kaggle contest (Homesite). You can download it from https://www.kaggle.com/c/homesite-quote-conversion/data. By the way, my computer has 4 CPUs and 6400 MB of RAM, and the OS is Ubuntu 14.04. Thank you very much for your help.
Hmm, you didn't provide your code, but I'm guessing that it looks like:
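Presumably something along these lines; this is a reconstruction sketch based on the code in the issue description below, with n_jobs set on both the grid search and the forest (the exact values are assumptions):

# Reconstruction sketch (assumed): nested parallelism, n_jobs on both
# GridSearchCV and RandomForestClassifier; sklearn < 0.18 module layout.
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

rf = RandomForestClassifier(n_estimators=2000, n_jobs=-1)   # forest-level parallelism
clf = GridSearchCV(rf, {'max_depth': [10]}, n_jobs=4,       # CV-level parallelism on top
                   scoring='roc_auc', cv=4)
clf.fit(x_train, y_train)                                   # x_train, y_train: the Homesite training data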
I can reproduce it on Python 3 and the latest sklearn, but I bet it's not related to sklearn, because it's a sort of IO problem: this dataset is too big in raw format, and I think it's slow (and can't fully load your CPU) because of memory swapping in your OS, or because of joblib pickling/unpickling. But I think swapping causes it. Look at your RAM: if it becomes full after a couple of minutes, that's it.
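One way to check the swapping hypothesis is to watch RAM and swap usage while the fit runs. A minimal sketch, assuming psutil is installed (psutil is not mentioned in the original thread):

import time
import psutil

# Poll memory and swap every few seconds while the grid search runs in
# another process; if swap usage climbs as the fit slows down, the slowdown
# is likely caused by swapping rather than by sklearn itself.
while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print("RAM used: %.1f%%  swap used: %.1f%%" % (mem.percent, swap.percent))
    time.sleep(5)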
Hi @olologin, Thank you very much for your reply and assistance. You made a good guess at the missing code 👍, haha. What RAM capacity do you suggest? I could change the config of my cluster. Thank you very much!
@xchmiao,
Honestly, I don't know. In IPython I can see: I see this in the detailed log:
@xchmiao Did you apply some encoding to get rid of those text columns and the object datatype as well (i.e., is your x_train matrix of numeric type or object type)?
Yes, I applied the following code to convert all the object-type features to numeric type:
for col in df.columns:
But will this affect the required RAM capacity? Also, when I check the RAM status once the CPU usage drops to ~0.3%, it is not full (see the status below from the command "top"):
KiB Mem: 6553600 total, 4579708 used, 1973892 free, 0 buffers
So I'm really confused...
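The loop body is not part of the thread; a typical way to finish that kind of conversion (an assumed sketch using LabelEncoder, not necessarily what was actually used) would be:

from sklearn.preprocessing import LabelEncoder

# Encode every object-typed column into integer codes so the matrix is
# fully numeric before it is handed to the forest.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))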
Hi David, Thank you very much for your comment. I tried your suggestion, but it doesn't solve the problem. I think it is more or less related to the insufficient RAM capacity issue mentioned by @olologin. If I train with only 1000 observations, there is no problem. But I would still appreciate it if anyone could give me some guidance on the RAM capacity needed. Thanks again.
I know that there are issues with nested multithreading, but I am unaware of what the specific issues are. Try doing cross-validation with a single job, and use all your processing at the RF fitting level.
you should only use parallelism on one level, in this case probably the
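A minimal sketch of parallelism at a single level only, keeping the forest parallel and the grid search sequential (parameter values are placeholders; sklearn < 0.18 module layout, matching the issue code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

# Parallelize only the tree building; keep the CV loop sequential so two
# levels of joblib workers do not compete for the same cores.
rf = RandomForestClassifier(n_estimators=2000, n_jobs=-1)
clf = GridSearchCV(rf, {'max_depth': [10]}, n_jobs=1,
                   scoring='roc_auc', cv=4)
clf.fit(x_train, y_train)   # x_train, y_train as in the issue code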
As of 0.19.2 this issue doesn't appear to be fixed. I encountered it not with GridSearchCV but with RF wrapped in RFE. I get the exact same strange behavior where parallel CPU usage starts at 100%, as it should, and then steadily decreases to low numbers, while system CPU usage (shown in top on Linux) increases to 10-15% per core, which is not normal. This simple pipeline will reproduce the issue if you give it a large enough dataset:
pipe = Pipeline([
    ('slr', StandardScaler()),
    ('fs', RFE(RandomForestClassifier(n_estimators=1000, max_features='auto',
                                      class_weight='balanced', n_jobs=-1),
               n_features_to_select=10))
])
pipe.fit(X, y)
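The snippet assumes the usual scikit-learn imports and a sufficiently large X and y; a hypothetical setup (assumed, not part of the original comment) that feeds the pipeline above could be:

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a "large enough" dataset; deliberately heavy so the
# gradual CPU-usage drop has time to appear during the RFE eliminations.
X, y = make_classification(n_samples=50000, n_features=300, random_state=0)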
@hermidalc please open a new issue with complete code to reproduce and your system specs. Also, please upgrade to version 0.20.
Hi @amueller, will do, although I cannot upgrade to 0.20 yet since I use a dependency, scikit-survival, that needs 0.19.x. I put in a ticket on that project to add support for 0.20, but it might take a while.
Hi,
I'm using RandomForestClassifier to train a model on Ubuntu 14.04 with Python 2.7.11 through the Anaconda distribution. Below is the core code:
from time import time
# sklearn < 0.18 module layout
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

rf = RandomForestClassifier(n_jobs=-1, random_state=seed)
parameters = {'n_estimators': [2000],
              'criterion': ['entropy'],
              'max_depth': [10],
              'min_samples_leaf': [3],
              #'oob_score': [False, True],
              'max_features': ['auto']}

print "Start parameter grid search..."
start = time()
clf = GridSearchCV(rf, parameters, n_jobs=4, scoring='roc_auc',
                   cv=StratifiedKFold(y_train, n_folds=4, shuffle=True, random_state=128),
                   verbose=2, refit=True)
clf.fit(x_train, y_train)
I turned on the CPU monitor to watch the CPU status on a quad-core system. In the beginning, all CPUs are at ~99% usage; however, after about an hour, CPU usage drops to ~0.3%, which does not seem normal.
Below is the output from the terminal:
Fitting 4 folds for each of 1 candidates, totalling 4 fits
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3 - 68.4min
[Parallel(n_jobs=4)]: Done 1 jobs | elapsed: 68.4min
[CV] max_features=auto, n_estimators=2000, criterion=entropy, max_depth=10, min_samples_leaf=3 - 68.8min
Below is the status of CPU usage:
top - 01:40:46 up 4:40, 0 users, load average: 0.00, 0.00, 0.95
Tasks: 66 total, 1 running, 65 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 6553600 total, 4579708 used, 1973892 free, 0 buffers
KiB Swap: 6553600 total, 5511600 used, 1042000 free. 36652 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
994 root 20 0 23116 60 60 S 0.0 0.0 0:00.00 ptyserved
997 root 20 0 39372 0 0 S 0.0 0.0 0:00.01 nginx
1000 root 20 0 39876 924 536 S 0.0 0.0 0:01.75 nginx
1002 root 20 0 12736 0 0 S 0.0 0.0 0:00.00 getty
1004 root 20 0 12736 0 0 S 0.0 0.0 0:00.00 getty
1435 root 20 0 18144 24 24 S 0.0 0.0 0:00.00 bash
1455 root 20 0 59568 0 0 S 0.0 0.0 0:00.00 su
1456 root 20 0 18140 0 0 S 0.0 0.0 0:00.00 bash
1467 root 20 0 21916 800 504 R 0.0 0.0 0:12.72 top
1888 root 20 0 61316 4 4 S 0.0 0.0 0:00.00 sshd
1950 postfix 20 0 27408 272 184 S 0.0 0.0 0:00.05 qmgr
2062 root 20 0 59568 92 92 S 0.0 0.0 0:00.00 su
2063 root 20 0 18144 60 60 S 0.0 0.0 0:00.00 bash
2074 root 20 0 5861480 3260 888 S 0.0 0.0 1:26.11 python***
2087 root 20 0 8268200 2.020g 216 S 0.0 32.3 66:39.13 python***
2090 root 20 0 8268200 2.034g 1676 S 0.0 32.5 66:38.78 python***
2184 postfix 20 0 27356 524 240 S 0.0 0.0 0:00.00 pickup
2210 root 20 0 5861480 2836 104 S 0.0 0.0 0:00.00 python ****
2225 root 20 0 5861480 4040 820 S 0.0 0.1 0:00.00 python ****
Although the training data is only about 207 MB with 300 features, the drop in CPU usage does not seem normal.
Do you know what is going on?
Thank you very much!