GridSearchCV Hanging for a simple default use · Issue #9746 · scikit-learn/scikit-learn · GitHub

GridSearchCV Hanging for a simple default use #9746

Closed

raamana opened this issue Sep 12, 2017 · 23 comments

@raamana
Contributor
raamana commented Sep 12, 2017

Following up on discussion in #5115 , I am unable to use GridSearchCV as it hangs there without throwing an error or warning.

Min code to reproduce:

from os.path import join as pjoin
import sys
import timeit

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rf = RandomForestClassifier(max_features=10, n_estimators=10, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

inner_cv = ShuffleSplit(n_splits=25, train_size=0.5)
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv)

cur_dir = '.'
train_data_mat = np.loadtxt(pjoin(cur_dir, 'JS_sklearn_test.csv'))
train_labels = np.genfromtxt(pjoin(cur_dir, 'labels_sklearn_test.txt'), dtype='int')

print(gs)

start = timeit.default_timer()
print(start)

try:
    gs.fit(train_data_mat, train_labels)
except:
    stop = timeit.default_timer()
    print(stop-start)
    sys.exit(1)

print(gs.best_score_)
print(gs.best_params_)

Quick run of the above script, with results and software config:

$ 19:03:59 miner min_example_gs >>  python test.py 
GridSearchCV(cv=ShuffleSplit(n_splits=25, random_state=None, test_size=0.1, train_size=0.5),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=10, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_leaf': range(1, 5, 2), 'max_features': range(1, 6, 2), 'n_estimators': range(50, 250, 50)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
5818976.563907812
^C151.3223996711895
$ 19:07:05 miner min_example_gs >>  python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
$ 19:07:09 miner min_example_gs >>  ipython
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 

IPython 5.3.0 -- An enhanced Interactive Python.

In [2]: import numpy

In [3]: numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

In [4]: import sklearn

In [5]: sklearn.__version__
Out[5]: '0.18.1'

I am unable to upload my files for some reason (bigger than 10 MB - 412 samples with dimensions 64620). I don't think size is causing the hangup, as I let it run overnight and it didn't even finish one GridSearchCV.fit call.

I will upload it elsewhere and post a link here soon, or try working on a subset of it to reduce its size.

@jnothman
Member
jnothman commented Sep 12, 2017 via email

@raamana
Contributor Author
raamana commented Sep 13, 2017

Thanks Joel. I know - the dataset is not very big. But when I just tried running with only the first 20 features (416x20) instead of the 64K features I have, it did run. That was with just a random forest classifier without any feature selection. When I tried passing a pipeline as the estimator to GridSearchCV, which is what I really need, it also worked (at k=20 features). Keep in mind, these are "it ran once" tests, not thorough unit tests. This suggests it might be a problem with the dimensionality? Do I need to control the memory allocation for these jobs?

@raamana
Contributor Author
raamana commented Sep 13, 2017

Have the memory requirements been profiled at different scales (# samples, dimensionality, # jobs, etc.)? It is probably not necessary for small scales (1,000s to 100,000s), but it would be good to know when the implementation starts acting up on typical desktops, so we better understand the upper limits and breaking combinations.

@jnothman
Member

Maybe dimensionality, maybe something more weird. 64k features might indeed be slow to process if they are dense.

@jnothman
Member

Firstly, can I suggest that you set GridSearchCV's verbose flag, so that you can see fits completing as things progress... or so that you can see that they don't progress.

Secondly, you can try running this for [100, 400, 1600, 6400, 25600, 65000] feature subsets and see how the time scales.
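
As a reference point, here is a minimal sketch of such a timing loop. The feature counts follow the list above; the random data, the 4-class labels, the reduced n_splits=3, and the grid (copied from the first snippet) are assumptions made here just to keep each run short and comparable:

import timeit
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rng = np.random.RandomState(0)
X_full = rng.rand(416, 65000)   # random stand-in for the real 416 x ~64k matrix
y = rng.randint(0, 4, 416)      # 4 classes, as in the report

param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

for n_features in [100, 400, 1600, 6400, 25600, 65000]:
    X = X_full[:, :n_features]
    gs = GridSearchCV(RandomForestClassifier(oob_score=True),
                      param_grid=param_grid,
                      cv=ShuffleSplit(n_splits=3, train_size=0.5),
                      verbose=1)
    start = timeit.default_timer()
    gs.fit(X, y)
    print('%d features: %.1f s' % (n_features, timeit.default_timer() - start))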

@raamana
Contributor Author
raamana commented Sep 13, 2017

Sure, I was planning to do that myself, although the other way around: start with the largest and reduce it until it works :)

@jnothman
Member
jnothman commented Sep 13, 2017 via email

@lesteve
Member
lesteve commented Sep 13, 2017

@raamana thanks for the snippet. Just a comment: the "stand-alone" part in "stand-alone snippet" is very important, which means that basically I can just copy and paste the snippet and quickly see if I can reproduce the same behaviour. Your snippet depends on csv files that are only on your computer ...

@raamana
Contributor Author
raamana commented Sep 13, 2017

Sure, but I wanted to reproduce the problem I had (which requires using my data), not something that can be simulated.

Joel, when trying to run the grid search for different dimensionalities, I used the following code:
https://gist.github.com/raamana/24a1c8ed0cc4d66944742ff96ec4c510

It seems to quit after the first iteration - without any logs or errors. I tried running it a few times and it is still quitting after the first iteration. Outputs from this script are: https://pastebin.com/pXz7JAP7

Will try to use simulated data to see if this is something to do with my data.

@jnothman
Member

default_timer() reports seconds, not milliseconds.

And I can't see any reason for it not to progress to a second iteration unless there was an error calling logging.info(log_msg) that you didn't see.
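
In other words, the value returned by default_timer() is a float number of seconds; a trivial sketch of reporting it with the right unit:

import timeit

start = timeit.default_timer()
sum(i * i for i in range(10 ** 6))          # stand-in for the gs.fit(...) call being timed
elapsed = timeit.default_timer() - start    # a float number of seconds, not milliseconds
print('took %.3f s (%.4f min)' % (elapsed, elapsed / 60))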

@raamana
Contributor Author
raamana commented Sep 13, 2017

It failed similarly even before I put in the logging.info statement. The log file doesn't show anything else:

gridsearch at dimensionality (416, 100) just done after 305.9148793462664 msecs.
 Best score: 0.5847619047619048
Best params: {'random_forest_clf__max_features': 1, 'random_forest_clf__min_samples_leaf': 3, 'random_forest_clf__n_estimators': 100}

@raamana
Contributor Author
raamana commented Sep 13, 2017

Same behaviour with simulated data, btw, using train_data_full = np.random.rand(416, 70000)

@raamana
Contributor Author
raamana commented Sep 13, 2017

Same behaviour even when I replace the pipeline (SelectKBest with mutual_info followed by random forest) with a plain random forest. It's faster (obviously), but still quits after the first iteration.
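
For context, the pipeline being swapped in and out presumably looks roughly like this (the step name random_forest_clf matches the best-params output quoted above; the feat_sel name, k=100, and the exact grid are assumptions for illustration):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Note: mutual_info_classif on tens of thousands of dense features is itself expensive.
pipe = Pipeline([('feat_sel', SelectKBest(score_func=mutual_info_classif, k=100)),
                 ('random_forest_clf', RandomForestClassifier(oob_score=True))])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {'random_forest_clf__min_samples_leaf': range(1, 5, 2),
              'random_forest_clf__max_features': range(1, 6, 2),
              'random_forest_clf__n_estimators': range(50, 250, 50)}

gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
                  cv=ShuffleSplit(n_splits=25, train_size=0.5), verbose=1)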

@raamana
Contributor Author
raamana commented Sep 13, 2017

Similar behaviour on my Mac, too (previous attempts were on CentOS).

@raamana
Contributor Author
raamana commented Sep 13, 2017

If it's a problem with the implementation, I can't be the only one reporting it? My guess is that it's much more likely the problem is with my setup. I will look into the tests for grid search.

@lesteve
Member
lesteve commented Sep 13, 2017

I slightly simplified your snippet, made it stand-alone with some random data, and reduced the number of splits from 25 to 3. It runs in ~50s on my laptop:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rf = RandomForestClassifier(max_features=10, n_estimators=10, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

inner_cv = ShuffleSplit(n_splits=3, train_size=0.5)
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv,
                  verbose=100)

rng = np.random.RandomState(0)
train_data = rng.rand(416, 70000)
train_labels = rng.randint(0, 2, 416)

gs.fit(train_data, train_labels)

It keeps printing stuff on the console, which does not correspond to my definition of "hanging".

If I look at one of the outputs printed on the console:

Fitting 3 folds for each of 24 candidates, totalling 72 fits

With n_splits=25 I get the following output:

Fitting 25 folds for each of 24 candidates, totalling 600 fits

A back-of-the-envelope computation leads me to estimate that it would take this long for n_splits=25:

50s * 600 / 72 = 416s ~= 7 minutes

@lesteve
Member
lesteve commented Sep 14, 2017

I am going to close this one. It doesn't look to me like there is anything inherently wrong with scikit-learn. If you debug your problem further and spot a place where GridSearchCV or RandomForestClassifier is performing very badly, do feel free to reopen.

@lesteve lesteve closed this as completed Sep 14, 2017
@mjbommar
Contributor
mjbommar commented Sep 15, 2017

@raamana
Contributor Author
raamana commented Sep 15, 2017

Thanks Joel and lesteve.

Thanks @mjbommar for linking to related issues. I'll look into them and see if I can find any common sources of issues. My problem doesn't seem to be 100% reproducible.

@amueller
Member

Can you upload the data btw? That would help debugging (if anyone is interested).
Also, grid-searching over the number of estimators in random forests doesn't really make a lot of sense. Performance increases with more estimators, but with diminishing returns. So if you're willing to run 250 estimators, just set that number.
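
Concretely, that suggestion would look something like this (a sketch, not code from this thread):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Fix n_estimators at the largest value you are willing to pay for,
# and only search the parameters that genuinely trade off against each other.
rf = RandomForestClassifier(n_estimators=250, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2)}
gs = GridSearchCV(estimator=rf, param_grid=param_grid,
                  cv=ShuffleSplit(n_splits=25, train_size=0.5))

Dropping n_estimators also shrinks the grid from 24 to 6 candidates, i.e. 4x fewer fits.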

@amueller
Member

Is this a multi-class or multi-label problem? Have you monitored memory consumption? It could be that your RAM is filling up.
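
One simple way to watch that from inside the script is a sketch like the following; it assumes the third-party psutil package (not a scikit-learn dependency) and reuses the gs / data names from the snippet earlier in the thread:

import os
import psutil  # third-party package, not part of scikit-learn

def rss_gb():
    # Resident set size of the current process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print('RSS before fit: %.2f GB' % rss_gb())
gs.fit(train_data, train_labels)   # gs, train_data, train_labels as defined earlier
print('RSS after fit:  %.2f GB' % rss_gb())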

@raamana
Contributor Author
raamana commented Sep 16, 2017

This is a multi-class problem (4 classes). The size is large: 416 samples with 64K features, leading to 640 MB when exported using numpy.savetxt. I will try to make it smaller by exporting in binary, and look for appropriate places to upload big data files (any suggestions?)

Optimizing num_estimators wasn't the goal - I was just trying to see if a larger forest was causing issues (becoming too demanding in CPU or RAM), and if a smaller number of trees would reduce the hangups.

I will try to monitor the memory consumption, but I don't recall the computer being unresponsive; the rest of the apps were running fine.
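
For the binary export, numpy's native .npy format is the straightforward option. A sketch, assuming the train_data_mat array from the first snippet; the filename and the float32 cast are just illustrative:

import numpy as np

# np.save writes a compact binary .npy file: float64 at 416 x 64620 is roughly
# 215 MB versus ~640 MB for the np.savetxt text dump; casting to float32 halves that again.
np.save('JS_sklearn_test.npy', train_data_mat.astype(np.float32))
train_data_mat = np.load('JS_sklearn_test.npy')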

@lesteve
Member
lesteve commented Sep 18, 2017

> @lesteve , possibly related to our bootstrap=True / numpy.random notes?

This issue is with n_jobs=1 though, so not related.
