GridSearchCV Hanging for a simple default use · Issue #9746 · scikit-learn/scikit-learn · GitHub

GridSearchCV Hanging for a simple default use #9746

Closed

raamana opened this issue Sep 12, 2017 · 23 comments

@raamana
Contributor
raamana commented Sep 12, 2017

Following up on discussion in #5115 , I am unable to use GridSearchCV as it hangs there without throwing an error or warning.

Min code to reproduce:

from os.path import join as pjoin
import sys
import timeit

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rf = RandomForestClassifier(max_features=10, n_estimators=10, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

inner_cv = ShuffleSplit(n_splits=25, train_size=0.5)
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv)

cur_dir = '.'
train_data_mat = np.loadtxt(pjoin(cur_dir, 'JS_sklearn_test.csv'))
train_labels = np.genfromtxt(pjoin(cur_dir, 'labels_sklearn_test.txt'), dtype='int')

print(gs)

start = timeit.default_timer()
print(start)

try:
    gs.fit(train_data_mat, train_labels)
except:
    stop = timeit.default_timer()
    print(stop-start)
    sys.exit(1)

print(gs.best_score_)
print(gs.best_params_)

Quick run of the above script, with results and software config:

$ 19:03:59 miner min_example_gs >>  python test.py 
GridSearchCV(cv=ShuffleSplit(n_splits=25, random_state=None, test_size=0.1, train_size=0.5),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=10, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_leaf': range(1, 5, 2), 'max_features': range(1, 6, 2), 'n_estimators': range(50, 250, 50)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
5818976.563907812
^C151.3223996711895
$ 19:07:05 miner min_example_gs >>  python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
$ 19:07:09 miner min_example_gs >>  ipython
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 

IPython 5.3.0 -- An enhanced Interactive Python.

In [2]: import numpy

In [3]: numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

In [4]: import sklearn

In [5]: sklearn.__version__
Out[5]: '0.18.1'

I am unable to upload my files for some reason (bigger than 10 MB - 412 samples with dimensions 64620). I don't think size is causing the hangup, as I let it run overnight and it didn't even finish one GridSearchCV.fit call.

I will upload it elsewhere and post a link here soon, or try working on a subset of it to reduce its size.

@jnothman
Member
jnothman commented Sep 12, 2017 via email

@raamana
Contributor Author
raamana commented Sep 13, 2017

Thanks Joel. I know - the dataset is not very big. But when I just tried running with only the first 20 features (416x20) instead of the 64K features I have, it did run. That was with just a random forest classifier without any feature selection. When I tried passing a pipeline as the estimator to GridSearchCV, which is what I really need, it also worked (at k=20 features). Keep in mind, these are "it ran once" tests, not thorough unit tests. This suggests it might be a problem with the dimensionality? Do I need to control the memory allocation for these jobs?

@raamana
Contributor Author
raamana commented Sep 13, 2017

Have the memory requirements been profiled at different scales (# samples, dimensionality, # jobs, etc.)? It is probably not necessary for small scales (1,000s to 100,000s), but it would be good to know when the implementation starts acting up on typical desktops, so we better understand the upper limits and breaking combinations.

@jnothman
Member

Maybe dimensionality, maybe something more weird. 64k features might indeed be slow to process if they are dense.

@jnothman
Member

Firstly, can I suggest that you set GridSearchCV's verbose flag, so that you can see fits completing as things progress... or so that you can see that they don't progress.

Secondly, you can try running this for [100, 400, 1600, 6400, 25600, 65000] feature subsets and see how the time scales.
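
As a reference point, here is a minimal sketch of such a timing loop. The feature counts follow the list above; the random data, the 4-class labels, the reduced n_splits=3, and the grid (copied from the first snippet) are assumptions made here just to keep each run short and comparable:

import timeit
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rng = np.random.RandomState(0)
X_full = rng.rand(416, 65000)   # random stand-in for the real 416 x ~64k matrix
y = rng.randint(0, 4, 416)      # 4 classes, as in the report

param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

for n_features in [100, 400, 1600, 6400, 25600, 65000]:
    X = X_full[:, :n_features]
    gs = GridSearchCV(RandomForestClassifier(oob_score=True),
                      param_grid=param_grid,
                      cv=ShuffleSplit(n_splits=3, train_size=0.5),
                      verbose=1)
    start = timeit.default_timer()
    gs.fit(X, y)
    print('%d features: %.1f s' % (n_features, timeit.default_timer() - start))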

@raamana
Contributor Author
raamana commented Sep 13, 2017

Sure, I was planning to do that myself, although the other way around: start with the largest and reduce it until it works :)

@jnothman
Member
jnothman commented Sep 13, 2017 via email

@lesteve
Member
lesteve commented Sep 13, 2017

@raamana thanks for the snippet. Just a comment: the "stand-alone" part in "stand-alone snippet" is very important, which means that basically I can just copy and paste the snippet and quickly see if I can reproduce the same behaviour. Your snippet depends on csv files that are only on your computer ...

@raamana
Contributor Author
raamana commented Sep 13, 2017

Sure, but I wanted to reproduce the problem I had (which requires using my data), not something that can be simulated.

Joel, when trying to run the grid search for different dimensionalities, I used the following code:
https://gist.github.com/raamana/24a1c8ed0cc4d66944742ff96ec4c510

It seems to quit after the first iteration - without any logs or errors. I tried running it a few times and it is still quitting after the first iteration. Outputs from this script are: https://pastebin.com/pXz7JAP7

Will try to use simulated data to see if this is something to do with my data.

@jnothman
Member

default_timer() reports seconds, not milliseconds.

And I can't see any reason for it not to progress to a second iteration unless there was an error calling logging.info(log_msg) that you didn't see.
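
In other words, the value returned by default_timer() is a float number of seconds; a trivial sketch of reporting it with the right unit:

import timeit

start = timeit.default_timer()
sum(i * i for i in range(10 ** 6))          # stand-in for the gs.fit(...) call being timed
elapsed = timeit.default_timer() - start    # a float number of seconds, not milliseconds
print('took %.3f s (%.4f min)' % (elapsed, elapsed / 60))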

@raamana
Contributor Author
raamana commented Sep 13, 2017

It failed similarly even before I put in the logging.info statement. The log file doesn't show anything else:

gridsearch at dimensionality (416, 100) just done after 305.9148793462664 msecs.
 Best score: 0.5847619047619048
Best params: {'random_forest_clf__max_features': 1, 'random_forest_clf__min_samples_leaf': 3, 'random_forest_clf__n_estimators': 100}

@raamana
Contributor Author
raamana commented Sep 13, 2017

Same behaviour with simulated data, btw, using train_data_full = np.random.rand(416, 70000)

@raamana
Contributor Author
raamana commented Sep 13, 2017

Same behaviour even when I replace the pipeline (SelectKBest with mutual_info followed by random forest) with a plain random forest. It's faster (obviously), but still quits after the first iteration.
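
For context, the pipeline being swapped in and out presumably looks roughly like this (the step name random_forest_clf matches the best-params output quoted above; the feat_sel name, k=100, and the exact grid are assumptions for illustration):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Note: mutual_info_classif on tens of thousands of dense features is itself expensive.
pipe = Pipeline([('feat_sel', SelectKBest(score_func=mutual_info_classif, k=100)),
                 ('random_forest_clf', RandomForestClassifier(oob_score=True))])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {'random_forest_clf__min_samples_leaf': range(1, 5, 2),
              'random_forest_clf__max_features': range(1, 6, 2),
              'random_forest_clf__n_estimators': range(50, 250, 50)}

gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
                  cv=ShuffleSplit(n_splits=25, train_size=0.5), verbose=1)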

@raamana
Contributor Author
raamana commented Sep 13, 2017

Similar behaviour on my Mac, too (previous attempts were on CentOS).

@raamana
Contributor Author
raamana commented Sep 13, 2017

If it's a problem with the implementation, I can't be the only one reporting it? My guess is that it's much more likely the problem is with my setup. I will look into the tests for grid search.

@lesteve
Member
lesteve commented Sep 13, 2017

I slightly simplified your snippet, made it stand-alone with some random data, and reduced the number of splits from 25 to 3. It runs in ~50s on my laptop:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rf = RandomForestClassifier(max_features=10, n_estimators=10, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}

inner_cv = ShuffleSplit(n_splits=3, train_size=0.5)
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv,
                  verbose=100)

rng = np.random.RandomState(0)
train_data = rng.rand(416, 70000)
train_labels = rng.randint(0, 2, 416)

gs.fit(train_data, train_labels)

It keeps printing stuff on the console, which does not correspond to my definition of "hanging".

If I look at one of the outputs printed on the console:

Fitting 3 folds for each of 24 candidates, totalling 72 fits

With n_splits=25 I get the following output:

Fitting 25 folds for each of 24 candidates, totalling 600 fits

A back-of-the-envelope computation leads me to estimate that it would take this long for n_splits=25:

50s * 600 / 72 = 416s ~= 7 minutes

@lesteve
Member
lesteve commented Sep 14, 2017

I am going to close this one. It doesn't look to me like there is anything inherently wrong with scikit-learn. If you debug your problem further and spot a place where GridSearchCV or RandomForestClassifier is performing very badly, do feel free to reopen.

@lesteve lesteve closed this as completed Sep 14, 2017
@mjbommar
Contributor
mjbommar commented Sep 15, 2017

@raamana
Contributor Author
raamana commented Sep 15, 2017

Thanks Joel and lesteve.

Thanks @mjbommar for linking to related issues. I'll look into them and see if I can find any common sources of issues. My problem doesn't seem to be 100% reproducible.

@amueller
Member

Can you upload the data btw? That would help debugging (if anyone is interested).
Also, grid-searching over the number of estimators in random forests doesn't really make a lot of sense. Performance increases with more estimators, but with diminishing returns. So if you're willing to run 250 estimators, just set that number.
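
Concretely, that suggestion would look something like this (a sketch, not code from this thread):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Fix n_estimators at the largest value you are willing to pay for,
# and only search the parameters that genuinely trade off against each other.
rf = RandomForestClassifier(n_estimators=250, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2)}
gs = GridSearchCV(estimator=rf, param_grid=param_grid,
                  cv=ShuffleSplit(n_splits=25, train_size=0.5))

Dropping n_estimators also shrinks the grid from 24 to 6 candidates, i.e. 4x fewer fits.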

@amueller
Member

Is this a multi-class or multi-label problem? Have you monitored memory consumption? It could be that your RAM is filling up.
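
One simple way to watch that from inside the script is a sketch like the following; it assumes the third-party psutil package (not a scikit-learn dependency) and reuses the gs / data names from the snippet earlier in the thread:

import os
import psutil  # third-party package, not part of scikit-learn

def rss_gb():
    # Resident set size of the current process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print('RSS before fit: %.2f GB' % rss_gb())
gs.fit(train_data, train_labels)   # gs, train_data, train_labels as defined earlier
print('RSS after fit:  %.2f GB' % rss_gb())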

@raamana
Contributor Author
raamana commented Sep 16, 2017

This is a multi-class problem (4 classes). The size is large: 416 samples with 64K features, leading to 640 MB when exported using numpy.savetxt. I will try to make it smaller by exporting in binary, and look for appropriate places to upload big data files (any suggestions?)

Optimizing num_estimators wasn't the goal - I was just trying to see if a larger forest was causing issues (becoming too demanding in CPU or RAM), and if a smaller number of trees would reduce the hangups.

I will try to monitor the memory consumption, but I don't recall the computer being unresponsive; the rest of the apps were running fine.
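
For the binary export, numpy's native .npy format is the straightforward option. A sketch, assuming the train_data_mat array from the first snippet; the filename and the float32 cast are just illustrative:

import numpy as np

# np.save writes a compact binary .npy file: float64 at 416 x 64620 is roughly
# 215 MB versus ~640 MB for the np.savetxt text dump; casting to float32 halves that again.
np.save('JS_sklearn_test.npy', train_data_mat.astype(np.float32))
train_data_mat = np.load('JS_sklearn_test.npy')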

@lesteve
Member
lesteve commented Sep 18, 2017

> @lesteve , possibly related to our bootstrap=True / numpy.random notes?

This issue is with n_jobs=1 though, so not related.
