GridSearchCV Hanging for a simple default use #9746
Sounds very strange. The dataset size doesn't sound like the problem, and you're only searching 24 parameter combinations as far as I can see (note that range uses exclusive upper bounds). Do you have the same problem for random data of that shape?
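For concreteness, the 24 comes from range's exclusive upper bounds; a quick sanity check with sklearn's ParameterGrid (using the parameter grid that appears in the snippet later in the thread) confirms it:

```python
from sklearn.model_selection import ParameterGrid

# range() excludes its upper bound, so each axis is shorter than it looks:
param_grid = {'min_samples_leaf': range(1, 5, 2),   # 1, 3        -> 2 values
              'max_features': range(1, 6, 2),       # 1, 3, 5     -> 3 values
              'n_estimators': range(50, 250, 50)}   # 50..200     -> 4 values

print(len(ParameterGrid(param_grid)))  # 2 * 3 * 4 = 24 candidates
```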
Thanks Joel. I know - the dataset is not very big. But when I just tried running with only the first 20 features (416x20) instead of the 64K features I have, it did run. That was with just the random forest classifier, without any feature selection. When I tried inputting a pipeline as the estimator input to GridSearchCV, which is what I really need, it also worked (at k=20 features). Keep in mind, these are "it ran once" tests, not thorough unit testing. This suggests it might be a problem with the dimensionality? Do I need to control the memory allocation for these jobs?
Have the memory requirements been profiled under different scales (# samples, dimensionality, # jobs, etc.)? It probably is not necessary at small scales (1000s to 100,000s), but it would be good to know when the implementation starts acting up on typical desktops, so we better understand the upper limits and breaking combinations.
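As a rough sketch of the kind of estimate that real profiling would refine (my own arithmetic, using the dataset shape reported in this thread; note each joblib worker may hold its own copy of the data):

```python
# Back-of-envelope memory for the dense design matrix discussed above.
n_samples, n_features = 416, 64620
bytes_per_float = 8  # float64

size_mb = n_samples * n_features * bytes_per_float / 1024 ** 2
print(round(size_mb), 'MB per copy of the data')  # ~205 MB
```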
Maybe dimensionality, maybe something more weird. 64k features might indeed be slow to process if they are dense. |
Firstly, can I suggest that you set GridSearchCV's verbose flag, so that you can see its progress. Secondly, you can try running this for [100, 400, 1600, 6400, 25600, 65000] feature subsets, and seeing how the time scales.
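A sketch of that scaling experiment (random data stands in for the original CSVs, and the deliberately tiny grid here is an illustration, not the grid from the issue):

```python
import numpy as np
from timeit import default_timer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rng = np.random.RandomState(0)
X_full = rng.rand(416, 65000)
y = rng.randint(0, 2, 416)

# Keep the grid tiny so each timing point finishes quickly.
gs = GridSearchCV(RandomForestClassifier(n_estimators=10),
                  param_grid={'max_features': [1, 3]},
                  cv=ShuffleSplit(n_splits=2, train_size=0.5))

for n_features in [100, 400, 1600]:  # extend to 6400, 25600, 65000 as patience allows
    start = default_timer()
    gs.fit(X_full[:, :n_features], y)
    print(n_features, 'features:', default_timer() - start, 'seconds')
```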
Sure, I was planning to do that myself, although the other way, start largest and reduce it until it works :) |
Yes, but if you get a curve from the first few points and extrapolation shows that you'll need weeks to complete...
@raamana thanks for the snippet. Just a comment: the "stand-alone" part in "stand-alone snippet" is very important; it means that I can just copy and paste the snippet and quickly see if I can reproduce the same behaviour. Your snippet depends on CSV files that are only on your computer ...
Sure, I wanted to reproduce the problem I had (requiring me to use my data), not something that can be simulated. Joel, when trying to run the grid search for different dimensionalities, I used the following code: It seems to quit after the first iteration, without any logs or errors. I tried running it a few times and it is still quitting after the first iteration. Outputs from this script are: https://pastebin.com/pXz7JAP7 Will try to use simulated data to see if this is something to do with my data.
default_timer() reports seconds, not milliseconds. And I can't see any reason for it not to progress to a second iteration unless there was an error calling logging.info(log_msg) that you didn't see. |
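For reference, timeit.default_timer returns fractional seconds, so elapsed time is a plain subtraction (no millisecond conversion needed):

```python
import time
from timeit import default_timer

start = default_timer()
time.sleep(0.2)  # stand-in for the work being timed
elapsed = default_timer() - start  # seconds, not milliseconds
print(round(elapsed, 1))
```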
It failed similarly even before I put in the logging.info statement. The log file doesn't show anything else:
Same behaviour with simulated data, btw, using
Same behaviour even when I replace the pipeline (SelectKBest with mutual_info followed by random forest) with a plain random forest. It's faster (obviously), but still quitting after the first iteration.
Similar behaviour on my Mac, too (previous attempts were on CentOS).
If it's a problem with the implementation, I can't be the only one reporting it. My guess is it's much more likely that the problem is with my setup. I will look into the tests for grid search.
I slightly simplified your snippet, made it stand-alone with some random data, and reduced the number of splits from 25 to 3. It runs in ~50s on my laptop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rf = RandomForestClassifier(max_features=10, n_estimators=10, oob_score=True)
param_grid = {'min_samples_leaf': range(1, 5, 2),
              'max_features': range(1, 6, 2),
              'n_estimators': range(50, 250, 50)}
inner_cv = ShuffleSplit(n_splits=3, train_size=0.5)
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv,
                  verbose=100)

rng = np.random.RandomState(0)
train_data = rng.rand(416, 70000)
train_labels = rng.randint(0, 2, 416)
gs.fit(train_data, train_labels)
```

It keeps printing stuff on the console, which does not correspond to my definition of "hanging". If I look at one of the outputs printed on the console:
With
A back-of-the-envelope computation leads me to estimate that it would take this much for
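Filling in the arithmetic with the numbers that do appear in this thread (my own estimate, not necessarily the original one; the refit is ignored): the simplified run covered 24 candidates over 3 splits in ~50 s, so scaling back up to the original 25 splits suggests roughly 7 minutes.

```python
# Extrapolation assuming runtime is proportional to the number of fits.
n_candidates = 24             # 2 * 3 * 4 from the grid above
seconds_3_splits = 50         # reported runtime with n_splits=3
per_fit = seconds_3_splits / (n_candidates * 3)

est_25_splits = per_fit * n_candidates * 25
print(round(est_25_splits), 'seconds')  # ~417 s, i.e. about 7 minutes
```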
I am going to close this one. It doesn't look to me like there is anything inherently wrong with scikit-learn. If you debug your problem further and spot a place where GridSearchCV or RandomForestClassifier is performing very badly, do feel free to reopen. |
@lesteve, possibly related to our
These three are all starting to look similar. |
Thanks Joel and lesteve. Thanks @mjbommar for linking to related issues. I'll look into them and see if I can find any common sources of issues. My problem doesn't seem to be 100% reproducible. |
Can you upload the data btw? That would help debugging (if anyone is interested). |
is this a multi-class or multi-label problem? Have you monitored memory consumption? It could be that your ram is filling up. |
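One stdlib way to spot-check that (a sketch; the resource module is Unix-only, and ru_maxrss units differ between platforms):

```python
import resource  # standard library, Unix only

# Peak resident set size of this process so far; units are kilobytes on
# Linux and bytes on macOS, so treat it as a relative measure. Printing
# this between GridSearchCV runs shows whether RAM use is creeping up.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS so far:', peak)
```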
This is a multi-class problem (4 classes). The size is large: 416 samples with 64K features, leading to 640MB when exported using numpy.savetxt. Will try to make it smaller by exporting in binary, and will look for appropriate places to upload big data files (any suggestions?). Optimizing on n_estimators wasn't a goal - I was just trying to see if a larger forest was causing issues (becoming too demanding in CPU or RAM), and if fewer trees would reduce the hangups. I will try to monitor the memory consumption, but I don't recall the computer being unresponsive; the rest of the apps were running fine.
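On shrinking the export: a binary np.save dump stores 8 bytes per float64 instead of ~25 text characters per value, as a quick comparison shows (illustrative shape here, much smaller than the actual 416 x 64620 data):

```python
import os
import tempfile
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 1000)  # small stand-in for the real matrix

with tempfile.TemporaryDirectory() as d:
    txt_path = os.path.join(d, 'X.txt')
    npy_path = os.path.join(d, 'X.npy')
    np.savetxt(txt_path, X)   # text: ~25 characters per value
    np.save(npy_path, X)      # binary: 8 bytes per float64 (+ small header)
    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)

print('text:', txt_size, 'bytes  binary:', npy_size, 'bytes')
```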
This issue is with n_jobs=1 though, so not related. |
Following up on discussion in #5115 , I am unable to use GridSearchCV as it hangs there without throwing an error or warning.
Min code to reproduce:
Quick run of the above script, with results and software config:
I am unable to upload my files for some reason (bigger than 10MB - 412 samples with 64620 dimensions). I don't think size is causing the hangup, as I let it run overnight and it didn't even finish one GridSearchCV.fit call. I will upload it elsewhere and post a link here soon, or try working on a subset of it to reduce its size.