Unexpected behavior when passing multiple parameter sets to RandomizedSearchCV #18057
Comments
I can confirm that the effect seems consistent across `random_state`. Strange.
When the grid is made entirely of lists, sampling goes down a different path, which flattens the grid: scikit-learn/sklearn/model_selection/_search.py, lines 269 to 274 in 6a29e2c.
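For illustration, here is a minimal sketch of what that flattening amounts to (a toy re-implementation, not the actual `_search.py` code, and the grids below are hypothetical): each all-list dict is expanded into its full set of combinations and the results are pooled, so a larger dict contributes proportionally more candidates to uniform sampling.

```python
from itertools import product

def flatten_grids(param_dicts):
    # Expand each all-list dict into its full set of parameter
    # combinations and pool everything into one flat list.
    pool = []
    for grid in param_dicts:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            pool.append(dict(zip(keys, values)))
    return pool

# Hypothetical grids: 5 * 6 = 30 combinations vs 3 * 2 * 2 * 5 = 60
params1 = {'n_estimators': [10, 20, 30, 40, 50],
           'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
params2 = {'n_estimators': [60, 70, 80],
           'max_features': ['auto', None],
           'bootstrap': [True, False],
           'min_samples_leaf': [1, 2, 3, 4, 5]}
pool = flatten_grids([params1, params2])
print(len(pool))  # 30 + 60 = 90; uniform draws from this pool land
                  # in the larger grid two thirds of the time
```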
When replacing one of the ranges with a scipy distribution it works as expected:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier()

# 30 possible combinations
params1 = {}
params1['n_estimators'] = [10, 20, 30, 40, 50]
params1['min_samples_leaf'] = [1, 2, 3, 4, 5, 6]

# 120 possible combinations
params2 = {}
params2['n_estimators'] = [60, 70, 80]
params2['min_samples_leaf'] = scipy.stats.uniform(loc=0.25, scale=0.2)
params2['max_features'] = ['auto', None]
params2['bootstrap'] = [True, False]

params_both = [params1, params2]
rand = RandomizedSearchCV(rf, params_both, cv=2, scoring='accuracy',
                          n_iter=50, random_state=1)
rand.fit(X, y)
print(sorted(rand.cv_results_['param_n_estimators']))
```

and outputs:
Fixing this means we are changing behavior. Should we consider this a bug?
I think so. The easiest fix is probably something like:

```python
all_lists = len(self.param_distributions) == 1 and all(
    not hasattr(v, "rvs") for v in self.param_distributions[0].values())
```

But I'm also wondering why we have this weird sampling mechanism:
(Note that this was written long before we introduced support for lists of dicts.) This behaviour doesn't seem natural and I find it quite unexpected: why should the presence of a scipy distribution influence the sampling of the entire dict? Sampling without replacement feels like we're trying to accommodate cases where grid search should be used instead. Maybe we should deprecate this instead. But I'm not sure how we could do that smoothly?
Sampling without replacement was introduced because the distribution has a finite number of distinct samples.

I think this behaviour is surprising and should be considered a bug.
I find it a bit unexpected, especially considering that you can still have a finite number of candidates with discrete distributions anyway. Also, this behavior causes a bug in the
So the more natural behavior would be to divide `n_iter` by the number of grids, and then sample without replacement from each flattened grid? If one of the grids is smaller than its allocated share, that creates weird edge cases, right? I don't see how this could be computed directly, but the naive algorithm to compute the allocations is O(n_dicts ** 2), which isn't so bad. Though you can sort the grids by size and then do a linear scan, so it's O(n_dicts * log(n_dicts)).
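A sketch of that sorted linear-scan allocation (my own toy code with a hypothetical `allocate` helper, not anything in scikit-learn): sort the grids by size, give each grid a fair share of the remaining budget, and let grids too small to absorb their share return the surplus to the larger ones in a single pass.

```python
def allocate(n_iter, grid_sizes):
    """Split n_iter across grids, capping each grid at its own size.

    Grids are processed smallest first; each gets an equal share of
    whatever budget remains, so a grid that cannot absorb its share
    passes the surplus on to the remaining (larger) grids.
    """
    order = sorted(range(len(grid_sizes)), key=lambda i: grid_sizes[i])
    alloc = [0] * len(grid_sizes)
    remaining = n_iter
    for pos, i in enumerate(order):
        share = remaining // (len(grid_sizes) - pos)  # fair share of the rest
        alloc[i] = min(grid_sizes[i], share)
        remaining -= alloc[i]
    return alloc

print(allocate(50, [30, 120]))  # [25, 25]: both grids can absorb their share
print(allocate(50, [10, 120]))  # [10, 40]: the small grid's surplus
                                # flows to the large one
```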
take |
Describe the bug
Here is part of the documentation for the `param_distributions` parameter of `RandomizedSearchCV`:

My interpretation is that if I pass a list of two dictionaries, then at each iteration one of the two dictionaries is chosen at random with equal probability, and a parameter combination is sampled from the chosen dictionary.

I have found that this is not the case. Instead, if one of the two dictionaries has many more possible parameter combinations, then the larger dictionary will usually be chosen at each iteration.
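To see why pooled sampling would produce a split like the one reported, here is a toy simulation (an assumption about the mechanism, not the actual scikit-learn code): 30 combinations come from `params1` and 120 from `params2`, and 50 candidates are drawn uniformly without replacement from the pooled 150.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only
# Label each of the 150 pooled combinations by its source dict
pool = ['params1'] * 30 + ['params2'] * 120
draws = random.sample(pool, 50)  # sample without replacement
print(draws.count('params1'), draws.count('params2'))
# Expected counts are 50 * 30/150 = 10 and 50 * 120/150 = 40,
# close to the 9 vs 41 split actually observed.
```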
Steps/Code to Reproduce
Expected Results

Since `n_iter=50`, I would expect that `params1` and `params2` would each be chosen about 25 times.

Actual Results
Here is the actual output of the last line:

As you can see, `params1` (which had values 10 through 50) was chosen only 9 times, whereas `params2` (which had values 60 through 80) was chosen 41 times.

Comments
There seem to be two possibilities:

1. This is the desired behavior of `RandomizedSearchCV`. In that case, I would propose tweaking the documentation to make this behavior clearer.
2. This is not the desired behavior of `RandomizedSearchCV`. In that case, I would propose fixing the behavior so that it matches the documentation.

I don't have a strong feeling about which behavior is "optimal".
This was implemented in #14549 by @amueller, so he may have some insight on this!
Versions