MRG clone parameters in gridsearch etc by amueller · Pull Request #15096 · scikit-learn/scikit-learn · GitHub

MRG clone parameters in gridsearch etc #15096


Merged
merged 5 commits on Oct 29, 2019
6 changes: 4 additions & 2 deletions sklearn/model_selection/_search.py
@@ -730,8 +730,10 @@ def evaluate_candidates(candidate_params):
      self.best_params_ = results["params"][self.best_index_]

      if self.refit:
-         self.best_estimator_ = clone(base_estimator).set_params(
-             **self.best_params_)
+         # we clone again after setting params in case some
+         # of the params are estimators as well.
+         self.best_estimator_ = clone(clone(base_estimator).set_params(

Member: Is the following good enough?

    clone(base_estimator.set_params(**self.best_params_))

Member Author: I don't think so? We're changing base_estimator then, right?
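
For context on this exchange: set_params mutates the object it is called on and returns that same object, so the one-clone form above would write the best parameters onto base_estimator itself. A minimal sketch of that behaviour (illustrative, not from the PR):

    # set_params modifies the estimator in place and returns self
    from sklearn.linear_model import Ridge

    est = Ridge(alpha=1.0)
    assert est.set_params(alpha=10.0) is est  # returns the same object
    assert est.alpha == 10.0                  # mutated in place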

Member: The test_grid_search_pipeline_steps test passes without the double clone. From

    base_estimator = clone(self.estimator)

it is already a clone?

Member: We could indeed only clone once as @thomasjpfan suggested, since base_estimator is just a local variable which isn't used later.

I guess cloning twice is fine too: no surprises.

+             **self.best_params_))
          refit_start_time = time.time()
          if y is not None:
              self.best_estimator_.fit(X, y, **fit_params)
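A minimal sketch of why refit clones twice here, assuming an estimator-valued search parameter (names are illustrative, not from the PR):

    # Pipeline.fit does not clone its steps, so after set_params the
    # pipeline would hold the very estimator object supplied as a
    # parameter; the outer clone gives refit a fresh, unfitted copy.
    from sklearn.base import clone
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline

    base_estimator = Pipeline([('regressor', LinearRegression())])
    best_params = {'regressor': LinearRegression()}

    once = clone(base_estimator).set_params(**best_params)
    assert once.named_steps['regressor'] is best_params['regressor']

    twice = clone(clone(base_estimator).set_params(**best_params))
    assert twice.named_steps['regressor'] is not best_params['regressor']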
9 changes: 8 additions & 1 deletion sklearn/model_selection/_validation.py
@@ -488,7 +488,14 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose,

      train_scores = {}
      if parameters is not None:
-         estimator.set_params(**parameters)
+         # clone after setting parameters in case any parameters

Member: I wonder if someone had code relying on the existing behaviour.

Add a test for this wrt cross_validate??

Member Author: Possibly, but I'm not sure what to do about that. What do you want tested?

Member: No, parameters is not set by cross_validate. Could add a test for validation_curve. But I'm okay without.

+         # are estimators (like pipeline steps)
+         # because pipeline doesn't clone steps in fit
+         cloned_parameters = {}
+         for k, v in parameters.items():
+             cloned_parameters[k] = clone(v, safe=False)

Comment on lines +494 to +496

Member: nit: dict comprehension?
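
Presumably the suggested one-liner would read (sketch of an equivalent rewrite, not from the PR):

    cloned_parameters = {k: clone(v, safe=False)
                         for k, v in parameters.items()}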


+         estimator = estimator.set_params(**cloned_parameters)

      start_time = time.time()

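Worth noting for readers: safe=False makes clone fall back to a deep copy for values that are not scikit-learn estimators, so plain parameters (numbers, strings, arrays) pass through unchanged while estimator values come back unfitted. A quick sketch (illustrative, not from the PR):

    # clone with safe=False: estimators are rebuilt unfitted, anything
    # else is deep-copied instead of raising a TypeError
    from sklearn.base import clone
    from sklearn.linear_model import Ridge

    step = clone(Ridge(alpha=0.5), safe=False)  # fresh, unfitted estimator
    alpha = clone(0.5, safe=False)              # plain value, deep-copied
    assert isinstance(step, Ridge) and alpha == 0.5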
20 changes: 19 additions & 1 deletion sklearn/model_selection/tests/test_search.py
@@ -63,7 +63,7 @@
  from sklearn.metrics import roc_auc_score
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import Pipeline
- from sklearn.linear_model import Ridge, SGDClassifier
+ from sklearn.linear_model import Ridge, SGDClassifier, LinearRegression

  from sklearn.model_selection.tests.common import OneTimeSplitter

@@ -198,6 +198,24 @@ def test_grid_search():
      assert_raises(ValueError, grid_search.fit, X, y)


+ def test_grid_search_pipeline_steps():
+     # check that parameters that are estimators are cloned before fitting
+     pipe = Pipeline([('regressor', LinearRegression())])
+     param_grid = {'regressor': [LinearRegression(), Ridge()]}
+     grid_search = GridSearchCV(pipe, param_grid, cv=2)
+     grid_search.fit(X, y)
+     regressor_results = grid_search.cv_results_['param_regressor']
+     assert isinstance(regressor_results[0], LinearRegression)
+     assert isinstance(regressor_results[1], Ridge)
+     assert not hasattr(regressor_results[0], 'coef_')
+     assert not hasattr(regressor_results[1], 'coef_')
+     assert regressor_results[0] is not grid_search.best_estimator_
+     assert regressor_results[1] is not grid_search.best_estimator_
+     # check that we didn't modify the parameter grid that was passed
+     assert not hasattr(param_grid['regressor'][0], 'coef_')
+     assert not hasattr(param_grid['regressor'][1], 'coef_')


  def check_hyperparameter_searcher_with_fit_params(klass, **klass_kwargs):
      X = np.arange(100).reshape(10, 10)
      y = np.array([0] * 5 + [1] * 5)
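As the reviewers note above, this is a small behaviour change: the estimator objects a user passes in param_grid are no longer fit in place during the search. A hedged sketch of the user-visible effect after this PR (illustrative, not from the PR itself):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = make_regression(n_samples=20, n_features=3, random_state=0)
    param_grid = {'regressor': [LinearRegression(), Ridge()]}
    search = GridSearchCV(Pipeline([('regressor', LinearRegression())]),
                          param_grid, cv=2)
    search.fit(X, y)

    # The grid's own objects stay unfitted; the fitted model lives in
    # best_estimator_ only.
    assert not hasattr(param_grid['regressor'][0], 'coef_')
    assert hasattr(search.best_estimator_.named_steps['regressor'], 'coef_')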