FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances by aadamson · Pull Request #18985 · scikit-learn/scikit-learn · GitHub

FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances #18985

Closed · wants to merge 1 commit
9 changes: 9 additions & 0 deletions doc/whats_new/v1.0.rst
@@ -44,6 +44,15 @@ Changelog
:pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
where 123456 is the *pull request* number, not the issue number.

:mod:`sklearn.ensemble`
.......................

- |Fix| Fix a bug in `_make_estimator` of :class:`ensemble.BaseEnsemble` that
  could cause segmentation faults when a mutable parameter instance (such as a
  criterion object) was shared by sub-estimators fitted in parallel.
  `_make_estimator` now deep copies each element of `estimator_params` when
  creating estimators, so sub-estimators no longer share mutable parameter
  instances.
  :pr:`18985` by :user:`Alex Adamson <aadamson>` and
  :user:`Wil Yegelwel <wyegelwel>`.


Code and Documentation Contributors
5 changes: 4 additions & 1 deletion sklearn/ensemble/_base.py
@@ -4,6 +4,7 @@
# License: BSD 3 clause

from abc import ABCMeta, abstractmethod
import copy
import numbers
from typing import List

@@ -148,7 +149,9 @@ def _make_estimator(self, append=True, random_state=None):
sub-estimators.
"""
estimator = clone(self.base_estimator_)
estimator.set_params(**{p: getattr(self, p)
# Make a deepcopy in case one of the base estimators has a mutable
# parameter that might be shared and modified during parallel fitting
estimator.set_params(**{p: copy.deepcopy(getattr(self, p))
for p in self.estimator_params})

if random_state is not None:
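To illustrate the behavior this hunk changes, here is a minimal sketch (not scikit-learn code; the MutableParam class is hypothetical) of how handing every sub-estimator the same parameter object differs from giving each one a deep copy:

import copy


class MutableParam:
    """Hypothetical stand-in for a stateful parameter such as a Criterion."""

    def __init__(self):
        self.node_count = 0


shared = MutableParam()

# Without the fix: every sub-estimator receives the very same instance,
# so concurrent in-place mutation during parallel fitting races on it.
params_before = [shared for _ in range(3)]
assert all(p is shared for p in params_before)

# With the fix: each sub-estimator gets its own deep copy and can mutate
# it freely without affecting the others or the ensemble's own attribute.
params_after = [copy.deepcopy(shared) for _ in range(3)]
assert all(p is not shared for p in params_after)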
19 changes: 19 additions & 0 deletions sklearn/ensemble/tests/test_forest.py
@@ -1377,3 +1377,22 @@ def test_little_tree_with_small_max_samples(ForestClass):

msg = "Tree without `max_samples` restriction should have more nodes"
assert tree1.node_count > tree2.node_count, msg


@pytest.mark.parametrize('Forest', FOREST_REGRESSORS)
def test_mse_criterion_object_segfault_smoke_test(Forest):
# Ensure that we can pass a mutable criterion while using parallel fit
# Non-regression test for:
# https://github.com/scikit-learn/scikit-learn/issues/12623
from sklearn.tree._classes import CRITERIA_REG

X = np.random.random((1000, 3))
y = np.random.random((1000, 1))

n_samples, n_outputs = y.shape
mse_criterion = CRITERIA_REG['mse'](n_outputs, n_samples)
est = FOREST_REGRESSORS[Forest](
n_estimators=2, n_jobs=2, criterion=mse_criterion
Member

If FOREST_REGRESSORS[Forest] knows that it is going to mutate a hyper-parameter such as criterion, I think est.fit should have the responsibility of copying criterion.

WDYT @glemaitre ?
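
As a hedged sketch of what that alternative could look like (a hypothetical ToyTree, not scikit-learn code), the estimator's fit would make its own copy of the mutable hyper-parameter before touching it:

import copy


class ToyTree:
    """Hypothetical estimator whose fit() owns the copy of `criterion`."""

    def __init__(self, criterion):
        self.criterion = criterion

    def fit(self, X, y):
        # Work on a private copy so the caller's (or a parent ensemble's)
        # object is never mutated, even when several trees fit in parallel.
        criterion = copy.deepcopy(self.criterion)
        criterion["node_count"] = len(X)  # stand-in for in-place mutation
        self.criterion_ = criterion
        return self


shared = {"node_count": 0}
trees = [ToyTree(shared).fit([[0.0], [1.0]], [0, 1]) for _ in range(2)]
assert shared == {"node_count": 0}  # the shared object is untouched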

Contributor

Hi @thomasjpfan - I'm resuming this for Alex and was wondering what you think we should do next.

If I remember correctly, since the underlying issue is specific to an assumption of thread-safety when instantiating estimators, we all settled on fixing it within _make_estimator(...), as suggested by the comments within that method.

Member

I was thinking the deepcopy may be costly for certain objects and a more targeted solution would be to copy in BaseDecisionTree:

criterion = self.criterion
if not isinstance(criterion, Criterion):
    if is_classification:
        criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
                                                 self.n_classes_)
    else:
        criterion = CRITERIA_REG[self.criterion](self.n_outputs_,
                                                 n_samples)

if not isinstance(criterion, Criterion): 
    ...
else:
    criterion = copy.deepcopy(criterion)

Contributor

This would work for us, although I'm not familiar enough with the rest of the ensemble implementation to know if it categorically covers all potential multi-threading issues.

I will update this fix and create a PR asap. Thanks again for looking into this.

)

est.fit(X, y)