-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances #18985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
eba4cf1
to
bb91401
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can wait for #18958 to be merged but then add an entry to the change log at doc/whats_new/v*.rst
. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
.
Sure, thanks for the review. I'll incorporate your suggestions now and then keep an eye on #18958. |
bb91401
to
8d07633
Compare
So you can add an entry in what's new since the PR creating the log for 1.0 was created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…concurrent accesses to the same Criterion instance
8d07633
to
97f7765
Compare
Changelog updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @aadamson !
n_samples, n_outputs = y.shape | ||
mse_criterion = CRITERIA_REG['mse'](n_outputs, n_samples) | ||
est = FOREST_REGRESSORS[Forest]( | ||
n_estimators=2, n_jobs=2, criterion=mse_criterion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If FOREST_REGRESSORS[Forest]
knows that it is going to mutate a hyper-parameter such as criterion
, I think est.fit
should have the responsibility of copying criterion
.
WDYT @glemaitre ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @thomasjpfan - I'm resuming this for Alex and was wondering what you think we should do next.
If I remember correctly, I think since the underlying issue is specific to a certain presumption of thread-safety when instantiating estimators we all settled on fixing it within _make_estimator(...)
as suggested by the comments within that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the deepcopy may be costly for certain objects and a more targeted solution would be to copy in BaseDecisionTree
:
scikit-learn/sklearn/tree/_classes.py
Lines 344 to 351 in 94abe05
criterion = self.criterion | |
if not isinstance(criterion, Criterion): | |
if is_classification: | |
criterion = CRITERIA_CLF[self.criterion](self.n_outputs_, | |
self.n_classes_) | |
else: | |
criterion = CRITERIA_REG[self.criterion](self.n_outputs_, | |
n_samples) |
if not isinstance(criterion, Criterion):
...
else:
criterion = copy.deepcopy(criterion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would work for us, although I'm not familiar enough with the rest of the ensemble implementation to know if it categorically covers all potential multi-threading issues.
I will update this fix and create a PR asap. Thanks again for looking into this.
Closing since this is resolved in #19580 |
Reference Issues/PRs
Fixes #12623
Adds a smoke test as requested in #12690
What does this implement/fix? Explain your changes.
The issue is that when trees are created by the base ensemble class, the parameters are not copied. This is problematic in the case of the criterion parameter if a Criterion object is passed in. When this happens, all the trees share the same object and mutate it. If n_jobs > 1 (or -1, and we have more than one core available), they are mutating the same object concurrently.
Any other comments?
The functional changes of this PR are the same as those in #12690. I'm just adding a test that uses an instance of a Criterion object with
n_jobs=-1
to validate that concurrent access to a Criterion instance no longer causes a segfault.