8000 FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances by aadamson · Pull Request #18985 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances #18985

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

aadamson
Copy link
@aadamson aadamson commented Dec 9, 2020

Reference Issues/PRs

Fixes #12623
Adds a smoke test as requested in #12690

What does this implement/fix? Explain your changes.

The issue is that when trees are created by the base ensemble class, the parameters are not copied. This is problematic in the case of the criterion parameter if a Criterion object is passed in. When this happens, all the trees share the same object and mutate it. If n_jobs > 1 (or -1, and we have more than one core available), they are mutating the same object concurrently.

Any other comments?

The functional changes of this PR are the same as those in #12690. I'm just adding a test that uses an instance of a Criterion object with n_jobs=-1 to validate that concurrent access to a Criterion instance no longer causes a segfault.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can wait for #18958 to be merged but then add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@aadamson
Copy link
Author

Sure, thanks for the review. I'll incorporate your suggestions now and then keep an eye on #18958.

@glemaitre
Copy link
Member

So you can add an entry in what's new since the PR creating the log for 1.0 was created.

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@glemaitre glemaitre changed the title Copy _make_estimator parameters FIX make deepcopy of parameters in ensemble to avoid sharing mutable instances Dec 15, 2020
…concurrent accesses to the same Criterion instance
@aadamson
Copy link
Author

Changelog updated

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @aadamson !

n_samples, n_outputs = y.shape
mse_criterion = CRITERIA_REG['mse'](n_outputs, n_samples)
est = FOREST_REGRESSORS[Forest](
n_estimators=2, n_jobs=2, criterion=mse_criterion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FOREST_REGRESSORS[Forest] knows that it is going to mutate a hyper-parameter such as criterion, I think est.fit should have the responsibility of copying criterion.

WDYT @glemaitre ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @thomasjpfan - I'm resuming this for Alex and was wondering what you think we should do next.

If I remember correctly, I think since the underlying issue is specific to a certain presumption of thread-safety when instantiating estimators we all settled on fixing it within _make_estimator(...) as suggested by the comments within that method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the deepcopy may be costly for certain objects and a more targeted solution would be to copy in BaseDecisionTree:

criterion = self.criterion
if not isinstance(criterion, Criterion):
if is_classification:
criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
self.n_classes_)
else:
criterion = CRITERIA_REG[self.criterion](self.n_outputs_,
n_samples)

if not isinstance(criterion, Criterion): 
    ...
else:
    criterion = copy.deepcopy(criterion)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work for us, although I'm not familiar enough with the rest of the ensemble implementation to know if it categorically covers all potential multi-threading issues.

I will update this fix and create a PR asap. Thanks again for looking into this.

Base automatically changed from master to main January 22, 2021 10:53
@thomasjpfan
Copy link
Member

Closing since this is resolved in #19580

@thomasjpfan thomasjpfan closed this Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault when passing Criterion object to Forest ensembles with n_jobs>1
4 participants
0