ENH Allow prefit in stacking by Micky774 · Pull Request #22215 · scikit-learn/scikit-learn

Merged
46 commits merged into scikit-learn:main from Micky774:allow_prefit_in_stacking on Feb 22, 2022

Conversation

Contributor
@Micky774 Micky774 commented Jan 14, 2022

Reference Issues/PRs

Fixes #16556
Closes #16748
Continuation of stalled PR #16748

What does this implement/fix? Explain your changes.

(PR #16748):
Added support to use pre-fitted models in StackingClassifier and StackingRegressor.

Similar to CalibratedClassifierCV, I added the option to set cv="prefit" to use already fitted estimators in a stacking model.
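
For illustration, a minimal sketch of the option with StackingRegressor (the dataset and estimator choices here are placeholders, not taken from this PR):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_base, X_stack, y_base, y_stack = train_test_split(X, y, random_state=0)

# Base estimators are fitted ahead of time (in practice, typically on a
# separate split so the final estimator does not overfit their training data).
estimators = [
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0).fit(X_base, y_base)),
    ("ridge", Ridge().fit(X_base, y_base)),
]

# With cv="prefit" the stacker reuses the estimators as-is: fit() only trains
# the final estimator on their predictions, without refitting or cross-validating.
reg = StackingRegressor(estimators=estimators, final_estimator=Ridge(), cv="prefit")
reg.fit(X_stack, y_stack)
print(reg.score(X_stack, y_stack))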

(This PR)
Resolves remaining PR concerns, mainly regarding testing.

Any other comments?

Continuation of stalled PR #16748

- Updated `sklearn/ensemble/tests/test_stacking` to incorporate suggested changes from PR #16748
@Micky774 Micky774 changed the title from "[WIP] ENH Allow prefit in stacking" to "ENH Allow prefit in stacking" on Jan 15, 2022
Member
@thomasjpfan thomasjpfan left a comment

Thanks for the PR @Micky774! Overall looks good.

We can add a sentence to:

estimators_ : list of estimators

that says what happens when cv="prefit"
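
For instance, the attribute description could gain a sentence along these lines (the exact wording here is only a suggestion, not necessarily what was merged):

estimators_ : list of estimators
    The elements of the `estimators` parameter, having been fitted on the
    training data. When `cv="prefit"`, `estimators_` is set to `estimators`
    and is not fitted again.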

@Micky774
Contributor Author

Implemented changes, and made the documentation more consistent between StackingClassifier and StackingRegressor

Member
@thomasjpfan thomasjpfan left a comment

Thanks for the update!

Micky774 and others added 6 commits January 18, 2022 14:47
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan
Member
thomasjpfan commented Jan 25, 2022

@glemaitre Would you be interested in reviewing this PR?

@glemaitre
Member

Yep let me put it in my stack

@glemaitre glemaitre self-requested a review January 26, 2022 09:20
@thomasjpfan
Member

@jnothman Maybe you would be interested in reviewing this one? This is mostly the same as #16748 which already had your approval.

Member
@glemaitre glemaitre left a comment

The code seems fine. I just have one question regarding the API that I am not sure about.


if self.cv == "prefit":
    # Generate predictions from prefit models
    predictions = [
Member

@thomasjpfan do you think that we could benefit from parallelization over the models here?

Member
@thomasjpfan thomasjpfan Feb 22, 2022

I think there can be a benefit, but this can be done in a follow up PR with some benchmarks.
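
For reference, a self-contained sketch of the kind of per-model parallelization being discussed, using joblib directly on a list of prefit estimators. This only illustrates the idea, not the PR's implementation; the estimators, data, and n_jobs value are made up:

from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stand-ins for the already fitted base estimators of a prefit stacker.
prefit_estimators = [
    LogisticRegression(max_iter=1000).fit(X, y),
    DecisionTreeClassifier(random_state=0).fit(X, y),
]
stack_methods = ["predict_proba", "predict_proba"]

# One job per base estimator; whether this actually pays off depends on how
# expensive each estimator's prediction is, hence the call for benchmarks.
predictions = Parallel(n_jobs=2)(
    delayed(getattr(est, method))(X)
    for est, method in zip(prefit_estimators, stack_methods)
)
print([p.shape for p in predictions])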

@@ -306,7 +321,7 @@ class StackingClassifier(ClassifierMixin, _BaseStacking):
     The default classifier is a
     :class:`~sklearn.linear_model.LogisticRegression`.

-    cv : int, cross-validation generator or an iterable, default=None
+    cv : int, cross-validation generator, iterable, or 'prefit', default=None
Member

In the description above, we have a note regarding the fit of estimators_ on the full training set. It should be complemented with the option prefit (at least stating that we don't refit if cv="prefit").

Contributor Author

There's a description of the role of cv="prefit" both in the description of the cv parameter and in estimators_. I'm not sure I quite understand what you're suggesting here.

for estimator in all_estimators:
    if estimator != "drop":
        check_is_fitted(estimator)
        self.estimators_.append(estimator)
Member

I am unsure about our API contract here. If someone alters an estimator from all_estimators without calling fit on the stacking model, it will nevertheless affect the stacking model's predictions.

With a freezing API, changing a hyperparameter and calling fit would have no effect, so a deep copy would not be necessary to prevent the behaviour described above. However, without such a freezing API, we would need to make a deep copy of each estimator to prevent any side effects.

@thomasjpfan WDYT? I see that we have a similar pattern in calibration.

Member

Here is an example to explicitly illustrate what I mean:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(
    X_train, y_train, stratify=y_train, random_state=0, test_size=0.5,
)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train_1, y_train_1)),
    ('svr', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)).fit(X_train_1, y_train_1))
]

clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(), cv="prefit"
)

clf.fit(X_train, y_train)
print(f"Accuracy score: {clf.score(X_test, y_test):.2f}")

estimators[0][1].fit(X_train_2, y_train_2)

print(
    "Accuracy score after refitting an estimator: "
    f"{clf.score(X_test, y_test):.2f}"
)
Accuracy score: 0.89
Accuracy score after refitting an estimator: 0.95

Member

I think this is the norm for any mutable object we pass into __init__, such as dictionaries. We have not been systematic about deep copying because it comes with overhead.

As we discussed IRL, some may see your snippet as a feature as it behaves like Pipeline.

Logistically, given the status of freezing, waiting for a freezing API would mean this feature is delayed quite a bit. I think we decided in the 2019 sprint that we were going with option 4: #8370 (comment), which means no freezing API. But it could be worth revisiting now.

Member

Basically, we already have this issue with CalibratedClassifierCV. So I assume we will run into trouble in the future, but we might want to solve those cases as a whole.
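
For completeness, a user-side workaround under the behaviour as merged (the stacker keeps the prefit estimators by reference and does not copy them): passing deep copies to the stacker isolates it from later refits of the original objects. This is just a sketch of the idea, not something the PR adds:

import copy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)

# Hand the stacker a deep copy so later refits of `rf` cannot change its
# predictions behind its back.
clf = StackingClassifier(
    estimators=[("rf", copy.deepcopy(rf))],
    final_estimator=LogisticRegression(),
    cv="prefit",
).fit(X_train, y_train)

score_before = clf.score(X_test, y_test)
rf.fit(X_test, y_test)  # refitting the original estimator...
score_after = clf.score(X_test, y_test)  # ...no longer affects the stacker
assert score_before == score_after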

@glemaitre
Member

The PR looks good. We only need to make a decision regarding the API, i.e. whether we do a deep copy or not.
I will add a tag so that we make a decision.

Member
@glemaitre glemaitre left a comment

+1 then. @Micky774 Could you resolve the conflict so that we can merge?

@@ -311,6 +311,9 @@ Changelog
- |Enhancement| Adds support to use pre-fit models with `cv="prefit"`
in :class:`ensemble.StackingClassifier` and :class:`ensemble.StackingRegressor`.
:pr:`16748` by :user:`Siqi He <siqi-he>` and :pr:`22215` by
- |Enhancement| :class:`feature_selection.GenericUnivariateSelect` preserves
Member

It seems that there is an issue with the merging here.

Member
@thomasjpfan thomasjpfan Feb 17, 2022

I've been seeing more weird merge issues lately in the changelog in other PRs. It could be related to #21516

Contributor Author

I'm not sure I understand what the problem is -- everything looks fine on my end?

Member
@thomasjpfan thomasjpfan Feb 17, 2022

Looks fine now. Maybe the GitHub interface was doing something weird with the diff.

Member

It was missing your username: 9d6c27a

Usually, GitHub complains about a merge conflict in such cases. Here, merging locally does not show any conflicts, but git actually messes up the merge, as it did for this one. Basically, GitHub is right, but I don't know why git resolves this merge conflict on its own.

It generally only happens with the changelog.

@Micky774
Contributor Author

Just wanted to ping again to see if this is ready to merge -- it has two approvals and should be caught up with main.

@thomasjpfan thomasjpfan merged commit 691972a into scikit-learn:main Feb 22, 2022
@Micky774 Micky774 deleted the allow_prefit_in_stacking branch February 22, 2022 19:22
thomasjpfan added a commit to thomasjpfan/scikit-learn that referenced this pull request Mar 1, 2022
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Siqi He <siqi.he@upstart.com>
Successfully merging this pull request may close these issues.

Add Pre-fit Model to Stacking Model