FEA Add metadata routing for RFE and RFECV by OmarManzoor · Pull Request #29312 · scikit-learn/scikit-learn · GitHub

Merged: 20 commits into scikit-learn:main, Aug 13, 2024

Conversation

OmarManzoor (Contributor)

Reference Issues/PRs

Towards #22893
Fixes #7308

What does this implement/fix? Explain your changes.

  • Adds the routing of params to the fit methods of RFE and RFECV
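As a rough illustration of what "routing params to fit" means here (a hypothetical toy sketch, not scikit-learn's actual implementation), a meta-estimator simply forwards any fit metadata it receives to the `fit` of the estimator it wraps:

```python
class RecordingEstimator:
    """Stub sub-estimator that records the metadata its fit receives."""
    def fit(self, X, y, **fit_params):
        self.received_fit_params_ = fit_params
        return self

class ToyRFE:
    """Hypothetical sketch of a meta-estimator routing fit metadata."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        # Forward any metadata (e.g. sample_weight) to the sub-estimator's fit.
        self.estimator.fit(X, y, **fit_params)
        return self

est = RecordingEstimator()
ToyRFE(est).fit([[0], [1]], [0, 1], sample_weight=[0.5, 2.0])
```

The real implementation additionally validates which metadata each consumer has requested; the sketch only shows the forwarding step.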

Any other comments?

CC: @adrinjalali @glemaitre

github-actions bot commented Jun 20, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2b860e2. Link to the linter CI: here

    )
    router.add(
        scorer=self._get_scorer(),
        method_mapping=MethodMapping().add(caller="fit", callee="score"),
Member

shouldn't score also call score?

@OmarManzoor (Contributor Author) commented Jul 5, 2024

Does this have a score method?
Edit: I see this inherits from RFE, so it has a score method, but is that applicable given that the scorer isn't used in it? That calls the estimator's score method. I added score and predict in the estimator part.
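For context, the `MethodMapping` in the code snippet above declares which caller method on the meta-estimator routes metadata to which callee method on the consumer (here, the scorer). A toy sketch of that mapping (hypothetical class, not sklearn's actual `MethodMapping`):

```python
class ToyMethodMapping:
    """Hypothetical sketch of a caller -> callee method mapping."""
    def __init__(self):
        self.pairs = []

    def add(self, caller, callee):
        self.pairs.append((caller, callee))
        return self  # chainable, like the real MethodMapping

    def callees_for(self, caller):
        # Which consumer methods receive metadata when `caller` runs.
        return [callee for clr, callee in self.pairs if clr == caller]

mapping = (
    ToyMethodMapping()
    .add(caller="fit", callee="score")
    .add(caller="score", callee="score")
)
```

The review discussion below is about whether the second pair (`caller="score"`) should exist at all.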

Member

This feels a bit odd now. It's odd that we use the given scorer in fit but not in score; that seems like something which needs to be fixed.

WDYT @glemaitre

Member

Yes, I would consider this as a bug and fix it here and add an entry in the changelog stating that the estimator is now using the scoring metadata when calling score.

We had this pattern in LogisticRegressionCV and I assume that we considered it as a bug as well.

Contributor Author

Do you mean we need to add a score method which uses the scorer?

Member

Yep similarly to:

    def score(self, X, y, sample_weight=None, **score_params):
        """Score using the `scoring` option on the given test data and labels.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Test samples.

        y : array-like of shape (n_samples,)
            True labels for X.

        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights.

        **score_params : dict
            Parameters to pass to the `score` method of the underlying scorer.

            .. versionadded:: 1.4

        Returns
        -------
        score : float
            Score of self.predict(X) w.r.t. y.
        """
        _raise_for_params(score_params, self, "score")
        scoring = self._get_scorer()
        if _routing_enabled():
            routed_params = process_routing(
                self,
                "score",
                sample_weight=sample_weight,
                **score_params,
            )
        else:
            routed_params = Bunch()
            routed_params.scorer = Bunch(score={})
            if sample_weight is not None:
                routed_params.scorer.score["sample_weight"] = sample_weight

        return scoring(
            self,
            X,
            y,
            **routed_params.scorer.score,
        )


@glemaitre glemaitre self-requested a review July 22, 2024 12:33
@glemaitre (Member) left a comment

A couple of comments. But it looks pretty good.

@@ -668,6 +758,7 @@ class RFECV(RFE):
         "n_jobs": [None, Integral],
     }
     _parameter_constraints.pop("n_features_to_select")
+    __metadata_request__fit = {"groups": metadata_routing.UNUSED}
Member

I'm surprised that we don't have a linter complaining here because it is unused. If nothing is complaining then this is fine, otherwise we can do:

Suggested change
-    __metadata_request__fit = {"groups": metadata_routing.UNUSED}
+    __metadata_request__fit = {"groups": metadata_routing.UNUSED}  # noqa

Member

Do class attributes need to be used? Linters don't complain if we don't use a defined method of a class for instance.


        return router

    def _get_scorer(self):
        return check_scoring(self.estimator, scoring=self.scoring)
Member

I think it is safer to use get_scorer:

Suggested change
-    return check_scoring(self.estimator, scoring=self.scoring)
+    if self.scoring is None:
+        scoring = "accuracy" if is_classifier(self.estimator) else "r2"
+    else:
+        scoring = self.scoring
+    return get_scorer(scoring)

My issue with check_scoring is triggered when self.scoring is None. In that case we delegate to the estimator.score method and thus don't return a real scorer.

I'm wondering if there is any side effect when returning estimator.score, which does not have the score_params in its signature. I assume we should get the same error as when calling a scorer with an estimator that does not support it.

@OmarManzoor (Contributor Author) commented Jul 25, 2024

Yeah, _PassThroughScorer breaks on 3 tests when scoring is None because the recursion depth is exceeded, meaning it gets stuck in an endless loop. The code you suggested seems fine; we just have to tweak the MockClassifier defined in the tests a bit to make them pass. Could you have a look at the latest changes to see if they look alright, and then I can add the additional changelog entry.
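The endless loop mentioned above can be reproduced with a pure-Python sketch (hypothetical class names, not sklearn's code): when `score` delegates to a pass-through scorer, and that scorer in turn calls the estimator's `score`, the two call each other forever:

```python
class PassthroughScorer:
    """Sketch of a scorer that just delegates to estimator.score."""
    def __call__(self, estimator, X, y):
        return estimator.score(X, y)

class CVEstimator:
    """Sketch of a meta-estimator whose score uses its scorer."""
    def _get_scorer(self):
        # Stands in for what check_scoring returns when scoring is None.
        return PassthroughScorer()

    def score(self, X, y):
        # score -> scorer -> estimator.score -> scorer -> ... forever
        return self._get_scorer()(self, X, y)

try:
    CVEstimator().score([[0.0]], [0])
    recursed = False
except RecursionError:
    recursed = True
```

Returning a concrete scorer (e.g. via `get_scorer`) instead of a pass-through breaks the cycle.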

            .add(caller="score", callee="score"),
        )
        return router


class RFECV(RFE):
Member

Side note: We inherit from RFE and we create an RFE object inside the class. I have the impression that this is such an anti-pattern :)

Member

RFECV is such an odd estimator. We also do things like

        rfe.fit(X, y)

        # Set final attributes
        ...
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(self._transform(X), y)

I'm like, WUT? How many times do we need to fit the same thing.

@adrinjalali (Member)

Tests failing here.

@OmarManzoor (Contributor Author)

Tests failing here.

Yes I need to check and fix 😄

@OmarManzoor (Contributor Author) commented Aug 1, 2024

@adrinjalali Is it possible to remove

    DelegatorData("RFECV", RFECV, skip_methods=["transform", "inverse_transform"]),

from test_metaestimator_delegation?

Because of how we are handling the scoring for RFECV now, this doesn't seem to work anymore, even during fitting, since the scoring is used in fit as well.

@adrinjalali (Member)

Looking at the stacktrace:

sklearn/metrics/_scorer.py:288: in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = make_scorer(r2_score, response_method='predict'), method_caller = functools.partial(<function _cached_call at 0x7eec9e9cdee0>, None), estimator = SubEstimator()
X = array([[-3.92679910e-02,  1.31911756e-01, -2.11205984e-01,
        -1.21414740e+00,  1.05004467e+00, -4.21327587e-01,
...-1.20373519e+00,
        -3.24730265e-01,  4.17043503e-01,  4.58931008e-02,
         1.34803578e+00,  4.98726958e-01]])
y_true = array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1]), kwargs = {}, pos_label = None
response_method = functools.partial(<function test_metaestimator_delegation.<locals>.SubEstimator.predict at 0x7eec9d8c6a20>, SubEstimator())

    def _score(self, method_caller, estimator, X, y_true, **kwargs):
        """Evaluate the response method of `estimator` on `X` and `y_true`.
    
        Parameters
        ----------
        method_caller : callable
            Returns predictions given an estimator, method name, and other
            arguments, potentially caching results.
    
        estimator : object
            Trained estimator to use for scoring.
    
        X : {array-like, sparse matrix}
            Test data that will be fed to clf.decision_function or
            clf.predict_proba.
    
        y_true : array-like
            Gold standard target values for X. These must be class labels,
            not decision function values.
    
        **kwargs : dict
            Other parameters passed to the scorer. Refer to
            :func:`set_score_request` for more details.
    
        Returns
        -------
        score : float
            Score function applied to prediction of estimator on X.
        """
        self._warn_overlap(
            message=(
                "There is an overlap between set kwargs of this scorer instance and"
                " passed metadata. Please pass them either as kwargs to `make_scorer`"
                " or metadata, but not both."
            ),
            kwargs=kwargs,
        )
    
        pos_label = None if is_regressor(estimator) else self._get_pos_label()
        response_method = _check_response_method(estimator, self._response_method)
        y_pred = method_caller(
>           estimator, response_method.__name__, X, pos_label=pos_label
        )
E       AttributeError: 'functools.partial' object has no attribute '__name__'. Did you mean: '__ne__'?

It looks like we're mishandling partial here; we should check if it's a partial and use its func attribute, maybe? Like getattr(partial(f, 2), "func").__name__ when __name__ is absent and func is present (and maybe recurse if it's a partial on a partial)?
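A minimal sketch of the fix suggested above (hypothetical helper name, not the patch actually merged): unwrap nested functools.partial objects until a `__name__` is found:

```python
from functools import partial

def callable_name(func):
    """Return a display name for func, unwrapping nested partials."""
    # A partial wrapping a partial is handled by looping, not recursing.
    while isinstance(func, partial):
        func = func.func
    return getattr(func, "__name__", repr(func))

name = callable_name(partial(partial(sorted, reverse=True)))
```

Here `name` is `"sorted"`, because the loop unwraps both partial layers before reading `__name__`.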



        _raise_for_params(score_params, self, "score")
        scoring = self._get_scorer()
        if _routing_enabled():
            routed_params = process_routing(self, "score", **score_params)
Member

not tested?

Contributor Author

Added score to the scorer_routing_methods in test_metaestimator_metadata_routing.

@adrinjalali (Member) left a comment

LGTM. Maybe @adam2392 could have a look?

@adam2392 (Member) left a comment

LGTM except one part I just can't parse the diff!

@@ -55,7 +55,9 @@ def __init__(
         skip_methods=["score"],
     ),
     DelegatorData("RFE", RFE, skip_methods=["transform", "inverse_transform"]),
-    DelegatorData("RFECV", RFECV, skip_methods=["transform", "inverse_transform"]),
+    DelegatorData(
Member

I know this isn't part of your PR necessarily, but I'm unsure what DELEGATING_METAESTIMATORS purpose is, and can't easily figure it out from the glancing at the code.

Is it worth adding some comments somewhere above to indicate what's going on here?

cc: @adrinjalali

Member

Yeah, this is cryptic unless one reads the actual test. We're checking that these methods are only available if the sub-estimator has them, and since we now implement score without caring about the sub-estimator, we can skip it here. But a comment somewhere wouldn't hurt.
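The delegation behaviour described above can be sketched in a few lines (a toy, not the actual test helper): the meta-estimator only exposes a delegated method when its sub-estimator actually has it:

```python
class ToyMeta:
    """Hypothetical meta-estimator delegating methods to a sub-estimator."""
    def __init__(self, estimator):
        self.estimator = estimator

    def __getattr__(self, name):
        # Expose a delegated method only if the sub-estimator has it;
        # otherwise getattr raises AttributeError, so hasattr is False.
        if name in ("predict", "decision_function"):
            return getattr(self.estimator, name)
        raise AttributeError(name)

class HasPredict:
    def predict(self, X):
        return [0] * len(X)

class NoPredict:
    pass

has_it = hasattr(ToyMeta(HasPredict()), "predict")
lacks_it = hasattr(ToyMeta(NoPredict()), "predict")
```

The test iterates over DELEGATING_METAESTIMATORS checking exactly this kind of conditional availability, which is why a method implemented directly (like the new score) must be listed in skip_methods.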

Contributor Author

Added a comment.

@adrinjalali adrinjalali merged commit 38fefe1 into scikit-learn:main Aug 13, 2024
30 checks passed
@OmarManzoor OmarManzoor deleted the rfe_routing branch August 13, 2024 08:41
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
Successfully merging this pull request may close these issues.

RFE/RFECV doesn't work with sample weights
4 participants