FIX `TransformedTargetRegressor` warns when `set_output` expects dataframe by StefanieSenger · Pull Request #29401 · scikit-learn/scikit-learn · GitHub
FIX TransformedTargetRegressor warns when set_output expects dataframe #29401


Merged
merged 5 commits into from
Jul 5, 2024

Conversation

StefanieSenger
Contributor
@StefanieSenger StefanieSenger commented Jul 3, 2024

Reference Issues/PRs

closes #29361

What does this implement/fix? Explain your changes.

Prevents warnings about the output type from being raised on TransformedTargetRegressor().fit() when the global output is set to "pandas" or "polars".

Any other comments?

Did this together with @lesteve: we figured that TransformedTargetRegressor is in fact not a transformer, and therefore there should not be a warning; we would neither expect X to be returned as a DataFrame, nor do we need to raise an exception.
Setting set_config(transform_output="pandas") should have no effect within TransformedTargetRegressor().fit().

I have also checked other places where FunctionTransformer is used, but those occurrences all refer to actual transformers, so there should be no other places where we need to apply the same fix.

In the second commit, I also suggest a documentation change to explain better the difference between set_output(transform=None) and set_output(transform="default") (see this comment).

github-actions bot commented Jul 3, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 792bd4a. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title fix and regression test FIX TransformedTargetRegressor warns when set_output expects dataframe Jul 3, 2024
Member
@lesteve lesteve left a comment


Thanks for the PR, a few comments.

@@ -350,7 +350,7 @@ def set_output(self, *, transform=None):
- `"default"`: Default output format of a transformer
- `"pandas"`: DataFrame output
- `"polars"`: Polars output
- `None`: Transform configuration is unchanged
- `None`: Transform configuration is unchanged, global configuration is used
Member
@lesteve lesteve Jul 3, 2024


Actually it is not necessarily using the global configuration, you could imagine something like this:

estimator.set_output(transform='pandas')
estimator.set_output(transform=None)

The global configuration is transform='default' but the estimator is using transform='pandas'

More a meta comment: I get the fact that it may not be very clear from the docstring what transform=None does but I don't think there is an easy way to improve the situation in the docstring. Full disclosure: I am not sure about the use case for transform=None actually 🤔.

Contributor Author
@StefanieSenger StefanieSenger Jul 4, 2024


Oh, I was thinking that we cannot actively set transform=None, because this is what the docstring says: transform : {"default", "pandas", "polars"}, default=None.

Also, I have just discovered, there is

    supported_outputs = ADAPTERS_MANAGER.supported_outputs
    if dense_config not in supported_outputs:
        raise ValueError(
            f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
        )

to prevent us from setting transform=None.

It can only stay None if it is untouched, and then the global configuration is used within the _get_output_config() function.

I'm happy to revert this if it's wrong, but I will keep my suggestion for now to facilitate discussion.
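The fallback described above can be sketched in plain Python (illustrative only, not scikit-learn's actual implementation; the names GLOBAL_CONFIG and get_output_config are made up for this sketch):

```python
# Illustrative sketch: an estimator-level output setting, if present, wins;
# otherwise the global configuration is used. None acts as a sentinel that
# leaves the existing configuration untouched.
GLOBAL_CONFIG = {"transform_output": "default"}

class Estimator:
    def set_output(self, *, transform=None):
        if transform is None:  # sentinel: do not change anything
            return self
        self._sklearn_output_config = {"transform": transform}
        return self

def get_output_config(estimator):
    local = getattr(estimator, "_sklearn_output_config", {})
    # fall back to the global configuration when nothing was set locally
    return local.get("transform", GLOBAL_CONFIG["transform_output"])

est = Estimator()
print(get_output_config(est))      # nothing set locally -> 'default'
est.set_output(transform="pandas")
est.set_output(transform=None)     # no-op: the 'pandas' setting survives
print(get_output_config(est))      # -> 'pandas'
```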

Member


transform : {"default", "pandas", "polars"}, default=None

I guess the docstring is slightly misleading then. It seems like you can use transform=None:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)

scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.set_output(transform=None)

scaler.fit(X)
# this outputs a pandas DataFrame so transform=None did not change transform="pandas"
X_scaled = scaler.transform(X)

Member
@thomasjpfan thomasjpfan Jul 4, 2024


None is used as a sentinel. In the future, if set_output supported predict, then the default transform=None would leave the transform configuration alone:

scaler.set_output(predict="pandas") 
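That sentinel idea can be sketched for such a hypothetical future set_output (the predict parameter does not exist in scikit-learn; this is only an illustration of the point above):

```python
# Hypothetical sketch: set_output() with both transform and predict, where
# None means "leave this setting alone". Neither the predict parameter nor
# this class exists in scikit-learn.
class Estimator:
    def __init__(self):
        self.config = {"transform": "default", "predict": "default"}

    def set_output(self, *, transform=None, predict=None):
        if transform is not None:
            self.config["transform"] = transform
        if predict is not None:
            self.config["predict"] = predict
        return self

est = Estimator()
est.set_output(transform="pandas")
est.set_output(predict="pandas")  # transform=None here: transform stays "pandas"
print(est.config)
```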

Contributor Author
@StefanieSenger StefanieSenger Jul 5, 2024


Your example got me really confused @lesteve, because I had tried a similar experiment with FunctionTransformer, which had led me to think that we cannot pass transform=None:

from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer

#set_config(transform_output="pandas")  # same as using set_output before transform=None 
X, y = load_iris(as_frame=True, return_X_y=True)

functrans = FunctionTransformer()
functrans.set_output(transform="pandas")
functrans.set_output(transform=None)

functrans.fit(X)
# this raises a ValueError:
functrans.transform(X)

ValueError: output config must be in ['default', 'pandas', 'polars'], got None

It seems that FunctionTransformer().set_output() and _SetOutputMixin.set_output() (from StandardScaler) are handling this differently. The latter returns early with if transform is None: return self, while FunctionTransformer can pass self._sklearn_output_config = {"transform": None} on to _get_output_config(), which then raises the error message above.

I am not sure what to make of this, though...
Is this wanted/expected, or should we align this?

I think I should revert the documentation change, since transform=None can sometimes be passed, and then the previous setting is used, not necessarily the global configuration.

Member


OK well, I think we should fix the bug first and leave the docstring modification for another PR when we understand a bit more the .set_output thing.

Contributor Author


Yes, I agree. I will revert the documentation changes then.

@@ -393,3 +393,17 @@ def test_transform_target_regressor_pass_extra_predict_parameters():
regr.fit(X, y)
regr.predict(X, check_input=False)
assert regr.regressor_.predict_called


@pytest.mark.filterwarnings("error")
Member


Did you find this pattern used in other places in our tests? Personally, I find it a bit less clear than the recommendation from pytest for checking that no warning is emitted:

with warnings.catch_warnings():
    warnings.simplefilter("error")
    ...

If I had to give a small justification for it, I would say that the code checking that there is no warning is closer to the code potentially creating the warning, which maybe helps make the intent a bit clearer ...
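A self-contained version of that pattern (plain stdlib; quiet_function is a made-up stand-in for the real code under test):

```python
# Minimal demonstration of the catch_warnings pattern recommended above:
# inside the block, every warning is promoted to an exception, so the test
# fails loudly if the code under test warns.
import warnings

def quiet_function(x):
    return x * 2  # emits no warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any warning into an exception
    result = quiet_function(21)

print(result)  # reached only if no warning was raised
```

If quiet_function called warnings.warn(...), the block would raise (e.g. a UserWarning as an exception) instead of printing.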

Contributor Author


Yes, I see what you mean. @pytest.mark.filterwarnings("error") is actually not used anywhere else in scikit-learn. I found it by googling how to make the test fail when a warning is raised.

Let's use warnings.catch_warnings instead.

Contributor Author
@StefanieSenger StefanieSenger left a comment


Thank you @lesteve for reviewing. I have adopted the suggested changes and replied regarding the documentation change. Would you have another look?



@StefanieSenger
Contributor Author
StefanieSenger commented Jul 5, 2024

I have just reverted the documentation change on _column_transformer.py, which made our conversation there almost unrecoverable.

Here's a link to it, in case we want to get back to it later.

Member
@lesteve lesteve left a comment


LGTM, thanks!

Looks like the changelog check now complains (no idea why ...). I have added the "no changelog required" label because I think removing a spurious warning is certainly a good thing, but it probably does not belong in the changelog.

Member
@thomasjpfan thomasjpfan left a comment


I feel like this should be backported to 1.5.2. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?

@StefanieSenger
Contributor Author
StefanieSenger commented Jul 5, 2024

I feel like this should be backported to 1.5.2. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?

Thanks @thomasjpfan,
I will put the changelog entry into a new 1.5.2 section in the v1.5.rst file.

@thomasjpfan thomasjpfan added this to the 1.5.2 milestone Jul 5, 2024
@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jul 5, 2024
@thomasjpfan thomasjpfan enabled auto-merge (squash) July 5, 2024 21:30
@thomasjpfan thomasjpfan merged commit 938c36b into scikit-learn:main Jul 5, 2024
29 checks passed
@StefanieSenger StefanieSenger deleted the new_branch branch July 6, 2024 04:41
@lesteve
Member
lesteve commented Jul 6, 2024

@thomasjpfan do you have inside knowledge that there will be a 1.5.2 🤔?

I was unsure about the changelog for this one. To be honest, it's a nice way to recognise the contributor's work, but at the same time nobody is going to think "weird, I don't get this spurious warning anymore, let's look at the changelog to see whether they changed something" ...

@lorentzenchr
Member

@StefanieSenger Thanks for this fix!

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 9, 2024
Labels
module:compose No Changelog Needed To backport PR merged in master that need a backport to a release branch defined based on the milestone.

Successfully merging this pull request may close these issues.

TransformedTargetRegressor warns about set_output set to pandas
4 participants