FIX `TransformedTargetRegressor` warns when `set_output` expects dataframe by StefanieSenger · Pull Request #29401 · scikit-learn/scikit-learn · GitHub
FIX TransformedTargetRegressor warns when set_output expects dataframe #29401


Merged
merged 5 commits into from
Jul 5, 2024

Conversation

StefanieSenger
Contributor
@StefanieSenger StefanieSenger commented Jul 3, 2024

Reference Issues/PRs

closes #29361

What does this implement/fix? Explain your changes.

Prevents warnings about the output type from being raised on TransformedTargetRegressor().fit() when the global output is set to "pandas" or "polars".

Any other comments?

Did this together with @lesteve: we figured that TransformedTargetRegressor is in fact not a transformer, and therefore there should not be a warning; we would neither expect X to be returned as a DataFrame, nor do we need to raise an exception.
Setting set_config(transform_output="pandas") should have no effect within TransformedTargetRegressor().fit().

I have also checked other places where FunctionTransformer is used, but those occurrences all refer to actual transformers, so there should be no other places where we need to apply the same fix.

In the second commit, I also suggest a documentation change to explain better the difference between set_output(transform=None) and set_output(transform="default") (see this comment).

github-actions bot commented Jul 3, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 792bd4a. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title fix and regression test FIX TransformedTargetRegressor warns when set_output expects dataframe Jul 3, 2024
Member
@lesteve lesteve left a comment


Thanks for the PR, a few comments.

@@ -350,7 +350,7 @@ def set_output(self, *, transform=None):
- `"default"`: Default output format of a transformer
- `"pandas"`: DataFrame output
- `"polars"`: Polars output
- `None`: Transform configuration is unchanged
- `None`: Transform configuration is unchanged, global configuration is used
Member
@lesteve lesteve Jul 3, 2024


Actually it is not necessarily using the global configuration, you could imagine something like this:

estimator.set_output(transform='pandas')
estimator.set_output(transform=None)

The global configuration is transform='default' but the estimator is using transform='pandas'

More a meta comment: I get the fact that it may not be very clear from the docstring what transform=None does but I don't think there is an easy way to improve the situation in the docstring. Full disclosure: I am not sure about the use case for transform=None actually 🤔.

Contributor Author
@StefanieSenger StefanieSenger Jul 4, 2024


Oh, I was thinking that we cannot actively set transform=None, because this is what the docstring says: transform : {"default", "pandas", "polars"}, default=None.

Also, I have just discovered, there is

    supported_outputs = ADAPTERS_MANAGER.supported_outputs
    if dense_config not in supported_outputs:
        raise ValueError(
            f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
        )

to prevent us from setting transform=None.

It can only stay None if it is untouched, and then the global configuration is used within the _get_output_config() function.

I'm happy to revert this if it's wrong, but I will keep my suggestion for now to facilitate discussion.
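The fallback described above can be sketched in plain Python (illustrative only, not scikit-learn's actual implementation; the names GLOBAL_CONFIG and get_output_config are made up for this sketch):

```python
# Illustrative sketch: an estimator-level output setting, if present, wins;
# otherwise the global configuration is used. None acts as a sentinel that
# leaves the existing configuration untouched.
GLOBAL_CONFIG = {"transform_output": "default"}

class Estimator:
    def set_output(self, *, transform=None):
        if transform is None:  # sentinel: do not change anything
            return self
        self._sklearn_output_config = {"transform": transform}
        return self

def get_output_config(estimator):
    local = getattr(estimator, "_sklearn_output_config", {})
    # fall back to the global configuration when nothing was set locally
    return local.get("transform", GLOBAL_CONFIG["transform_output"])

est = Estimator()
print(get_output_config(est))      # nothing set locally -> 'default'
est.set_output(transform="pandas")
est.set_output(transform=None)     # no-op: the 'pandas' setting survives
print(get_output_config(est))      # -> 'pandas'
```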

Member


transform : {"default", "pandas", "polars"}, default=None

I guess the docstring is slightly misleading then. It seems like you can use transform=None:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, y = load_iris(as_frame=True, return_X_y=True)

scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.set_output(transform=None)

scaler.fit(X)
# this outputs a pandas DataFrame so transform=None did not change transform="pandas"
X_scaled = scaler.transform(X)

Member
@thomasjpfan thomasjpfan Jul 4, 2024


None is used as a sentinel. In the future, if set_output supported predict, then the default transform=None would leave the transform configuration alone:

scaler.set_output(predict="pandas") 
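That sentinel idea can be sketched for such a hypothetical future set_output (the predict parameter does not exist in scikit-learn; this is only an illustration of the point above):

```python
# Hypothetical sketch: set_output() with both transform and predict, where
# None means "leave this setting alone". Neither the predict parameter nor
# this class exists in scikit-learn.
class Estimator:
    def __init__(self):
        self.config = {"transform": "default", "predict": "default"}

    def set_output(self, *, transform=None, predict=None):
        if transform is not None:
            self.config["transform"] = transform
        if predict is not None:
            self.config["predict"] = predict
        return self

est = Estimator()
est.set_output(transform="pandas")
est.set_output(predict="pandas")  # transform=None here: transform stays "pandas"
print(est.config)
```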

Contributor Author
@StefanieSenger StefanieSenger Jul 5, 2024


Your example got me really confused @lesteve, because I had tried a similar experiment with FunctionTransformer, which had led me to think that we cannot pass transform=None:

from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer

#set_config(transform_output="pandas")  # same as using set_output before transform=None 
X, y = load_iris(as_frame=True, return_X_y=True)

functrans = FunctionTransformer()
functrans.set_output(transform="pandas")
functrans.set_output(transform=None)

functrans.fit(X)
# this raises a ValueError:
functrans.transform(X)

ValueError: output config must be in ['default', 'pandas', 'polars'], got None

It seems that FunctionTransformer().set_output() and _SetOutputMixin.set_output() (from StandardScaler) are handling this differently. The latter returns early with if transform is None: return self, while FunctionTransformer can pass self._sklearn_output_config = {"transform": None} on to _get_output_config(), which then raises the error message above.

I am not sure what to make of this, though...
Is this wanted/expected, or should we align this?

I think I should revert the documentation change, since transform=None can sometimes be passed, and then the previous setting is used, not necessarily the global configuration.

Member


OK well, I think we should fix the bug first and leave the docstring modification for another PR when we understand a bit more the .set_output thing.

Contributor Author


Yes, I agree. I will revert the documentation changes then.

@@ -393,3 +393,17 @@ def test_transform_target_regressor_pass_extra_predict_parameters():
regr.fit(X, y)
regr.predict(X, check_input=False)
assert regr.regressor_.predict_called


@pytest.mark.filterwarnings("error")
Member


Did you find this pattern used in other places in our tests? Personally, I find it a bit less clear than the recommendation from pytest for checking that no warning is emitted:

with warnings.catch_warnings():
    warnings.simplefilter("error")
    ...

If I had to give a small justification for it, I would say that the code checking that there is no warning is closer to the code potentially creating the warning, which maybe helps make the intent a bit clearer ...
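A self-contained version of that pattern (plain stdlib; quiet_function is a made-up stand-in for the real code under test):

```python
# Minimal demonstration of the catch_warnings pattern recommended above:
# inside the block, every warning is promoted to an exception, so the test
# fails loudly if the code under test warns.
import warnings

def quiet_function(x):
    return x * 2  # emits no warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any warning into an exception
    result = quiet_function(21)

print(result)  # reached only if no warning was raised
```

If quiet_function called warnings.warn(...), the block would raise (e.g. a UserWarning as an exception) instead of printing.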

Contributor Author


Yes, I see what you mean. @pytest.mark.filterwarnings("error") is actually not used anywhere else in scikit-learn. I found it by googling how to make the test fail when a warning is raised.

Let's use warnings.catch_warnings instead.

Contributor Author
@StefanieSenger StefanieSenger left a comment


Thank you @lesteve for reviewing. I have adopted the suggested changes and replied regarding the documentation change. Would you have another look?



@StefanieSenger
Contributor Author
StefanieSenger commented Jul 5, 2024

I have just reverted the documentation change on _column_transformer.py, which made our conversation there almost unrecoverable.

Here's a link to it, in case we want to get back to it later.

Member
@lesteve lesteve left a comment


LGTM, thanks!

Looks like the changelog check now complains (no idea why ...). I have added the "no changelog required" label because I think removing a spurious warning is certainly a good thing, but it probably does not belong in the changelog.

Member
@thomasjpfan thomasjpfan left a comment


I feel like this should be backported to 1.5.2. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?

@StefanieSenger
Contributor Author
StefanieSenger commented Jul 5, 2024

I feel like this should be backported to 1.5.2. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?

Thanks @thomasjpfan,
I will put the changelog entry into a new 1.5.2 section in the v1.5.rst file.

@thomasjpfan thomasjpfan added this to the 1.5.2 milestone Jul 5, 2024
@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jul 5, 2024
@thomasjpfan thomasjpfan enabled auto-merge (squash) July 5, 2024 21:30
@thomasjpfan thomasjpfan merged commit 938c36b into scikit-learn:main Jul 5, 2024
29 checks passed
@StefanieSenger StefanieSenger deleted the new_branch branch July 6, 2024 04:41
@lesteve
Member
lesteve commented Jul 6, 2024

@thomasjpfan do you have inside knowledge that there will be a 1.5.2 🤔?

I was unsure about the changelog for this one. To be honest, it's a nice way to recognise the contributor's work, but at the same time nobody is going to think "weird, I don't get this spurious warning anymore, let's look at the changelog to see whether they changed something" ...

@lorentzenchr
Member

@StefanieSenger Thanks for this fix!

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 9, 2024
Labels
module:compose No Changelog Needed To backport PR merged in master that need a backport to a release branch defined based on the milestone.

Successfully merging this pull request may close these issues.

TransformedTargetRegressor warns about set_output set to pandas
4 participants