-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX TransformedTargetRegressor
warns when set_output
expects dataframe
#29401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
TransformedTargetRegressor
warns when set_output
expects dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, a few comments.
@@ -350,7 +350,7 @@ def set_output(self, *, transform=None): | |||
- `"default"`: Default output format of a transformer | |||
- `"pandas"`: DataFrame output | |||
- `"polars"`: Polars output | |||
- `None`: Transform configuration is unchanged | |||
- `None`: Transform configuration is unchanged, global configuration is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it is not necessarily using the global configuration, you could imagine something like this:
estimator.set_output(transform='pandas')
estimator.set_output(transform=None)
The global configuration is transform='default'
but the estimator is using transform='pandas'
More a meta comment: I get the fact that it may not be very clear from the docstring what transform=None
does but I don't think there is an easy way to improve the situation in the docstring. Full disclosure: I am not sure about the use case for transform=None
actually 🤔.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was thinking that we cannot actively set transform=None
, because this is what the docstring says: transform : {"default", "pandas", "polars"}, default=None
.
Also, I have just discovered, there is
supported_outputs = ADAPTERS_MANAGER.supported_outputs
if dense_config not in supported_outputs:
raise ValueError(
f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
)
to prevent us from setting transform=None
.
It can only stay None
if it is untouched, and then the global configuration is used within the _get_output_config()
function.
I'm happy to revert this if it's wrong, but I will keep my suggestions for now to facilitate discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transform : {"default", "pandas", "polars"}, default=None
I guess the docstring is slightly misleading then. It seems like you can use transform=None
:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
X, y = load_iris(as_frame=True, return_X_y=True)
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.set_output(transform=None)
scaler.fit(X)
# this outputs a pandas DataFrame so transform=None did not change transform="pandas"
X_scaled = scaler.transform(X)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is used as a sentinel. In the future, f set_output
supported predict
, then the default transform=None
will leave transform configuration alone:
scaler.set_output(predict="pandas")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your example got me really confused @lesteve , because I had tried a similar experiment with FunctionTransformer
, that had led me to think that we cannot pass transform=None
:
from sklearn.datasets import load_iris
from sklearn.preprocessing import FunctionTransformer
#set_config(transform_output="pandas") # same as using set_output before transform=None
X, y = load_iris(as_frame=True, return_X_y=True)
functrans = FunctionTransformer()
functrans.set_output(transform="pandas")
functrans.set_output(transform=None)
functrans.fit(X)
# this raises a ValueError:
functrans.transform(X)
ValueError: output config must be in ['default', 'pandas', 'polars'], got None
It seems that FunctionTransformer().set_output()
and _SetOutputMixin.set_output()
(from StandardScaler
) are handeling this differently. The latter has an early if transform is None: return self
, while the FunctionTransformer
can pass self._sklearn_output_config = {transform: None} to _get_output_config()
, which would then raise the above error message.
I am not sure what to make of this, though...
Is this wanted/expected or should we allign this?
I think I should revert the documentation change, since transform=None
is sometimes possible to pass and then the previous setting is used, not necessarily the global configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK well, I think we should fix the bug first and leave the docstring modification for another PR when we understand a bit more the .set_output
thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. I will revert the documentation changes then.
sklearn/compose/tests/test_target.py
Outdated
@@ -393,3 +393,17 @@ def test_transform_target_regressor_pass_extra_predict_parameters(): | |||
regr.fit(X, y) | |||
regr.predict(X, check_input=False) | |||
assert regr.regressor_.predict_called | |||
|
|||
|
|||
@pytest.mark.filterwarnings("error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find this pattern used in other places in our test? Personally I would maybe find it a bit less clear than the recommendation from Pytest for checking that no warnings is emitted:
with warnings.catch_warnings():
warnings.simplefilter("error")
...
If I had to give a small justification for it, I would say that the code checking that there is no warning is closer to code potentially creating the warning which maybe helps making the intent a bit clearer ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see what you mean. @pytest.mark.filterwarnings("error")
is actually not used anywhere else in scikit-learn. I was finding it by googling how to make the test not pass, when a warning is raised.
Let's better use warnings.catch_warnings
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @lesteve for reviewing. I have adopted the suggested changes and answered to the documentation change. Would you have another look?
sklearn/compose/tests/test_target.py
Outdated
@@ -393,3 +393,17 @@ def test_transform_target_regressor_pass_extra_predict_parameters(): | |||
regr.fit(X, y) | |||
regr.predict(X, check_input=False) | |||
assert regr.regressor_.predict_called | |||
|
|||
|
|||
@pytest.mark.filterwarnings("error") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see what you mean. @pytest.mark.filterwarnings("error")
is actually not used anywhere else in scikit-learn. I was finding it by googling how to make the test not pass, when a warning is raised.
Let's better use warnings.catch_warnings
instead.
@@ -350,7 +350,7 @@ def set_output(self, *, transform=None): | |||
- `"default"`: Default output format of a transformer | |||
- `"pandas"`: DataFrame output | |||
- `"polars"`: Polars output | |||
- `None`: Transform configuration is unchanged | |||
- `None`: Transform configuration is unchanged, global configuration is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was thinking that we cannot actively set transform=None
, because this is what the docstring says: transform : {"default", "pandas", "polars"}, default=None
.
Also, I have just discovered, there is
supported_outputs = ADAPTERS_MANAGER.supported_outputs
if dense_config not in supported_outputs:
raise ValueError(
f"output config must be in {sorted(supported_outputs)}, got {dense_config}"
)
to prevent us from setting transform=None
.
It can only stay None
if it is untouched, and then the global configuration is used within the _get_output_config()
function.
I'm happy to revert this if it's wrong, but I will keep my suggestions for now to facilitate discussion.
I have just reverted the documentation change on Here's a link to it, in case we want to get back to it later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Looks like the check changelog thing now complains (no idea why ...), I have added a "no changelog required" because I think removing a spurious warning is certainly a good thing but probably does not belong in the changelog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this should be backported to 1.5.2
. I see that this PR is marked "no changelog required", but can we add one for 1.5.2?
Thanks @thomasjpfan, |
@thomasjpfan do you have inside knowledge that there will be a 1.5.2 🤔? I was unsure about the changelog for this one to be honest it's nice way to recognise the contributor work, but at the same time nobody is going to be like "weird I don't get this spurious warning anymore, let's look at the changelog to see whether they changed something" ... |
@StefanieSenger Thanks for this fix! |
Reference Issues/PRs
closes #29361
What does this implement/fix? Explain your changes.
Prevents warnings on output type to be raised on
TransformedTargetRegressor().fit()
when global output is set to "pandas" or "polars".Any other comments?
Did this together with @lesteve: we figured that
TransformedTargetRegressor
is in fact not a transformer and therefore there should not be a warning, neither would we expect that X is returned as a DataFrame, nor do we need to raise an exception.Setting
set_config(transform_output="pandas")
should have no effect within 10000TransformedTargetRegressor().fit()
.I have also checked other places where
FunctionTransformer
is used, but these occurances all refer to transformers. So there should be no other places where we would need to apply the same fix.In the second commit, I also suggest a documentation change to explain better the difference between
set_output(transform=None)
andset_output(transform="default")
(see this comment).