ENH Add get_feature_names_out to FunctionTransformer #21569
Conversation
When `validate=False` and the `feature_names_out` parameter is set, I propose we set `feature_names_in_` and `n_features_in_`, but not validate them during `fit` or `transform`.
As for the API, I am thinking of restricting `feature_names_out` to two options at first (see the sketch after this list):

- `None`: no feature names out
- callable: user-provided function to compute the feature names out

Two more options for follow-up PRs:

- `'one-to-one'`: feature names out == feature names in
- array-like of strings: I am currently unsure about a use case for this option that the callable cannot handle, but we can discuss it in a follow-up.
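Roughly, these options would look like this from the user's side. This is only a sketch based on the API that eventually shipped; the toy data, `np.log1p`, and the `log_` prefix are illustrative choices, not part of the original comment:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# None (default): no feature names out; the method is simply not available.
ft = FunctionTransformer(np.log1p)
print(hasattr(ft.fit(X), "get_feature_names_out"))  # False

# callable: a user-provided function computes the output names.
ft = FunctionTransformer(
    np.log1p,
    feature_names_out=lambda transformer, input_features: [
        f"log_{name}" for name in input_features
    ],
)
print(ft.fit(X).get_feature_names_out())  # ['log_a', 'log_b']

# 'one-to-one' (follow-up option): feature names out == feature names in.
ft = FunctionTransformer(np.log1p, feature_names_out="one-to-one")
print(ft.fit(X).get_feature_names_out())  # ['a', 'b']
```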
Thanks @thomasjpfan. I'll remove the option to set `feature_names_out` to an array-like of strings.
I think the default still needs to be `None`. Let's add `'one-to-one'` as a non-default option.
Commit: make default 'one-to-one'
I just read your message; I had already updated the PR to remove the option to pass an array-like of strings, and I set the default to `'one-to-one'`.
It is, but I do not think we can assume it. If a user passes a function that creates a new column, then that assumption breaks.
We can use the helper defined in `sklearn/utils/metaestimators.py`, line 140 (as of 48e83df).
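This code reference presumably points at the `available_if` helper in `sklearn.utils.metaestimators` (that is an assumption; the original embed only preserves a line number). A minimal sketch of how it makes a method conditionally available on instances:

```python
from sklearn.utils.metaestimators import available_if


class MyTransformer:
    """Toy class: expose get_feature_names_out only when feature_names_out is set."""

    def __init__(self, feature_names_out=None):
        self.feature_names_out = feature_names_out

    @available_if(lambda self: self.feature_names_out is not None)
    def get_feature_names_out(self, input_features=None):
        return list(self.feature_names_out)


print(hasattr(MyTransformer(), "get_feature_names_out"))                          # False
print(hasattr(MyTransformer(feature_names_out=["f0"]), "get_feature_names_out"))  # True
```

With such a gate, a `hasattr` check (for instance the one in `ColumnTransformer.get_feature_names_out`) sees the method as absent whenever `feature_names_out` is not set.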
Thanks @thomasjpfan. I updated the PR to make `None` the default. Right now `get_feature_names_out` raises a `ValueError` if `feature_names_out` is `None`.
I ran black, flake8, make test-coverage, etc., but they didn't catch the issues with the numpydoc (a newline was missing) or with v1.1.rst (someone else had forgotten a backtick). I looked in the Contributing doc, but I can't find instructions for catching these errors before I push the code to GitHub. Did I miss something?
Hi @thomasjpfan, is there anything else you need me to do for this PR?
For some reason the numpydoc validation was done externally and not as part of the main test suite. I am not sure why we do that. We should probably run those checks as part of the main test suite to avoid the confusion.
LGTM. I think the PR in its current state should cover most useful cases. I did not see any particular defect. Just a small improvement suggestion for one of the exception messages below:
Merge commit: …geron/scikit-learn into function_transformer_feature_names_out
Thanks for reviewing, Olivier. I just made the change you suggested.
In such cases, should I pull and merge `main`?
That would not hurt, and if the PR is "CI green ticked", it might get a better chance to attract reviewers' attention :)
Thanks @ogrisel, I merged main, now there's a beautiful green tick. 😊
Thanks for the update @ageron !
LGTM
Thanks for the review. 👍
I copied your function into the scikit-learn install in my environment and tried to use it there. However, I still get the error below, where `preprocessor` is my `ColumnTransformer` and I run the following code:
preprocessor.get_feature_names_out()
The transformer argument looks like this:
('log', FunctionTransformer(np.log1p, validate=True), log_features)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [10], in <cell line: 3>()
1 xt = preprocessor.transform(X_test)
2 #mapie.single_estimator_[1].estimator
----> 3 preprocessor.get_feature_names_out()
File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:481, in ColumnTransformer.get_feature_names_out(self, input_features)
479 transformer_with_feature_names_out = []
480 for name, trans, column, _ in self._iter(fitted=True):
--> 481 feature_names_out = self._get_feature_name_out_for_transformer(
482 name, trans, column, input_features
483 )
484 if feature_names_out is None:
485 continue
File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:446, in ColumnTransformer._get_feature_name_out_for_transformer(self, name, trans, column, feature_names_in)
444 # An actual transformer
445 if not hasattr(trans, "get_feature_names_out"):
--> 446 raise AttributeError(
447 f"Transformer {name} (type {type(trans).__name__}) does "
448 "not provide get_feature_names_out."
449 )
450 if isinstance(column, Iterable) and not all(
451 isinstance(col, str) for col in column
452 ):
453 column = _safe_indexing(feature_names_in, column)
AttributeError: Transformer log (type FunctionTransformer) does not provide get_feature_names_out.
This feature is not released yet and will be released in v1.1. If you want to try out the feature now, you can install the nightly build: pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn
Are you sure this is working as intended? I just installed the nightly build and I still get exactly this error. The code is in my environment; at least `_function_transformer.py` has this method implemented.
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
mean_transformer = FunctionTransformer(
func=np.log1p,
feature_names_out="one-to-one",
validate=True
)
X = pd.DataFrame({"my_feature": [1, 2, 3]})
X_trans = mean_transformer.fit_transform(X)
print(mean_transformer.get_feature_names_out())
# ['my_feature']
Thank you Thomas... sorry for asking all these questions that might be totally obvious :(
Reference Issues/PRs
Follow-up on #18444.
Part of #21308.
This new feature was discussed in #21079.
What does this implement/fix? Explain your changes.
Adds the `get_feature_names_out` method and a new parameter `feature_names_out` to `preprocessing.FunctionTransformer`. By default, `get_feature_names_out` returns the input feature names, but you can set `feature_names_out` to return a different list, which is especially useful when the number of output features differs from the number of input features.

For example, here's a `FunctionTransformer` that outputs a single feature, equal to the input's mean along axis=1:
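The original snippet is not reproduced in this extract, so here is a minimal sketch of such a transformer; the name `mean_transformer` and the output name `"mean"` are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Collapse all input columns into a single "mean" feature.
mean_transformer = FunctionTransformer(
    lambda X: np.asarray(X).mean(axis=1, keepdims=True),
    feature_names_out=lambda transformer, input_features: ["mean"],
)

X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0]})
X_trans = mean_transformer.fit_transform(X)      # shape (2, 1)
print(mean_transformer.get_feature_names_out())  # ['mean']
```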
The `feature_names_out` parameter may also be a callable. This is useful if the output feature names depend on the input feature names, and/or if they depend on parameters like `kw_args`. Here's an example that uses both. It's a transformer that appends `n` random features to the existing features:
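Again, the original snippet is not reproduced here; a sketch of such a transformer, following the two-argument callable signature discussed under "Any other comments?" below (the helper names, the `n` keyword, and the `rand{i}` naming scheme are illustrative):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def append_random_features(X, n=2, random_state=None):
    """Return X with n extra columns of random values appended."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    return np.c_[X, rng.random((X.shape[0], n))]


def random_feature_names(transformer, input_features):
    """Output names: the input names followed by rand0, rand1, ... (depends on kw_args)."""
    n = transformer.kw_args.get("n", 2)
    return list(input_features) + [f"rand{i}" for i in range(n)]


transformer = FunctionTransformer(
    append_random_features,
    kw_args={"n": 3, "random_state": 42},
    feature_names_out=random_feature_names,
    validate=True,
)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_trans = transformer.fit_transform(X)                # shape (2, 5)
print(transformer.get_feature_names_out(["a", "b"]))  # ['a', 'b', 'rand0', 'rand1', 'rand2']
```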
Any other comments?
I have some concerns regarding the fact that `validate` is `False` by default, which means that `n_features_in_` and `feature_names_in_` are not set automatically. So if you create a `FunctionTransformer` with the default `validate=False` and `feature_names_out=None`, then when you call `get_feature_names_out` without any argument, it will raise an exception (unless `transform` was called before and `func` set `n_features_in_` or `feature_names_in_`). I tried to make this clear in the error message, but I'm worried that this will confuse users. Wdyt?

And if `validate=False` and you set `feature_names_out` to a callable, and call `get_feature_names_out` with no arguments, then the callable will get `input_features=None` as input (unless `transform` was called before and `func` set `n_features_in_` or `feature_names_in_`). Users may be surprised by this. Should we output a warning in this case? Wdyt?

Moreover, as shown in the second code example above, the output feature names may depend on `kw_args`, so if `feature_names_out` is a callable, `get_feature_names_out` passes `self` to it, plus the `input_features`. I considered checking `feature_names_out.__code__.co_varnames` to decide whether to pass no arguments, just the `input_features`, or the `input_features` and `self`. But `__code__` is not used anywhere in the code base, and `inspect` is not used much, so I'm not sure whether such introspection would be frowned upon. I decided that it was simple enough to require the callable to always take two arguments: the transformer itself and the `input_features`. Wdyt?

Lastly, when users want to create a `FunctionTransformer` that outputs a single feature, I expect that many will be tempted to set `feature_names_out` to a string instead of a list. To keep things consistent, I decided to raise an exception in this case, with a clear error message telling them to use `["foo"]` instead. Wdyt?
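To make the `validate=False` behavior concrete, here is a small sketch assuming that `feature_names_in_` / `n_features_in_` are recorded at fit time even when `validate=False` (the behavior proposed earlier in this thread); the data and names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

X_df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Fitting on a DataFrame records feature_names_in_ / n_features_in_ even with
# validate=False, so get_feature_names_out() needs no argument here:
ft = FunctionTransformer(np.log1p, feature_names_out="one-to-one", validate=False)
ft.fit(X_df)
print(ft.feature_names_in_)        # ['a' 'b']
print(ft.get_feature_names_out())  # ['a', 'b']

# With a plain ndarray there are no column names to record; a callable that
# builds its own output names (ignoring input_features) still works:
ft = FunctionTransformer(
    np.log1p,
    feature_names_out=lambda transformer, input_features: ["log_total"],
    validate=False,
)
ft.fit(np.array([[1.0], [2.0]]))
print(ft.get_feature_names_out())  # ['log_total']
```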