ENH Add get_feature_names_out to FunctionTransformer by ageron · Pull Request #21569 · scikit-learn/scikit-learn · GitHub

ENH Add get_feature_names_out to FunctionTransformer #21569


Merged

Conversation

@ageron (Contributor) commented Nov 6, 2021

Reference Issues/PRs

Follow-up on #18444.
Part of #21308.
This new feature was discussed in #21079.

What does this implement/fix? Explain your changes.

Adds the get_feature_names_out method and a new parameter feature_names_out to preprocessing.FunctionTransformer. By default, get_feature_names_out returns the input feature names, but you can set feature_names_out to return a different list, which is especially useful when the number of output features differs from the number of input features.

For example, here's a FunctionTransformer that outputs a single feature, equal to the input's mean along axis=1:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

mean_transformer = FunctionTransformer(
    func=lambda X: X.mean(axis=1, keepdims=True),
    feature_names_out=["mean"]
)

X_trans = mean_transformer.fit_transform(np.random.rand(10,2))
print(mean_transformer.get_feature_names_out())  # prints ['mean']

The feature_names_out parameter may also be a callable. This is useful if the output feature names depend on the input feature names, and/or if they depend on parameters like kw_args. Here's an example that uses both. It's a transformer that appends n random features to existing features:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_n_random_features(X, n):
    return np.concatenate([X, np.random.rand(len(X), n)], axis=1)

def feature_names_out(transformer, input_features):
    n = transformer.kw_args["n"]
    return list(input_features) + [f"rnd{i}" for i in range(n)]

transformer = FunctionTransformer(
    func=add_n_random_features,
    feature_names_out=feature_names_out,
    kw_args=dict(n=3),
    validate=True,  # IMPORTANT (see discussion below)
)

df = pd.DataFrame({"a": np.random.rand(100), "b": np.random.rand(100)})
X_trans = transformer.fit_transform(df)
print(transformer.get_feature_names_out())  # prints ['a' 'b' 'rnd0' 'rnd1' 'rnd2']

Any other comments?

I have some concerns about the fact that validate is False by default, which means that n_features_in_ and feature_names_in_ are not set automatically. So if you create a FunctionTransformer with the default validate=False and feature_names_out=None, then calling get_feature_names_out without any argument raises an exception (unless transform was called before and func set n_features_in_ or feature_names_in_). I tried to make this clear in the error message, but I'm worried that this will confuse users. Wdyt?

And if validate=False and you set feature_names_out to a callable, and call get_feature_names_out with no arguments, then the callable will get input_features=None as input (unless transform was called before and func set n_features_in_ or feature_names_in_). Users may be surprised by this. Should we output a warning in this case? Wdyt?

Moreover, as shown in the second code example above, the output feature names may depend on kw_args, so if feature_names_out is a callable, get_feature_names_out passes self to it, plus the input_features. I considered checking feature_names_out.__code__.co_varnames to decide whether to pass no arguments, just the input_features, or the input_features and self. But __code__ is not used anywhere in the code base, and inspect is not used much, so I'm not sure whether such introspection would be frowned upon. I decided that it was simple enough to require the callable to always take two arguments: the transformer itself, and the input_features. Wdyt?

Lastly, when users want to create a FunctionTransformer that outputs a single feature, I expect that many will be tempted to set feature_names_out to a string instead of a list. To keep things consistent, I decided to raise an exception in this case, and have a clear error message to tell them to use ["foo"] instead. Wdyt?

@thomasjpfan (Member) left a comment


When validate=False and the feature_names_out parameter is set, I propose we set feature_names_in_ and n_features_in_, but not validate them during fit or transform.

As for the API, I am thinking of restricting feature_names_out to two options at first:

  • None: no feature names out
  • callable: user-provided function to compute the output feature names

Two more options for follow-up PRs:

  • 'one-to-one': feature names out == feature names in
  • array-like of strings: I am currently unsure about any use case for this option that the callable cannot cover, but we can discuss it in a follow-up.

@ageron (Contributor, Author) commented Nov 6, 2021

Thanks @thomasjpfan. I'll remove the option to set feature_names_out to an array-like of strings in this PR (my goal was to have a simple API for what I expect to be a common use case, but it's not much harder to write a callable, and the API may be simpler with just one option). I'll also replace the default option (currently None) with "one-to-one": I like this idea, it's simple enough to change, and it's more explicit.
Sounds good?

@thomasjpfan (Member)

I'll also replace the default option (currently None) with "one-to-one": I like this idea, it's simple enough to change, and it's more explicit.

I think the default still needs to be None. I do not think the function transformer can assume 'one-to-one' in the general case.

Let's add 'one-to-one' in a future PR and get the callable in first with this PR. The smaller the PR the easier it is to review and get merged.


@ageron (Contributor, Author) commented Nov 6, 2021

I just read your message; I had already updated the PR to remove the option to pass an array-like of strings, and I set the default to 'one-to-one'.
Isn't the most common use case for FunctionTransformer something like FunctionTransformer(func=np.log)? If so, then one-to-one is a reasonable default, isn't it? I can revert to None, but what should get_feature_names_out do in this case? Raise an exception saying "you must set 'feature_names_out'"?

@thomasjpfan (Member)

Isn't the most common use case for FunctionTransformer something like FunctionTransformer(func=np.log)? If so, then one-to-one is a reasonable default, isn't it?

It is, but I do not think we can assume it. If a user passes a function that creates a column, then 'one-to-one' would be wrong. I think it is better to have the user explicitly state it.

I can revert to None, but what should get_feature_names_out do in this case? Raise an exception saying "you must set 'feature_names_out'"?

We can use available_if to conditionally expose get_feature_names_out depending on whether feature_names_out is set:

def available_if(check):
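To illustrate the pattern, here is a simplified, pure-Python sketch of what sklearn.utils.metaestimators.available_if does (not the actual implementation): the decorator wraps the method in a descriptor whose __get__ raises AttributeError when the check fails, so hasattr reports the method as absent.

```python
import functools

class _AvailableIfDescriptor:
    """Descriptor that hides a method when check(instance) is falsy."""

    def __init__(self, fn, check):
        self.fn = fn
        self.check = check

    def __get__(self, obj, owner=None):
        # Raising AttributeError here makes hasattr() return False.
        if obj is not None and not self.check(obj):
            raise AttributeError(
                f"{owner.__name__} object has no attribute {self.fn.__name__!r}"
            )
        return functools.partial(self.fn, obj)

def available_if(check):
    return lambda fn: _AvailableIfDescriptor(fn, check)

# Hypothetical demo class, not part of scikit-learn:
class DemoTransformer:
    def __init__(self, feature_names_out=None):
        self.feature_names_out = feature_names_out

    @available_if(lambda self: self.feature_names_out is not None)
    def get_feature_names_out(self, input_features=None):
        return list(self.feature_names_out)

print(hasattr(DemoTransformer(), "get_feature_names_out"))   # False
print(DemoTransformer(["mean"]).get_feature_names_out())     # ['mean']
```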

@ageron (Contributor, Author) commented Nov 6, 2021

Thanks @thomasjpfan. I updated the PR to make None the default. Right now get_feature_names_out raises a ValueError if feature_names_out is None, but I'll use available_if instead, thanks for the tip.

@ageron (Contributor, Author) commented Nov 7, 2021

I ran black, flake8, make test-coverage, etc., but they didn't catch the issues with the numpydoc (a newline was missing) or with v1.1.rst (someone else had forgotten a backtick). I looked in the Contributing doc, but I can't find instructions to catch these errors before I push the code to GitHub. Did I miss something?

@ageron (Contributor, Author) commented Nov 14, 2021

Hi @thomasjpfan, is there anything else you need me to do for this PR?

@ogrisel (Member) commented Nov 17, 2021

I ran black, flake8, make test-coverage, etc., but they didn't catch the issues with the numpydoc (a newline was missing) or with v1.1.rst (someone else had forgotten a backtick). I looked in the Contributing doc, but I can't find instructions to catch these errors before I push the code to GitHub. Did I miss something?

For some reason the numpydoc validation was done externally and not as part of the main test suite. I am not sure why we do that; we should probably run those checks as part of the main test suite to avoid the confusion.

@ogrisel (Member) left a comment


LGTM. I think the PR in its current state should cover most useful cases. I did not see any particular defect. Just a small improvement suggestion for one of the exception messages below:

@ageron (Contributor, Author) commented Nov 17, 2021

Thanks for reviewing, Olivier. I just made the change you suggested.

@ogrisel (Member) commented Nov 23, 2021

test_calibration_display_default_labels[None-_line1] that is broken on the CI should have been fixed in the main branch.

@ageron (Contributor, Author) commented Nov 24, 2021

test_calibration_display_default_labels[None-_line1] that is broken on the CI should have been fixed in the main branch.

In such cases, should I pull and merge main into the PR branch?

@ogrisel (Member) commented Nov 25, 2021

In such cases, should I pull and merge main into the PR branch?

That would not hurt, and if the PR is "CI green ticked", it might get a better chance to attract reviewers' attention :)

@ageron (Contributor, Author) commented Nov 25, 2021

Thanks @ogrisel, I merged main, now there's a beautiful green tick. 😊

@thomasjpfan (Member) left a comment


Thanks for the update @ageron !

@thomasjpfan thomasjpfan changed the title Add get_feature_names_out to FunctionTransformer ENH Add get_feature_names_out to FunctionTransformer Nov 30, 2021
@thomasjpfan (Member) left a comment


LGTM

@thomasjpfan thomasjpfan merged commit d23d589 into scikit-learn:main Nov 30, 2021
@ageron (Contributor, Author) commented Nov 30, 2021

Thanks for the review. 👍

@nilslacroix left a comment


I copied your function into my scikit-learn environment and tried to use it. However, I still get the error below, where preprocessor is my ColumnTransformer and I run the following code:
preprocessor.get_feature_names_out()

The transformer argument looks like this:

('log', FunctionTransformer(np.log1p, validate=True), log_features)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [10], in <cell line: 3>()
      1 xt = preprocessor.transform(X_test)
      2 #mapie.single_estimator_[1].estimator
----> 3 preprocessor.get_feature_names_out()

File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:481, in ColumnTransformer.get_feature_names_out(self, input_features)
    479 transformer_with_feature_names_out = []
    480 for name, trans, column, _ in self._iter(fitted=True):
--> 481     feature_names_out = self._get_feature_name_out_for_transformer(
    482         name, trans, column, input_features
    483     )
    484     if feature_names_out is None:
    485         continue

File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:446, in ColumnTransformer._get_feature_name_out_for_transformer(self, name, trans, column, feature_names_in)
    444 # An actual transformer
    445 if not hasattr(trans, "get_feature_names_out"):
--> 446     raise AttributeError(
    447         f"Transformer {name} (type {type(trans).__name__}) does "
    448         "not provide get_feature_names_out."
    449     )
    450 if isinstance(column, Iterable) and not all(
    451     isinstance(col, str) for col in column
    452 ):
    453     column = _safe_indexing(feature_names_in, column)

AttributeError: Transformer log (type FunctionTransformer) does not provide get_feature_names_out.

@thomasjpfan (Member)

This feature is not released yet and will be released in v1.1. If you want to try out the feature now, you can install the nightly build:

pip install --pre --extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn

@nilslacroix commented Apr 28, 2022

Are you sure this is working as intended? I just installed the nightly build and I still get exactly this error. The code is in my environment; at least _function_transformer.py has this method implemented.


preprocessor = ColumnTransformer(
    transformers=
    [   #('enc_obj', onehot, ["PropertyType"]),    #noworks
        ('pow', PowerTransformer(method="yeo-johnson"), pow_features),   #works
        ('log', FunctionTransformer(np.log1p, validate=True), log_features), #noworks
        ('enc_plz', BinaryEncoder(verbose=1), ["Postcode"]), #noworks
        ('enc_obj', OneHotEncoder(), ["PropertyType"]),    #noworks
        ("target", LeaveOneOutEncoder(), ["MeanPricePostcode"]),      #works
        ('features', "passthrough", unskewed_features)     #works
        
    ],
        n_jobs=-2)

pipeline2 = Pipeline(steps=[("preprocessor", preprocessor),
                           ("scaler", scaler),
                           ('clf', LinearSVR()),
                          ])        
pipeline2.fit(X_train, y_train)
pipeline2[:-1].get_feature_names_out()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [18], in <cell line: 8>()
      5 pipeline2.fit(X_train, y_train)
      6 #explainer = RegressionExplainer(pipeline2, X_test, y_test, shap="guess", n_jobs = multiprocessing.cpu_count()-1)
      7 #db1 = ExplainerDashboard(explainer)
----> 8 pipeline2[:-1].get_feature_names_out()

File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\pipeline.py:733, in Pipeline.get_feature_names_out(self, input_features)
    727     if not hasattr(transform, "get_feature_names_out"):
    728         raise AttributeError(
    729             "Estimator {} does not provide get_feature_names_out. "
    730             "Did you mean to call pipeline[:-1].get_feature_names_out"
    731             "()?".format(name)
    732         )
--> 733     feature_names_out = transform.get_feature_names_out(feature_names_out)
    734 return feature_names_out

File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:479, in ColumnTransformer.get_feature_names_out(self, input_features)
    477 transformer_with_feature_names_out = []
    478 for name, trans, column, _ in self._iter(fitted=True):
--> 479     feature_names_out = self._get_feature_name_out_for_transformer(
    480         name, trans, column, input_features
    481     )
    482     if feature_names_out is None:
    483         continue

File ~\miniconda3\envs\Master_ML\lib\site-packages\sklearn\compose\_column_transformer.py:447, in ColumnTransformer._get_feature_name_out_for_transformer(self, name, trans, column, feature_names_in)
    445 # An actual transformer
    446 if not hasattr(trans, "get_feature_names_out"):
--> 447     raise AttributeError(
    448         f"Transformer {name} (type {type(trans).__name__}) does "
    449         "not provide get_feature_names_out."
    450     )
    451 return trans.get_feature_names_out(names)

AttributeError: Transformer log (type FunctionTransformer) does not provide get_feature_names_out.

@thomasjpfan (Member)

FunctionTransformer requires setting the feature_names_out parameter for get_feature_names_out to be available. For simple one-to-one transformations, you can set feature_names_out to "one-to-one":

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(
    func=np.log1p,
    feature_names_out="one-to-one",
    validate=True
)

X = pd.DataFrame({"my_feature": [1, 2, 3]})

X_trans = log_transformer.fit_transform(X)
print(log_transformer.get_feature_names_out())
# ['my_feature']

@nilslacroix

Thank you Thomas ... sorry for asking all these questions that might be totally obvious :(

@ageron ageron deleted the function_transformer_feature_names_out branch May 2, 2022 22:59