Feature names with input features #13307
Conversation
# Conflicts:
#   sklearn/base.py
#   sklearn/impute.py
#   sklearn/preprocessing/data.py
Repeating our discussion: it'd be nice if we had the pipeline set the feature names after each individual fit instead of at the end.
@adrinjalali can you give an example with sklearn steps that would benefit from that?
I guess it's @jnothman's example of preprocessing different features created by a count vectorizer depending on the words. But I think we decided we don't want to support that, right?
… into get_input_features
doc/modules/compose.rst
Outdated
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
['city_category__city_London', 'city_category__city_Paris', 'city_category__city_Sallisaw',
This is somewhat backwards-incompatible but maybe not that bad?
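For context, a rough reconstruction of the compose.rst example those names come from, using the example data from the scikit-learn docs. This is a sketch only; the method is called get_feature_names here as in this PR, while newer scikit-learn versions expose it as get_feature_names_out.

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "title": ["His Last Bow", "How Watson Learned the Trick",
              "A Moveable Feast", "The Grapes of Wrath"],
})

column_trans = ColumnTransformer(
    [("city_category", OneHotEncoder(dtype="int"), ["city"]),
     ("title_bow", CountVectorizer(), "title")],
    remainder="drop",
)
column_trans.fit(X)

# Depending on the scikit-learn version, the encoder part of the names is
# positional ('city_category__x0_London') or based on the input column name
# ('city_category__city_London', as in the diff above) -- the latter is the
# backwards-incompatible change being discussed.
print(column_trans.get_feature_names())
```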
If we rename get_feature_names, we no longer have the problem of backward incompatibility.
How would you implement feature names for
Sorry for the slow reply here. The draft I wrote a month ago: https://hackmd.io/mip6r4t6QriuUUtnWTllKw?both# I will update it in the coming days with the latest discussions here, and then make a PR on the SLEP repo.
@jorisvandenbossche I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.
I finally put up a PR with the SLEP text I wrote in March: scikit-learn/enhancement_proposals#18
@adrinjalali Thanks! I didn't yet address any of the comments (just cleaned up the version that was in that draft). So could you copy your comments onto the PR?
# Conflicts:
#   examples/compose/plot_column_transformer_mixed_types.py
#   sklearn/base.py
#   sklearn/impute.py
#   sklearn/pipeline.py
#   sklearn/tests/test_base.py
#   sklearn/tests/test_pipeline.py
Any updates on this PR?
@ajing the conversation is being continued across a bunch of different PRs; you can follow it mostly here:
The main issue with having the pipeline pass around feature names is that the transformers won't have access to the feature names at fit time.
How would having the feature names at fit time benefit a user?
It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third-party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which features are the sensitive ones.
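A hypothetical sketch of that kind of third-party transformer, to make the fit-time need concrete. The class name and the feature_names_in fit parameter are illustrative only; they are not an existing scikit-learn API, just one way such names could be passed in.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropSensitiveFeatures(BaseEstimator, TransformerMixin):
    """Illustrative only: must know which columns are sensitive at fit time."""

    def __init__(self, sensitive=("gender", "age")):
        self.sensitive = sensitive

    def fit(self, X, y=None, feature_names_in=None):
        # Hypothetical: feature names handed in as a fit parameter.
        if feature_names_in is None:
            raise ValueError("feature names are required at fit time")
        # Remember which column indices are *not* sensitive.
        self.keep_idx_ = [i for i, name in enumerate(feature_names_in)
                          if name not in self.sensitive]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_idx_]
```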
Ah, I placed #13307 (comment) in the wrong issue; I was supposed to put it on the one that defined
I think getting the feature names through and allowing the transformers to use the feature names through
Any estimator dealing with heterogeneous features should have access to feature names / descriptors / annotations. There is no reason to believe that all annotations can be determined from dtype or distribution within a sample.
Some API thoughts.
>>> pipe.get_feature_names(iris.feature_names)
>>> pipe.named_steps.select.input_features_
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> pipe.named_steps.clf.input_features_
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?
You can also provide custom feature names for a more human readable format using
``get_feature_names``::

>>> pipe.get_feature_names(iris.feature_names)
This returning nothing is strange. It's more of "set_input_feature_names".
having "feature_names_in" on estimators is tricky because it requires us to think about feature name propagation within non-transformer meta-estimators. I think I'm fine with only having output feature names, in particular now that we have pipeline slicing. See #18444 for the most up-to-date solution on this issue. |
Builds on top of #12627 but provides users with a nicer interface:
Again using this example:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html
We now have:
or alternatively:
Transformers not implementing get_feature_names after this PR are:
['AdditiveChi2Sampler', 'FunctionTransformer', 'Imputer' (deprecated one), 'IterativeImputer', 'KBinsDiscretizer', 'KernelCenterer', 'KernelPCA', 'MissingIndicator']
possibly others that I can't test, like TfidfTransformer.
One issue I haven't thought about enough is naming for transformers like PCA.
Producing "pca0", "pca1", etc. is fine in a Pipeline, but in a ColumnTransformer it will probably lead to redundant names (see the sketch below).
Currently suggested names: feature_names_in_, feature_names_out_ (?), update_feature_names(feature_names_in) (or should it be input_feature_names here? Probably better to be consistent, right?)
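A hypothetical sketch of how those names might read on a fitted transformer. Of the three, only feature_names_in_ later became a real scikit-learn attribute (>= 1.0, when fitting on a DataFrame); the others are just the proposals above, not an implemented API.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [0.0, 1.0], "b": [2.0, 3.0]})
scaler = StandardScaler().fit(X)

print(scaler.feature_names_in_)  # names seen at fit time, e.g. ['a' 'b']

# Proposed only, not implemented:
# scaler.feature_names_out_                  # names produced by transform
# scaler.update_feature_names(["x1", "x2"])  # recompute names from new input names
```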