10000 Feature names with input features by amueller · Pull Request #13307 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Feature names with input features #13307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 63 commits into from

Conversation

amueller
Copy link
Member
@amueller amueller commented Feb 27, 2019

Build on top of #12627 but provides users with a nicer interface:
Again using this example:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html

We now have:

clf.named_steps['classifier'].input_features_

or alternatively:

clf.get_feature_names(X.columns.map(lambda x: x.upper()))
clf.named_steps['classifier'].input_features_

Transformers not implementing get_feature_names after this PR are:
['AdditiveChi2Sampler', 'FunctionTransformer', 'Imputer' (deprecated one), 'IterativeImputer', 'KBinsDiscretizer', 'KernelCenterer', 'KernelPCA', 'MissingIndicator']

possibly others that I can't test, like TfidfTransformer.

One issue I haven't thought about enough is naming for transformers like pca.
Producing "pca0, pca1" etc is good in a pipeline but in a ColumnTransformer it will probably lead to redundant names.

Currently suggested names:
feature_names_in_, feature_names_out_(?), update_feature_names(feature_names_in) (or should it be input_feature_names here? probably better be consistent, right?)

@adrinjalali
Copy link
Member

repeating our discussion: it'd be nice if we had the pipeline set the feature names after each individual fit instead of at the end.

@amueller
Copy link
Member Author

@adrinjalali can you give an example with sklearn steps that would benefit from that?

@amueller
Copy link
Member Author

I guess it's @jnothman's example of preprocessing different features created by a count vectorizer depending on the words. But I think we decided we don't want to support that, right?
Even if you set them, that wouldn't help you as the next estimator doesn't have access to the previous estimator.

'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
['city_category__city_London', 'city_category__city_Paris', 'city_category__city_Sallisaw',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat backwards-incompatible but maybe not that bad?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rename get_feature_names, we no longer have the problem of backward incompatibility.

@amueller
Copy link
Member Author
amueller commented Apr 5, 2019

How would you implement feature names for FunctionTransformer? Have a feature_names parameter that can be 'passthrough' or a callable?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche did you post your SLEP draft? Can you make it a PR on the SLEP repo?

Sorry for the slow reply here. The draft I wrote a month ago: https://hackmd.io/mip6r4t6QriuUUtnWTllKw?both#

I will update it one of the coming days with the latest discussions here, and then make a PR on the SLEP repo

@adrinjalali
Copy link
Member

@jorisvandenbossche I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.

@jorisvandenbossche
Copy link
Member

I finally put up a PR with the SLEP text I wrote in March: scikit-learn/enhancement_proposals#18

I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.

@adrinjalali Thanks! I didn't yet address any of the comments (just cleaned up the version that was in that draft). So could you copy your comments onto the PR?

amueller added 3 commits May 31, 2019 10:57
# Conflicts:
#	examples/compose/plot_column_transformer_mixed_types.py
#	sklearn/base.py
#	sklearn/impute.py
#	sklearn/pipeline.py
#	sklearn/tests/test_base.py
#	sklearn/tests/test_pipeline.py
@ajing
Copy link
ajing commented Feb 2, 2020

Any updates on this PR?

@adrinjalali
Copy link
Member

@ajing the conversation is being continued across a bunch of different PRs, you can follow the conversation mostly here:
scikit-learn/enhancement_proposals#17
scikit-learn/enhancement_proposals#18
scikit-learn/enhancement_proposals#25

@thomasjpfan

This comment has been minimized.

@adrinjalali
Copy link
Member

The main issue with having the pipeline passing around feature names, is that the transformers won't have access to the feature names at fit time. An alternative is to pass the feature names as a parameter to fit, which breaks our current API and I guess with @amueller 's implementations we decided not to go down that path.

@thomasjpfan
Copy link
Member

How would having the feature names at fit time benefit a user?

@adrinjalali
Copy link
Member

It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which are the sensitive features.

@thomasjpfan
Copy link
Member

Ah I have placed #13307 (comment) in the wrong issue, I was suppose to put this one the get that defined get_feature_names everywhere: #12627

It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which are the sensitive features.

I think getting the feature names through and allowing the transformers to use the feature names through DataArray can be seen as two separate problems. Imagine we have DataArray, how do we get the feature names inputed in classifier in a pipeline? Either we call pipe[:-1].transform(dataarray) and look at the columns or we store the names in the classifier and directly inspect the columns. Calling transform seems a bit much just to get the feature names and storing all the names in the classifier seems a little wasteful. (I am always uneasy with adding more requirements to the API.)

@jnothman
Copy link
Member
jnothman commented May 22, 2020 via email

Copy link
Member
@thomasjpfan thomasjpfan left a comment
10000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some API thoughts.

Comment on lines +162 to +166
>>> pipe.get_feature_names(iris.feature_names)
>>> pipe.named_steps.select.input_features_
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> pipe.named_steps.clf.input_features_
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?

You can also provide custom feature names for a more human readable format using
``get_feature_names``::

>>> pipe.get_feature_names(iris.feature_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returning nothing is strange. It's more of "set_input_feature_names".

@amueller
Copy link
Member Author
amueller commented Jul 26, 2021

having "feature_names_in" on estimators is tricky because it requires us to think about feature name propagation within non-transformer meta-estimators. I think I'm fine with only having output feature names, in particular now that we have pipeline slicing.

See #18444 for the most up-to-date solution on this issue.

@amueller amueller closed this Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0