Feature names with input features #13307
Conversation
# Conflicts:
#   sklearn/base.py
#   sklearn/impute.py
#   sklearn/preprocessing/data.py
Repeating our discussion: it'd be nice if we had the pipeline set the feature names after each individual fit instead of at the end.
@adrinjalali can you give an example with sklearn steps that would benefit from that?
I guess it's @jnothman's example of preprocessing different features created by a count vectorizer depending on the words. But I think we decided we don't want to support that, right?
… into get_input_features
doc/modules/compose.rst
Outdated
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
['city_category__city_London', 'city_category__city_Paris', 'city_category__city_Sallisaw',
This is somewhat backwards-incompatible but maybe not that bad?
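For context, a rough reconstruction of the compose.rst example those names come from, using the example data from the scikit-learn docs. This is a sketch only; the method is called get_feature_names here as in this PR, while newer scikit-learn versions expose it as get_feature_names_out.

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "title": ["His Last Bow", "How Watson Learned the Trick",
              "A Moveable Feast", "The Grapes of Wrath"],
})

column_trans = ColumnTransformer(
    [("city_category", OneHotEncoder(dtype="int"), ["city"]),
     ("title_bow", CountVectorizer(), "title")],
    remainder="drop",
)
column_trans.fit(X)

# Depending on the scikit-learn version, the encoder part of the names is
# positional ('city_category__x0_London') or based on the input column name
# ('city_category__city_London', as in the diff above) -- the latter is the
# backwards-incompatible change being discussed.
print(column_trans.get_feature_names())
```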
If we rename get_feature_names, we no longer have the problem of backward incompatibility.
How would you implement feature names for
Sorry for the slow reply here. The draft I wrote a month ago: https://hackmd.io/mip6r4t6QriuUUtnWTllKw?both# I will update it in the coming days with the latest discussions here, and then make a PR on the SLEP repo.
@jorisvandenbossche I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.
I finally put up a PR with the SLEP text I wrote in March: scikit-learn/enhancement_proposals#18
@adrinjalali Thanks! I didn't yet address any of the comments (just cleaned up the version that was in that draft). So could you copy your comments onto the PR?
# Conflicts:
#   examples/compose/plot_column_transformer_mixed_types.py
#   sklearn/base.py
#   sklearn/impute.py
#   sklearn/pipeline.py
#   sklearn/tests/test_base.py
#   sklearn/tests/test_pipeline.py
Any updates on this PR?
@ajing the conversation is being continued across a bunch of different PRs; you can follow it mostly here:
The main issue with having the pipeline pass around feature names is that the transformers won't have access to the feature names at fit time.
How would having the feature names at fit time benefit a user?
It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third-party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which features are the sensitive ones.
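A hypothetical sketch of that kind of third-party transformer, to make the fit-time need concrete. The class name and the feature_names_in fit parameter are illustrative only; they are not an existing scikit-learn API, just one way such names could be passed in.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropSensitiveFeatures(BaseEstimator, TransformerMixin):
    """Illustrative only: must know which columns are sensitive at fit time."""

    def __init__(self, sensitive=("gender", "age")):
        self.sensitive = sensitive

    def fit(self, X, y=None, feature_names_in=None):
        # Hypothetical: feature names handed in as a fit parameter.
        if feature_names_in is None:
            raise ValueError("feature names are required at fit time")
        # Remember which column indices are *not* sensitive.
        self.keep_idx_ = [i for i, name in enumerate(feature_names_in)
                          if name not in self.sensitive]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_idx_]
```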
Ah, I placed #13307 (comment) in the wrong issue; I was supposed to put it on the one that defined
I think getting the feature names through and allowing the transformers to use the feature names through
Any estimator dealing with heterogeneous features should have access to feature names / descriptors / annotations. There is no reason to believe that all annotations can be determined from dtype or distribution within a sample.
Some API thoughts.
>>> pipe.get_feature_names(iris.feature_names)
>>> pipe.named_steps.select.input_features_
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> pipe.named_steps.clf.input_features_
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?
You can also provide custom feature names for a more human readable format using
``get_feature_names``::

>>> pipe.get_feature_names(iris.feature_names)
This returning nothing is strange. It's more of "set_input_feature_names".
having "feature_names_in" on estimators is tricky because it requires us to think about feature name propagation within non-transformer meta-estimators. I think I'm fine with only having output feature names, in particular now that we have pipeline slicing. See #18444 for the most up-to-date solution on this issue. |
Builds on top of #12627 but provides users with a nicer interface:
Again using this example:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html
We now have:
or alternatively:
Transformers not implementing get_feature_names after this PR are:
['AdditiveChi2Sampler', 'FunctionTransformer', 'Imputer' (deprecated one), 'IterativeImputer', 'KBinsDiscretizer', 'KernelCenterer', 'KernelPCA', 'MissingIndicator']
possibly others that I can't test, like TfidfTransformer.
One issue I haven't thought about enough is naming for transformers like PCA.
Producing "pca0", "pca1", etc. is fine in a Pipeline, but in a ColumnTransformer it will probably lead to redundant names (see the sketch below).
Currently suggested names: feature_names_in_, feature_names_out_ (?), update_feature_names(feature_names_in) (or should it be input_feature_names here? Probably better to be consistent, right?)
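A hypothetical sketch of how those names might read on a fitted transformer. Of the three, only feature_names_in_ later became a real scikit-learn attribute (>= 1.0, when fitting on a DataFrame); the others are just the proposals above, not an implemented API.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [0.0, 1.0], "b": [2.0, 3.0]})
scaler = StandardScaler().fit(X)

print(scaler.feature_names_in_)  # names seen at fit time, e.g. ['a' 'b']

# Proposed only, not implemented:
# scaler.feature_names_out_                  # names produced by transform
# scaler.update_feature_names(["x1", "x2"])  # recompute names from new input names
```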