Feature names with input features by amueller · Pull Request #13307 · scikit-learn/scikit-learn · GitHub
Feature names with input features #13307


Closed
wants to merge 63 commits into from
Changes from all commits

63 commits
ab2acbd
work on get_feature_names for pipeline
amueller Nov 20, 2018
3bc674b
fix SimpleImputer get_feature_names
amueller Nov 20, 2018
1c4a78f
use hasattr(transform) to check whether to use final estimator in get…
amueller Nov 20, 2018
7881930
add some docstrings
amueller Nov 20, 2018
de63353
fix docstring
amueller Nov 27, 2018
8835f3b
Merge branch 'master' into pipeline_get_feature_names
amueller Feb 27, 2019
6ca8b03
add set_feature_names to pipeline, remove hack in pipeline.get_featur…
amueller Feb 27, 2019
85db7bf
Merge branch 'master' into get_input_features
amueller Feb 27, 2019
ddd0341
fix to use new _iter, deal with last transformer
amueller Feb 27, 2019
ba053ac
always call generation of feature names, generate if X has none.
amueller Feb 27, 2019
5da2207
add get_feature_names to feature selection estimators
amueller Feb 27, 2019
58d65b1
add basic test for input features in pipeline
amueller Feb 27, 2019
8026d8d
pep8, fixup docstring
amueller Feb 27, 2019
6a61ed9
add test for count vectorizer
amueller Feb 27, 2019
e0c0a54
add test for passthrough
amueller Feb 27, 2019
968163b
add tests for pandas feature names
amueller Feb 27, 2019
3fd5f6d
add feature plot with feature names to pipeline anova example
amueller Feb 27, 2019
d7c66e1
Improve the titanic column transformer example
ogrisel Feb 27, 2019
b330841
don't error when get_feature_names is not available in pipeline
amueller Feb 27, 2019
8da4ebd
start on user guide for input_features_
amueller Feb 27, 2019
533dac3
Merge branch 'get_input_features' of github.com:amueller/scikit-learn…
amueller Feb 27, 2019
372eb71
Add example for input_features_ in pipeline userguide
amueller Feb 27, 2019
66eb4e6
use self.input_features_ in get_feature_names if available.
amueller Feb 27, 2019
0d8dc70
ignore logreg deprecations
amueller Feb 27, 2019
7550aac
remove set_feature_names, reuse get_feature_names
amueller Feb 28, 2019
4287cb8
slightly easier to debug get_feature_names recursion, better test
amueller Feb 28, 2019
eb78eac
really ugly stuff to make the last 1% usecase work
amueller Feb 28, 2019
7fa6950
Merge branch 'master' into get_input_features
amueller Feb 28, 2019
d373b87
barh instead of bar in example
amueller Feb 28, 2019
4d4e6c6
test "simple" nested meta-estimator
amueller Feb 28, 2019
003fcf3
allow None in pipelines get_feature_names, don't overwrite
amueller Feb 28, 2019
f185af3
nicer error on not fitted pipeline
amueller Feb 28, 2019
acc4c76
flake8
amueller Feb 28, 2019
eef87b6
better error message, allow call to get_feature_names with None again…
amueller Feb 28, 2019
4ed56c8
replace too-smart solution with explicit simple solution for meta-est…
amueller Feb 28, 2019
8787e04
convert feature names from pandas to numpy array
amueller Feb 28, 2019
c057599
Fix get_feature_name docstrings
amueller Feb 28, 2019
fca9ac2
fix pipeline get_feature_names docstring
amueller Feb 28, 2019
d660f92
minor fix for meta-estimators with array estimators
amueller Feb 28, 2019
a74f4c4
ignore more deprecation warnings from logistic
amueller Feb 28, 2019
eefe54c
refinement of _get_sub_estimators, add crazy test
amueller Mar 1, 2019
ad48edf
typo / make crazy test pass
amueller Mar 1, 2019
4bbd8cd
add get_feature_names to TransformerMixin, overwrite in random tree e…
amueller Mar 1, 2019
8bf4960
Merge branch 'master' into get_input_features
amueller Mar 1, 2019
eb9aa52
fix docstrings
amueller Mar 1, 2019
fe4a020
add "init_" and "best_estimator_" to list of sub estimators
amueller Mar 1, 2019
750906b
pep8
amueller Mar 1, 2019
0ca6e9d
fix class name formatting, add test for pca feature names in pipeline
amueller Mar 1, 2019
7cd3dd0
ignore warnings from changing init parameters
amueller Mar 1, 2019
a85ab5e
common test for feature name length
amueller Mar 1, 2019
8001cdb
renamed one hot encoder for more intuitive feature names
amueller Mar 1, 2019
fa00af0
LDA Special case fixes
amueller Mar 1, 2019
ccfc971
only check feature names if they exist to be nice to contrib estimators
amueller Mar 1, 2019
2dae339
hackety hack
amueller Mar 1, 2019
534c4ed
Better titanic interpretation
ogrisel Mar 6, 2019
dc1c349
Phrasing in example
ogrisel Mar 7, 2019
089c65d
Apply suggestions from code review
adrinjalali Mar 7, 2019
cf86af0
Merge branch 'master' into get_input_features
amueller Mar 7, 2019
bf1b1ad
Merge branch 'get_input_features' of github.com:amueller/scikit-learn…
amueller Mar 7, 2019
bba7b8c
Merge branch 'master' into get_input_features
amueller May 31, 2019
4c17e96
fix merge issue
amueller May 31, 2019
2733d20
fix impute feature names after file was moved. merging fun
amueller May 31, 2019
bdf6cb7
Merge branch 'master' into get_input_features
amueller May 21, 2020
38 changes: 32 additions & 6 deletions doc/modules/compose.rst
@@ -139,6 +139,32 @@ or by name::
>>> pipe['reduce_dim']
PCA()

To enable model inspection, `Pipeline` sets an ``input_features_`` attribute on
all pipeline steps during fitting. This allows the user to understand how
features are transformed as they pass through a pipeline::

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> pipe = Pipeline(steps=[
... ('select', SelectKBest(k=2)),
... ('clf', LogisticRegression())])
>>> pipe.fit(iris.data, iris.target)
... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
Pipeline(memory=None,
steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe.named_steps.clf.input_features_
array(['x2', 'x3'], dtype='<U2')

You can also provide custom feature names for a more human readable format using
``get_feature_names``::

>>> pipe.get_feature_names(iris.feature_names)
Member:

This returning nothing is strange. It's more of "set_input_feature_names".

>>> pipe.named_steps.select.input_features_
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> pipe.named_steps.clf.input_features_
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Comment on lines +162 to +166
Member:

Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?

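The propagation described above can be sketched in plain Python. Everything below is a hypothetical stand-in (the `Select` class is not a scikit-learn estimator); it only illustrates the idea of each step recording the names it receives and reporting the names of its outputs:

```python
class Select:
    """Hypothetical transformer that keeps only the features in `indices`."""
    def __init__(self, indices):
        self.indices = indices

    def get_feature_names(self, input_features):
        return [input_features[i] for i in self.indices]


def propagate_feature_names(steps, input_features):
    names = list(input_features)
    for name, transformer in steps:
        # mirror the PR: record what each step received ...
        transformer.input_features_ = names
        # ... and ask it for the names of its outputs
        names = transformer.get_feature_names(names)
    return names


iris_names = ['sepal length (cm)', 'sepal width (cm)',
              'petal length (cm)', 'petal width (cm)']
select = Select([2, 3])
final = propagate_feature_names([('select', select)], iris_names)
```

With the stand-in above, `select.input_features_` holds the full iris names and `final` holds the two surviving petal features, matching the doctest output shown above.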

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_feature_selection_plot_feature_selection_pipeline.py`
@@ -428,7 +454,7 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... [('categories', OneHotEncoder(dtype='int'),['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')

@@ -438,11 +464,11 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::
('title_bow', CountVectorizer(), 'title')])

>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
['categories__city_London', 'categories__city_Paris',
'categories__city_Sallisaw', 'title_bow__bow', 'title_bow__feast',
'title_bow__grapes', 'title_bow__his', 'title_bow__how', 'title_bow__last',
'title_bow__learned', 'title_bow__moveable', 'title_bow__of', 'title_bow__the',
'title_bow__trick', 'title_bow__watson', 'title_bow__wrath']

>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
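The ``<transformer name>__<feature name>`` prefixing shown above can be sketched standalone; the transformer names and feature lists below are illustrative, not taken from a fitted estimator:

```python
def prefixed_feature_names(named_outputs):
    # ColumnTransformer joins each transformer's name and its output
    # feature names with a double underscore.
    return ['{}__{}'.format(name, feat)
            for name, feats in named_outputs
            for feat in feats]


names = prefixed_feature_names([
    ('categories', ['city_London', 'city_Paris']),
    ('title_bow', ['bow', 'feast']),
])
```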
44 changes: 44 additions & 0 deletions examples/compose/plot_column_transformer_mixed_types.py
@@ -145,6 +145,50 @@
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))


###############################################################################
# Inspecting the coefficient values of the classifier
###############################################################################
# The coefficients of the final classification step of the pipeline give an
# idea of how each feature impacts the likelihood of survival, assuming that
# the usual linear model assumptions hold (uncorrelated features, linear
# separability, homoscedastic errors...), which we do not verify in this
# example.
#
# To get error bars we perform cross-validation and compute the mean and
# standard deviation for each coefficient across CV splits. Because we use a
# standard scaler on the numerical features, the coefficient weights give us
# an idea of how much the log odds of surviving are impacted by a change in
# this dimension contrasted to the mean. Note that the categorical features
# here are overspecified, which makes it slightly harder to interpret because
# of the information redundancy.
#
# We can see that the linear model coefficients are in agreement with the
# historical reports: people in higher classes and therefore in the upper decks
# were the first to reach the lifeboats, and often, priority was given to women
# and children.
#
# Note that conditioned on the "pclass_x" one-hot features, the "fare"
# numerical feature does not seem to be significantly predictive. If we drop
# the "pclass" feature, then higher "fare" values would appear significantly
# correlated with a higher likelihood of survival as the "fare" and "pclass"
# features have a strong statistical dependency.

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=42)
cv_results = cross_validate(clf, X_train, y_train, cv=cv,
return_estimator=True)
cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
Member: As mentioned IRL, I think that we could expose the last estimator of the pipeline as "last_estimator_" and use it here so that this code snippet (which is a very common one) is independent of the name given to the last step:

    cv_pipeline.last_estimator_.coef_

and below

    pipeline.last_estimator_.input_features_

Member: If we do so, I guess last_step_ would be better so that we don't have to deal with the last step potentially being only a transformer and so on.

Member: A transformer is regarded as an estimator. I think last_step_ is sufficient and succinct, though.

Member (Author): we now have [-1] which I think is fine.
for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(clf.named_steps["classifier"].input_features_,
cv_coefs.mean(axis=0), xerr=cv_coefs.std(axis=0))
plt.tight_layout()
plt.show()

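The mean/standard-deviation computation across CV splits can be sketched with the standard library alone; the coefficient values below are made up for illustration (one row per CV split, one column per feature):

```python
import statistics

# hypothetical coefficients from three CV splits, two features each
split_coefs = [
    [0.5, -1.0],
    [0.7, -0.8],
    [0.6, -0.9],
]

# transpose so each entry collects one feature's coefficient across splits
per_feature = list(zip(*split_coefs))
means = [statistics.mean(c) for c in per_feature]
stds = [statistics.stdev(c) for c in per_feature]
```

In the example, `np.concatenate` over `cv_results["estimator"]` plays the role of `split_coefs`, and `mean(axis=0)` / `std(axis=0)` play the roles of `means` / `stds`.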
###############################################################################
# The resulting score is not exactly the same as the one from the previous
# pipeline because the dtype-based selector treats the ``pclass`` columns as
7 changes: 5 additions & 2 deletions examples/feature_selection/plot_feature_selection_pipeline.py
@@ -9,6 +9,7 @@
Using a sub-pipeline, the fitted coefficients can be mapped back into
the original feature space.
"""
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
@@ -36,5 +37,7 @@
y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

coef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)
print(coef)
# access and plot the coefficients of the fitted model
plt.barh((0, 1, 2), anova_svm[-1].coef_.ravel())
plt.yticks((0, 1, 2), anova_svm[-1].input_features_)
plt.show()
117 changes: 117 additions & 0 deletions sklearn/base.py
@@ -6,20 +6,23 @@
import copy
import warnings
from collections import defaultdict

import platform
import inspect
import re

import numpy as np

from . import __version__
from .exceptions import NotFittedError
from ._config import get_config
from .utils import _IS_32BIT
from .utils.validation import check_X_y
from .utils.validation import check_array
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _deprecate_positional_args


_DEFAULT_TAGS = {
'non_deterministic': False,
'requires_positive_X': False,
@@ -688,6 +691,49 @@ def fit_transform(self, X, y=None, **fit_params):
# fit method of arity 2 (supervised transformation)
return self.fit(X, y, **fit_params).transform(X)

def get_feature_names(self, input_features=None):
"""Get output feature names.

Parameters
----------
input_features : list of string or None
String names of the input features.

Returns
-------
output_feature_names : list of string
Feature names for transformer output.
"""
# OneToOneMixin is higher in the class hierarchy
# because we put mixins on the wrong side
if hasattr(super(), 'get_feature_names'):
Member: use try-except? It's more efficient...

return super().get_feature_names(input_features)
# generate feature names from class name by default
# would be much less guessing if we stored the number
# of output features.
# Ideally this would be done in each class.
if hasattr(self, 'n_clusters'):
# this is before n_components_
# because n_components_ means something else
# in agglomerative clustering
n_features = self.n_clusters
elif hasattr(self, '_max_components'):
# special case for LinearDiscriminantAnalysis
n_components = self.n_components or np.inf
n_features = min(self._max_components, n_components)
elif hasattr(self, 'n_components_'):
# n_components could be auto or None
# this is more likely to be an int
n_features = self.n_components_
elif hasattr(self, 'n_components') and self.n_components is not None:
n_features = self.n_components
elif hasattr(self, 'components_'):
n_features = self.components_.shape[0]
else:
return None
return ["{}{}".format(type(self).__name__.lower(), i)
for i in range(n_features)]

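The fallback above can be exercised standalone; `FakePCA` is a hypothetical stand-in that only mimics the ``n_components_`` attribute, and only that one branch of the attribute chain is sketched here:

```python
class FakePCA:
    """Hypothetical transformer exposing only n_components_."""
    n_components_ = 2


def default_feature_names(est):
    # mirror the fallback: use the reported output width if available ...
    if hasattr(est, 'n_components_'):
        n_features = est.n_components_
    else:
        return None
    # ... and generate "<lowercased class name><index>" names
    return ['{}{}'.format(type(est).__name__.lower(), i)
            for i in range(n_features)]


names = default_feature_names(FakePCA())
```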

class DensityMixin:
"""Mixin class for all density estimators in scikit-learn."""
@@ -736,10 +782,81 @@ def fit_predict(self, X, y=None):
return self.fit(X).predict(X)


class OneToOneMixin(object):
"""Provides get_feature_names for simple transformers

Assumes there's a 1-to-1 correspondence between input features
and output features.
"""

def get_feature_names(self, input_features=None):
"""Get feature names for transformation.

Returns input_features as this transformation
doesn't add or drop features.

Parameters
----------
input_features : array-like of string
Input feature names.

Returns
-------
feature_names : array-like of string
Transformed feature names
"""
if input_features is not None:
return input_features
else:
raise ValueError("Don't know how to get"
" input feature names for {}".format(self))

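A minimal standalone sketch of the one-to-one behaviour; `OneToOne` and `FakeScaler` are hypothetical stand-ins, not the scikit-learn classes:

```python
class OneToOne:
    """Hypothetical mixin: output names are exactly the input names."""
    def get_feature_names(self, input_features=None):
        if input_features is not None:
            return input_features
        raise ValueError("Don't know how to get"
                         " input feature names for {}".format(self))


class FakeScaler(OneToOne):
    pass


names = FakeScaler().get_feature_names(['age', 'fare'])
```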

def _get_sub_estimators(est):
Member: Shouldn't we solve this issue by adding a "sub_estimators_" attribute on all meta-estimators?

Member (Author): Why is that easier / better?

Member: Well, it's the meta-estimator's responsibility to define what its sub-estimators are. The list above should be modified when we add meta-estimators. It cannot be extended by other packages implementing meta-estimators. I find that in an object-oriented design, it is more natural that each class deals with specifying its own functionality.

Member (Author): If we add meta-estimators that don't obey our conventions a test will fail. Other packages implementing meta-estimators should either obey our conventions or implement their own get_feature_names. If you wanted me to implement this on the meta-estimator, I would probably just make this a method of the MetaEstimatorMixin. Then other packages / estimators could overwrite that method instead of overwriting get_feature_names. I don't think that's better.

Member: Indeed, there are several ways to delegate the responsibility to the meta-estimator, either via delegating to a method, or to an attribute. The attribute seemed pretty light to me. With regards to avoiding premature test failures, we can simply pass when sub_estimator_ is not present.

Member (Author): A new sub-estimator was added while this discussion was happening, lol. The reason why I don't like the attribute is that it makes our API contract more verbose. I'm ok with doing it, but in the end this will make it harder for people to contribute by increasing the number of required attributes (I'd rather add n_input_features_ and n_output_features_ first ;). I'm not sure I understand your last sentence. I don't really care that much in either direction here; it's an implementation detail that's not really important to the larger feature addition, I think.

# Explicitly declare all fitted subestimators of existing meta-estimators
sub_ests = []
# OHE is not really needed
sub_names = ['estimator_', 'base_estimator_', 'one_hot_encoder_',
'best_estimator_', 'init_']
for name in sub_names:
sub_est = getattr(est, name, None)
if sub_est is not None:
sub_ests.append(sub_est)
if hasattr(est, "estimators_"):
if hasattr(est.estimators_, 'shape'):
sub_ests.extend(est.estimators_.ravel())
else:
sub_ests.extend(est.estimators_)
return sub_ests

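The attribute-based discovery can be sketched on a hypothetical meta-estimator; `FakeBagging` is made up, and the `ravel()` handling for ndarray-valued `estimators_` is omitted for brevity:

```python
SUB_NAMES = ['estimator_', 'base_estimator_', 'one_hot_encoder_',
             'best_estimator_', 'init_']


def get_sub_estimators(est):
    # collect singly-named fitted sub-estimators first ...
    subs = [getattr(est, name) for name in SUB_NAMES
            if getattr(est, name, None) is not None]
    # ... then any list of fitted estimators
    if hasattr(est, 'estimators_'):
        subs.extend(est.estimators_)
    return subs


class FakeBagging:
    """Hypothetical bagging-like meta-estimator after fitting."""
    base_estimator_ = 'template-tree'
    estimators_ = ['tree-0', 'tree-1']


subs = get_sub_estimators(FakeBagging())
```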

class MetaEstimatorMixin:
_required_parameters = ["estimator"]
"""Mixin class for all meta estimators in scikit-learn."""

def get_feature_names(self, input_features=None):
"""Ensure feature names are set on sub-estimators

Parameters
----------
input_features : list of string or None
Input features to the meta-estimator.
"""
sub_ests = _get_sub_estimators(self)
for est in sub_ests:
est.input_features_ = input_features
if hasattr(est, "get_feature_names"):
# doing hasattr instead of a try-except on everything
# b/c catching AttributeError makes recursive code
# impossible to debug
try:
est.get_feature_names(input_features=input_features)
except TypeError:
# do we need this?
est.get_feature_names()
except NotFittedError:
pass
Member: Can a meta-estimator sometimes be a transformer? If so, we should return feature names here in that specific case?

Member (Author): SelectFromModel is both.



class MultiOutputMixin:
"""Mixin to mark estimators that support multioutput."""
6 changes: 5 additions & 1 deletion sklearn/compose/_column_transformer.py
@@ -371,8 +371,12 @@ def get_feature_names(self):
raise AttributeError("Transformer %s (type %s) does not "
"provide get_feature_names."
% (str(name), type(trans).__name__))
try:
more_names = trans.get_feature_names(input_features=column)
except TypeError:
more_names = trans.get_feature_names()
feature_names.extend([name + "__" + f for f in
trans.get_feature_names()])
more_names])
return feature_names

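The try/except above keeps the call compatible with transformers whose ``get_feature_names`` does not (yet) accept ``input_features``. A standalone sketch with two hypothetical transformer classes:

```python
class NewStyle:
    """Hypothetical transformer accepting input_features."""
    def get_feature_names(self, input_features=None):
        return ['%s_enc' % f for f in input_features]


class OldStyle:
    """Hypothetical transformer with the old zero-argument signature."""
    def get_feature_names(self):
        return ['tok0', 'tok1']


def call_get_feature_names(trans, column):
    try:
        return trans.get_feature_names(input_features=column)
    except TypeError:
        # old signature: fall back to calling without input_features
        return trans.get_feature_names()


new_names = call_get_feature_names(NewStyle(), ['city'])
old_names = call_get_feature_names(OldStyle(), ['title'])
```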
def _update_fitted_transformers(self, transformers):
13 changes: 13 additions & 0 deletions sklearn/compose/tests/test_column_transformer.py
@@ -23,6 +23,7 @@
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline


class Trans(BaseEstimator):
@@ -659,6 +660,18 @@ def test_column_transformer_get_feature_names():
assert_raise_message(AttributeError,
"Transformer trans (type Trans) does not provide "
"get_feature_names", ct.get_feature_names)

# if some transformers support and some don't
ct = ColumnTransformer([('trans', Trans(), [0, 1]),
('scale', StandardScaler(), [0])])
ct.fit(X_array)
assert_raise_message(AttributeError,
"Transformer trans (type Trans) does not provide "
"get_feature_names", ct.get_feature_names)

# inside a pipeline
make_pipeline(ct).fit(X_array)


# working example
X = np.array([[{'a': 1, 'b': 2}, {'a': 3, 'b': 4}],
9 changes: 9 additions & 0 deletions sklearn/ensemble/_forest.py
@@ -2357,3 +2357,12 @@ def transform(self, X):
"""
check_is_fitted(self)
return self.one_hot_encoder_.transform(self.apply(X))

def get_feature_names(self, input_features=None):
"""Feature names - not implemented yet.

Parameters
----------
input_features : list of strings or None
"""
return None
15 changes: 15 additions & 0 deletions sklearn/feature_selection/_base.py
@@ -125,6 +125,21 @@ def inverse_transform(self, X):
Xt[:, support] = X
return Xt

def get_feature_names(self, input_features=None):
"""Mask feature names according to selected features.

Parameters
----------
input_features : list of string or None
Input features to select from. If None, they are generated as
x0, x1, ..., xn.
"""
mask = self.get_support()
if input_features is None:
input_features = ['x%d' % i
for i in range(mask.shape[0])]
return np.array(input_features)[mask]

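The masking logic can be sketched without NumPy; the mask and names below are illustrative, chosen to mirror the iris example earlier in this PR:

```python
def selected_feature_names(mask, input_features=None):
    # generate x0, x1, ... when no names were provided
    if input_features is None:
        input_features = ['x%d' % i for i in range(len(mask))]
    # keep only the names whose support-mask entry is True
    return [f for f, keep in zip(input_features, mask) if keep]


kept = selected_feature_names(
    [False, False, True, True],
    ['sepal length (cm)', 'sepal width (cm)',
     'petal length (cm)', 'petal width (cm)'])
```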

def _get_feature_importances(estimator, getter, transform_func=None,
norm_order=1):
Expand Down