RFC Implement Pipeline get feature names #12627


Closed · wants to merge 30 commits

Commits (30)
ab2acbd
work on get_feature_names for pipeline
amueller Nov 20, 2018
3bc674b
fix SimpleImputer get_feature_names
amueller Nov 20, 2018
1c4a78f
use hasattr(transform) to check whether to use final estimator in get…
amueller Nov 20, 2018
7881930
add some docstrings
amueller Nov 20, 2018
de63353
fix docstring
amueller Nov 27, 2018
8835f3b
Merge branch 'master' into pipeline_get_feature_names
amueller Feb 27, 2019
2eba5de
fix merge issues with master
amueller May 30, 2019
449ed23
fix merge issue
amueller May 31, 2019
a1fcf67
Merge branch 'master' into pipeline_get_feature_names
amueller May 21, 2020
b929341
don't do magic slicing in pipeline.get_feature_names
amueller May 21, 2020
2b613e5
fix merge issue
amueller May 21, 2020
ad66b86
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
amueller May 24, 2020
5eb7603
trying to merge with input feature pr
amueller Jun 2, 2020
f4f832a
Merge branch 'master' into pipeline_get_feature_names
amueller Jun 2, 2020
3a9054c
remove tests that don't apply
amueller Jun 2, 2020
9c4420d
Merge branch 'pipeline_get_feature_names' of github.com:amueller/scik…
amueller Jun 2, 2020
76f5b54
fix onetoone mixin feature names
amueller Jun 2, 2020
52f38e1
remove more tests
amueller Jun 2, 2020
cdda1fb
fix test for better expected outputs
amueller Jun 2, 2020
5f4abbc
fix priorities in catch-all get_feature_names
amueller Jun 2, 2020
4305a28
flake8
amueller Jun 2, 2020
c387b5b
remove redundant code
amueller Jun 2, 2020
2fefb67
fix error message
amueller Jun 2, 2020
a6832c3
fix mixin order
amueller Jun 2, 2020
0f45b22
small refactor with helper function
amueller Jun 2, 2020
4717a73
linting for new options
amueller Jun 3, 2020
a658ba7
add feature names to lineardiscriminantanalysis and birch
amueller Jun 3, 2020
e9e45af
add get_feature_names in a couple more places
amueller Jun 3, 2020
5acaced
fix up docs
amueller Jun 3, 2020
0353f69
make example actually work
amueller Jun 3, 2020
41 changes: 31 additions & 10 deletions doc/modules/compose.rst
@@ -139,6 +139,27 @@ or by name::
>>> pipe['reduce_dim']
PCA()

To enable model inspection, `Pipeline` has a ``get_feature_names()`` method,
just like all transformers. You can use pipeline slicing to get the feature names
going into each step::

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> pipe = Pipeline(steps=[
... ('select', SelectKBest(k=2)),
... ('clf', LogisticRegression())])
>>> pipe.fit(iris.data, iris.target)
Pipeline(steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe[:-1].get_feature_names()
array(['x2', 'x3'], dtype='<U2')

You can also provide custom feature names for a more human-readable format using
``get_feature_names``::

>>> pipe[:-1].get_feature_names(iris.feature_names)
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_feature_selection_plot_feature_selection_pipeline.py`
@@ -428,21 +449,21 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... [('categories', OneHotEncoder(dtype='int'),['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')

>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
ColumnTransformer(transformers=[('categories', OneHotEncoder(dtype='int'),
['city']),
('title_bow', CountVectorizer(), 'title')])

>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
['categories__city_London', 'categories__city_Paris',
'categories__city_Sallisaw', 'title_bow__bow', 'title_bow__feast',
'title_bow__grapes', 'title_bow__his', 'title_bow__how', 'title_bow__last',
'title_bow__learned', 'title_bow__moveable', 'title_bow__of', 'title_bow__the',
'title_bow__trick', 'title_bow__watson', 'title_bow__wrath']

>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
@@ -459,7 +480,7 @@ to specify the column as a list of strings (``['city']``).

Apart from a scalar or a single item list, the column selection can be specified
as a list of multiple items, an integer array, a slice, a boolean mask, or
with a :func:`~sklearn.compose.make_column_selector`. The
:func:`~sklearn.compose.make_column_selector` is used to select columns based
on data type or column name::

@@ -544,8 +565,8 @@ many estimators. This visualization is activated by setting the
>>> # displays HTML representation in a Jupyter context
>>> column_trans # doctest: +SKIP

An example of the HTML output can be seen in the
**HTML representation of Pipeline** section of
:ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
As an alternative, the HTML can be written to a file using
:func:`~sklearn.utils.estimator_html_repr`::
44 changes: 44 additions & 0 deletions examples/compose/plot_column_transformer_mixed_types.py
@@ -145,6 +145,50 @@
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))


###############################################################################
# Inspecting the coefficient values of the classifier
###############################################################################
# The coefficients of the final classification step of the pipeline give an
# idea of how each feature impacts the likelihood of survival, assuming that
# the usual linear model assumptions hold (uncorrelated features, linear
# separability, homoscedastic errors...), which we do not verify in this
# example.
#
# To get error bars we perform cross-validation and compute the mean and
# standard deviation for each coefficient across CV splits. Because we use a
# standard scaler on the numerical features, the coefficient weights give us
# an idea of how much the log odds of surviving are impacted by a change in
# a given dimension, contrasted to the mean. Note that the categorical
# features here are overspecified, which makes them slightly harder to
# interpret because of the information redundancy.
#
# We can see that the linear model coefficients are in agreement with the
# historical reports: people in higher classes and therefore in the upper decks
# were the first to reach the lifeboats, and often, priority was given to women
# and children.
#
# Note that conditioned on the "pclass_x" one-hot features, the "fare"
# numerical feature does not seem to be significantly predictive. If we drop
# the "pclass" feature, then higher "fare" values would appear significantly
# correlated with a higher likelihood of survival as the "fare" and "pclass"
# features have a strong statistical dependency.

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=42)
cv_results = cross_validate(clf, X_train, y_train, cv=cv,
return_estimator=True)
cv_coefs = np.concatenate([cv_pipeline[-1].coef_
for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(clf[:-1].get_feature_names(),
cv_coefs.mean(axis=0), xerr=cv_coefs.std(axis=0))
plt.tight_layout()
plt.show()

###############################################################################
# The resulting score is not exactly the same as the one from the previous
# pipeline because the dtype-based selector treats the ``pclass`` columns as
9 changes: 6 additions & 3 deletions examples/feature_selection/plot_feature_selection_pipeline.py
@@ -9,6 +9,7 @@
Using a sub-pipeline, the fitted coefficients can be mapped back into
the original feature space.
"""
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
@@ -20,7 +21,7 @@

# import some data to play with
X, y = make_classification(
n_features=20, n_informative=3, n_redundant=0, n_classes=4,
n_features=20, n_informative=3, n_redundant=0, n_classes=2,
n_clusters_per_class=2)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
@@ -36,5 +37,7 @@
y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

coef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)
print(coef)
# access and plot the coefficients of the fitted model
plt.barh((0, 1, 2), anova_svm[-1].coef_.ravel())
plt.yticks((0, 1, 2), anova_svm[:-1].get_feature_names())
plt.show()
68 changes: 68 additions & 0 deletions sklearn/base.py
@@ -17,6 +17,7 @@
from .utils import _IS_32BIT
from .utils.validation import check_X_y
from .utils.validation import check_array
from .utils._feature_names import _make_feature_names
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _deprecate_positional_args

@@ -689,6 +690,45 @@ def fit_transform(self, X, y=None, **fit_params):
# fit method of arity 2 (supervised transformation)
return self.fit(X, y, **fit_params).transform(X)

def get_feature_names(self, input_features=None):
amueller (Member Author): We can push this down if people think having it here is ugly.

"""Get output feature names.

Parameters
----------
input_features : list of string or None
String names of the input features.

Returns
-------
output_feature_names : list of string
Feature names for transformer output.
"""
# generate feature names from the class name by default;
# there would be much less guessing if we stored the number
# of output features. Ideally this would be done in each class.
if hasattr(self, 'n_clusters'):
# this is before n_components_
# because n_components_ means something else
# in agglomerative clustering
n_features = self.n_clusters
elif hasattr(self, '_max_components'):
amueller (Member Author): whoops this can be removed, it's in the class now

# special case for LinearDiscriminantAnalysis
n_components = self.n_components or np.inf
n_features = min(self._max_components, n_components)
elif hasattr(self, 'n_components_'):
# n_components could be auto or None
# this is more likely to be an int
n_features = self.n_components_
elif hasattr(self, 'components_'):
n_features = self.components_.shape[0]
elif hasattr(self, 'n_components') and self.n_components is not None:
n_features = self.n_components
else:
return None
return _make_feature_names(n_features=n_features,
prefix=type(self).__name__.lower())
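
Note: the `_make_feature_names` helper imported at the top of this file is not
shown in this diff. Reconstructed from its two call sites
(`_make_feature_names(n_features=..., prefix=...)` here and
`_make_feature_names(self.n_features_in_, input_features=...)` in
`OneToOneMixin` below), a minimal sketch of its likely behavior, not the PR's
actual code, could look like this:

import numpy as np

def _make_feature_names(n_features, prefix='x', input_features=None):
    # Hypothetical reconstruction of sklearn/utils/_feature_names.
    # If the caller supplied names, pass them through unchanged
    # (this is what OneToOneMixin relies on).
    if input_features is not None:
        if len(input_features) != n_features:
            raise ValueError("input_features has %d entries, expected %d"
                             % (len(input_features), n_features))
        return np.asarray(input_features)
    # Otherwise generate defaults: 'x0', 'x1', ... or 'pca0', 'pca1', ...
    return np.asarray(['%s%d' % (prefix, i) for i in range(n_features)])

This would be consistent with the doctests above, where default names come
back as a numpy string array (``array(['x2', 'x3'], dtype='<U2')``).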


class DensityMixin:
"""Mixin class for all density estimators in scikit-learn."""
@@ -737,6 +777,34 @@ def fit_predict(self, X, y=None):
return self.fit(X).predict(X)


class OneToOneMixin(object):
"""Provides get_feature_names for simple transformers

Assumes there's a 1-to-1 correspondence between input features
and output features.
"""

def get_feature_names(self, input_features=None):
"""Get feature names for transformation.

Returns input_features as this transformation
doesn't add or drop features.

Parameters
----------
input_features : array-like of string
Input feature names.

Returns
-------
feature_names : array-like of string
Transformed feature names
"""

return _make_feature_names(self.n_features_in_,
input_features=input_features)
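
As a usage sketch (assuming this branch is installed, and assuming a
one-to-one transformer such as `StandardScaler` inherits this mixin, which
the diff does not show):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0., 1.], [2., 3.]])
scaler = StandardScaler().fit(X)
# Default names are generated from n_features_in_, e.g. ['x0', 'x1'].
print(scaler.get_feature_names())
# User-provided names are passed through unchanged.
print(scaler.get_feature_names(['age', 'income']))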


class MetaEstimatorMixin:
_required_parameters = ["estimator"]
"""Mixin class for all meta estimators in scikit-learn."""
18 changes: 18 additions & 0 deletions sklearn/cluster/_birch.py
@@ -15,6 +15,7 @@
from ..utils import check_array
from ..utils.extmath import row_norms
from ..utils.validation import check_is_fitted, _deprecate_positional_args
from ..utils._feature_names import _make_feature_names
from ..exceptions import ConvergenceWarning
from . import AgglomerativeClustering

@@ -656,3 +657,20 @@ def _global_clustering(self, X=None):

if compute_labels:
self.labels_ = self.predict(X)

def get_feature_names(self, input_features=None):
"""Get output feature names.

Parameters
----------
input_features : list of string or None
String names of the input features.

Returns
-------
output_feature_names : list of string
Feature names for transformer output.
"""
return _make_feature_names(
n_features=self.subcluster_centers_.shape[0],
prefix=type(self).__name__.lower())
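
Unlike `KMeans` below, where the number of output features is the
``n_clusters`` constructor parameter, `Birch` derives it from the fitted
``subcluster_centers_``, since its ``transform`` computes one distance per
discovered subcluster. A sketch against this branch:

from sklearn.cluster import Birch
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
brc = Birch(n_clusters=None).fit(X)
# One name per discovered subcluster, e.g. ['birch0', 'birch1', ...];
# the count is data-dependent rather than a constructor parameter.
print(brc.get_feature_names())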
35 changes: 35 additions & 0 deletions sklearn/cluster/_kmeans.py
@@ -28,6 +28,7 @@
from ..utils import check_random_state
from ..utils.validation import check_is_fitted, _check_sample_weight
from ..utils._openmp_helpers import _openmp_effective_n_threads
from ..utils._feature_names import _make_feature_names
from ..exceptions import ConvergenceWarning
from ._k_means_fast import _inertia_dense
from ._k_means_fast import _inertia_sparse
@@ -1215,6 +1216,23 @@ def score(self, X, y=None, sample_weight=None):
return -_labels_inertia(X, sample_weight, x_squared_norms,
self.cluster_centers_)[1]

def get_feature_names(self, input_features=None):
"""Get output feature names.

Parameters
----------
input_features : list of string or None
String names of the input features.

Returns
-------
output_feature_names : list of string
Feature names for transformer output.
"""
return _make_feature_names(
n_features=self.n_clusters,
prefix=type(self).__name__.lower())


def _mini_batch_step(X, sample_weight, x_squared_norms, centers, weight_sums,
old_center_buffer, compute_squared_diff,
@@ -1871,3 +1889,20 @@ def predict(self, X, sample_weight=None):

X = self._check_test_data(X)
return self._labels_inertia_minibatch(X, sample_weight)[0]

def get_feature_names(self, input_features=None):
"""Get output feature names.

Parameters
----------
input_features : list of string or None
String names of the input features.

Returns
-------
output_feature_names : list of string
Feature names for transformer output.
"""
return _make_feature_names(
n_features=self.n_clusters,
prefix=type(self).__name__.lower())
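
For both `KMeans` and `MiniBatchKMeans`, ``transform`` maps each sample to
its distances from the ``n_clusters`` centers, so the output names simply
enumerate the clusters. A sketch (requires this PR's branch):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, random_state=0).fit(X)
# Expected with the default naming: ['kmeans0', 'kmeans1', 'kmeans2']
print(km.get_feature_names())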
6 changes: 5 additions & 1 deletion sklearn/compose/_column_transformer.py
@@ -371,8 +371,12 @@ def get_feature_names(self):
raise AttributeError("Transformer %s (type %s) does not "
"provide get_feature_names."
% (str(name), type(trans).__name__))
try:
amueller (Member Author): this is ducktyping to support both transformative and non-transformative get_feature_names.

more_names = trans.get_feature_names(input_features=column)
except TypeError:
more_names = trans.get_feature_names()
feature_names.extend([name + "__" + f for f in
trans.get_feature_names()])
more_names])
return feature_names

def _update_fitted_transformers(self, transformers):
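The try/except in ``get_feature_names`` above duck-types between the two
signatures that coexist during this transition: transformers updated by this
PR accept ``input_features``, while older ones (e.g. vectorizers) take no
arguments. Isolated as a hypothetical helper (not PR code):

def _prefixed_feature_names(trans, name, column):
    # Duck-typing between old- and new-style get_feature_names signatures.
    try:
        # New-style signature: derive output names from the input columns.
        more_names = trans.get_feature_names(input_features=column)
    except TypeError:
        # Old-style signature: takes no arguments.
        more_names = trans.get_feature_names()
    return [name + "__" + f for f in more_names]

One caveat of this pattern is that a ``TypeError`` raised inside a new-style
``get_feature_names`` would be silently retried as an old-style call.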
15 changes: 14 additions & 1 deletion sklearn/compose/tests/test_column_transformer.py
@@ -23,6 +23,7 @@
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline


class Trans(BaseEstimator):
@@ -660,6 +661,17 @@ def test_column_transformer_get_feature_names():
"Transformer trans (type Trans) does not provide "
"get_feature_names", ct.get_feature_names)

# if some transformers support and some don't
ct = ColumnTransformer([('trans', Trans(), [0, 1]),
('scale', StandardScaler(), [0])])
ct.fit(X_array)
assert_raise_message(AttributeError,
"Transformer trans (type Trans) does not provide "
"get_feature_names", ct.get_feature_names)

# inside a pipeline
make_pipeline(ct).fit(X_array)

# working example
X = np.array([[{'a': 1, 'b': 2}, {'a': 3, 'b': 4}],
[{'c': 5}, {'c': 6}]], dtype=object).T
@@ -1367,4 +1379,5 @@ def test_feature_names_empty_columns(empty_col):
)

ct.fit(df)
assert ct.get_feature_names() == ['ohe__x0_a', 'ohe__x0_b', 'ohe__x1_z']
assert ct.get_feature_names() == ['ohe__col1_a', 'ohe__col1_b',
'ohe__col2_z']