Feature names with input features #13307
@@ -139,6 +139,32 @@ or by name::

    >>> pipe['reduce_dim']
    PCA()

To enable model inspection, `Pipeline` sets an ``input_features_`` attribute on
all pipeline steps during fitting. This lets the user trace how features are
transformed as they pass through the pipeline::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.feature_selection import SelectKBest
    >>> iris = load_iris()
    >>> pipe = Pipeline(steps=[
    ...     ('select', SelectKBest(k=2)),
    ...     ('clf', LogisticRegression())])
    >>> pipe.fit(iris.data, iris.target)
    ... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
    Pipeline(memory=None,
             steps=[('select', SelectKBest(...)),
                    ('clf', LogisticRegression(...))])
    >>> pipe.named_steps.clf.input_features_
    array(['x2', 'x3'], dtype='<U2')

You can also provide custom feature names for a more human-readable format
using ``get_feature_names``::

    >>> pipe.get_feature_names(iris.feature_names)
    >>> pipe.named_steps.select.input_features_
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    >>> pipe.named_steps.clf.input_features_
    array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Review comment on lines +162 to +166: Calling ...
.. topic:: Examples:

 * :ref:`sphx_glr_auto_examples_feature_selection_plot_feature_selection_pipeline.py`
@@ -428,7 +454,7 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

     >>> from sklearn.feature_extraction.text import CountVectorizer
     >>> from sklearn.preprocessing import OneHotEncoder
     >>> column_trans = ColumnTransformer(
-    ...     [('city_category', OneHotEncoder(dtype='int'), ['city']),
+    ...     [('categories', OneHotEncoder(dtype='int'), ['city']),
     ...      ('title_bow', CountVectorizer(), 'title')],
     ...     remainder='drop')
@@ -438,11 +464,11 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

              ('title_bow', CountVectorizer(), 'title')])

     >>> column_trans.get_feature_names()
-    ['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
-     'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
-     'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
-     'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
-     'title_bow__wrath']
+    ['categories__city_London', 'categories__city_Paris',
+     'categories__city_Sallisaw', 'title_bow__bow', 'title_bow__feast',
+     'title_bow__grapes', 'title_bow__his', 'title_bow__how', 'title_bow__last',
+     'title_bow__learned', 'title_bow__moveable', 'title_bow__of', 'title_bow__the',
+     'title_bow__trick', 'title_bow__watson', 'title_bow__wrath']

     >>> column_trans.transform(X).toarray()
     array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
@@ -145,6 +145,50 @@

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))

###############################################################################
# Inspecting the coefficient values of the classifier
###############################################################################
# The coefficients of the final classification step of the pipeline give an
# idea of how each feature impacts the likelihood of survival, assuming that
# the usual linear model assumptions hold (uncorrelated features, linear
# separability, homoscedastic errors...), which we do not verify in this
# example.
#
# To get error bars we perform cross-validation and compute the mean and
# standard deviation for each coefficient across CV splits. Because we use a
# standard scaler on the numerical features, a coefficient tells us how much
# the log odds of surviving change when that feature moves one standard
# deviation away from its mean. Note that the categorical features here are
# overspecified, which makes them slightly harder to interpret because of the
# information redundancy.
#
# We can see that the linear model coefficients are in agreement with the
# historical reports: people in higher classes, and therefore on the upper
# decks, were the first to reach the lifeboats, and often priority was given
# to women and children.
#
# Note that, conditioned on the "pclass_x" one-hot features, the "fare"
# numerical feature does not seem to be significantly predictive. If we
# dropped the "pclass" feature, higher "fare" values would appear
# significantly correlated with a higher likelihood of survival, as the
# "fare" and "pclass" features have a strong statistical dependency.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=42)
cv_results = cross_validate(clf, X_train, y_train, cv=cv,
                            return_estimator=True)
cv_coefs = np.concatenate([cv_pipeline.named_steps["classifier"].coef_
                           for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(clf.named_steps["classifier"].input_features_,
        cv_coefs.mean(axis=0), xerr=cv_coefs.std(axis=0))
plt.tight_layout()
plt.show()

Review comments on the ``named_steps["classifier"]`` lookup:

- As mentioned IRL, I think that we could change the "last_estimator" of the pipeline to "last_estimator" and use it here so that this code snippet (which is a very common one) is independent of the name given to the last step:
- and below
- If we do so, I guess
- A transformer is regarded as an estimator. I think
- we now have [-1] which I think is fine.
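Separately, the "fare"/"pclass" point made above can be probed directly from the cross-validation results already computed. The following is a minimal sketch, not part of the original diff; it assumes the scaled numerical column keeps the name "fare" in ``input_features_``:

    # Hedged sketch: spread of the "fare" coefficient across CV splits.
    # The feature name "fare" is an assumption about what input_features_
    # contains for the scaled numerical column.
    feature_names = list(clf.named_steps["classifier"].input_features_)
    fare_idx = feature_names.index("fare")
    fare_coefs = cv_coefs[:, fare_idx]
    print("fare coefficient: %.3f +/- %.3f"
          % (fare_coefs.mean(), fare_coefs.std()))
    # An interval comfortably covering zero supports the observation that
    # "fare" is not significantly predictive once "pclass" is included.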
###############################################################################
# The resulting score is not exactly the same as the one from the previous
# pipeline because the dtype-based selector treats the ``pclass`` columns as
@@ -6,20 +6,23 @@

import copy
import warnings
from collections import defaultdict

import platform
import inspect
import re

import numpy as np

from . import __version__
from .exceptions import NotFittedError
from ._config import get_config
from .utils import _IS_32BIT
from .utils.validation import check_X_y
from .utils.validation import check_array
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _deprecate_positional_args


_DEFAULT_TAGS = {
    'non_deterministic': False,
    'requires_positive_X': False,
@@ -688,6 +691,49 @@ def fit_transform(self, X, y=None, **fit_params):

        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)

    def get_feature_names(self, input_features=None):
        """Get output feature names.

        Parameters
        ----------
        input_features : list of str or None
            String names of the input features.

        Returns
        -------
        output_feature_names : list of str
            Feature names for transformer output.
        """
        # OneToOneMixin is higher in the class hierarchy
        # because we put mixins on the wrong side
        if hasattr(super(), 'get_feature_names'):
            return super().get_feature_names(input_features)
        # Generate feature names from the class name by default.
        # There would be much less guessing if we stored the number of
        # output features; ideally this would be done in each class.
        if hasattr(self, 'n_clusters'):
            # checked before n_components_ because n_components_
            # means something else in agglomerative clustering
            n_features = self.n_clusters
        elif hasattr(self, '_max_components'):
            # special case for LinearDiscriminantAnalysis
            n_components = self.n_components or np.inf
            n_features = min(self._max_components, n_components)
        elif hasattr(self, 'n_components_'):
            # n_components could be 'auto' or None;
            # n_components_ is more likely to be an int
            n_features = self.n_components_
        elif hasattr(self, 'n_components') and self.n_components is not None:
            n_features = self.n_components
        elif hasattr(self, 'components_'):
            n_features = self.components_.shape[0]
        else:
            return None
        return ["{}{}".format(type(self).__name__.lower(), i)
                for i in range(n_features)]

Review comment (on the ``hasattr(super(), 'get_feature_names')`` check): use try-except? It's more efficient...
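As a hedged illustration of the fallback branches above (not part of the diff): a fitted ``PCA`` exposes ``n_components_``, so under this method the generated names would come from the lowercased class name, assuming ``PCA`` does not override ``get_feature_names`` itself:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2).fit(X)
    # The n_components_ branch applies, so names derive from the class name.
    print(pca.get_feature_names())  # expected: ['pca0', 'pca1']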
class DensityMixin:
    """Mixin class for all density estimators in scikit-learn."""
@@ -736,10 +782,81 @@ def fit_predict(self, X, y=None):

        return self.fit(X).predict(X)


class OneToOneMixin:
    """Provides get_feature_names for simple transformers.

    Assumes there's a 1-to-1 correspondence between input features
    and output features.
    """

    def get_feature_names(self, input_features=None):
        """Get feature names for transformation.

        Returns input_features, as this transformation
        doesn't add or drop features.

        Parameters
        ----------
        input_features : array-like of str
            Input feature names.

        Returns
        -------
        feature_names : array-like of str
            Transformed feature names.
        """
        if input_features is not None:
            return input_features
        else:
            raise ValueError("Don't know how to get"
                             " input feature names for {}".format(self))
def _get_sub_estimators(est):
    # Explicitly declare all fitted sub-estimators of existing
    # meta-estimators.
    sub_ests = []
    # OHE is not really needed
    sub_names = ['estimator_', 'base_estimator_', 'one_hot_encoder_',
                 'best_estimator_', 'init_']
    for name in sub_names:
        sub_est = getattr(est, name, None)
        if sub_est is not None:
            sub_ests.append(sub_est)
    if hasattr(est, "estimators_"):
        if hasattr(est.estimators_, 'shape'):
            sub_ests.extend(est.estimators_.ravel())
        else:
            sub_ests.extend(est.estimators_)
    return sub_ests

Review discussion on ``_get_sub_estimators``:

- Shouldn't we solve this issue by adding a "sub_estimators_" attribute on all meta-estimators?
- Why is that easier / better?
- Well, it's the meta-estimator's responsibility to define what its sub-estimators are. The list above should be modified when we add meta-estimators. It cannot be extended by other packages implementing meta-estimators. I find that in an object-oriented design, it is more natural that each class deals with specifying its own functionality.
- If we add meta-estimators that don't obey our conventions, a test will fail.
- Indeed, there are several ways to delegate the responsibility to the meta-estimator, either via delegating to a method, or to an attribute. The attribute seemed pretty light to me. With regards to avoiding premature test failures, we can simply pass when sub_estimator_ is not present.
- A new sub-estimator was added while this discussion was happening lol. I'm not sure I understand your last sentence. I don't really care that much in either direction here; it's an implementation detail that's not really important to the larger feature addition, I think.
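Whatever mechanism is eventually chosen, the current behaviour can be sketched as follows (a hedged example, inferring results from the attribute list in ``_get_sub_estimators``; not part of the diff):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # SelectFromModel stores its fitted sub-estimator as ``estimator_``.
    sfm = SelectFromModel(LogisticRegression()).fit(X, y)
    print(_get_sub_estimators(sfm))  # [LogisticRegression()]

    # RandomForestClassifier exposes ``estimators_`` (the fitted trees),
    # plus a ``base_estimator_`` template on some versions.
    rf = RandomForestClassifier(n_estimators=3).fit(X, y)
    print(len(_get_sub_estimators(rf)))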
class MetaEstimatorMixin:
    _required_parameters = ["estimator"]
    """Mixin class for all meta estimators in scikit-learn."""

    def get_feature_names(self, input_features=None):
        """Ensure feature names are set on sub-estimators.

        Parameters
        ----------
        input_features : list of str or None
            Input features to the meta-estimator.
        """
        sub_ests = _get_sub_estimators(self)
        for est in sub_ests:
            est.input_features_ = input_features
            if hasattr(est, "get_feature_names"):
                # Using hasattr instead of a try-except on everything,
                # because catching AttributeError makes recursive code
                # impossible to debug.
                try:
                    est.get_feature_names(input_features=input_features)
                except TypeError:
                    # do we need this?
                    est.get_feature_names()
                except NotFittedError:
                    pass
Review discussion on ``MetaEstimatorMixin.get_feature_names``:

- Can a meta-estimator sometimes be a transformer? If so, we should return feature names here in that specific case?
- SelectFromModel is both.
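Either way, the side effect being discussed can be sketched as follows (hedged: assumes ``SelectFromModel`` inherits this mixin and that this implementation of ``get_feature_names`` is the one resolved by the MRO):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    iris = load_iris()
    sfm = SelectFromModel(LogisticRegression()).fit(iris.data, iris.target)
    sfm.get_feature_names(input_features=list(iris.feature_names))
    # The names were pushed down onto the fitted sub-estimator:
    print(sfm.estimator_.input_features_)  # the four iris feature names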
class MultiOutputMixin:
    """Mixin to mark estimators that support multioutput."""

Review comment (on ``MetaEstimatorMixin.get_feature_names``): This returning nothing is strange. It's more of "set_input_feature_names".