DOC: wording in linear model interpretation by GaelVaroquaux · Pull Request #16680 · scikit-learn/scikit-learn · GitHub
DOC: wording in linear model interpretation #16680


Merged
merged 8 commits on Mar 13, 2020
108 changes: 67 additions & 41 deletions examples/inspection/plot_linear_model_coefficient_interpretation.py
@@ -3,9 +3,9 @@
Common pitfalls in interpretation of coefficients of linear models
==================================================================

In linear models, the target value is modeled as
a linear combination of the features (see the :ref:`linear_model` User Guide
section for a description of a set of linear models available in
scikit-learn).
Coefficients in multiple linear models represent the relationship between the
given feature, :math:`X_i`, and the target, :math:`y`, assuming that all the
@@ -56,8 +56,9 @@
X.describe(include="all")

##############################################################################
# Note that the dataset contains categorical and numerical variables.
# We will need to take this into account when preprocessing the dataset
# thereafter.

X.head()
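##############################################################################
# As an aside, a sketch of how the two kinds of columns could be separated
# programmatically (an illustration only, not necessarily how this example
# does it; the dtype lists are assumptions about the dataset):

from sklearn.compose import make_column_selector as selector

categorical_columns = selector(dtype_include=["category", object])(X)
numerical_columns = selector(dtype_exclude=["category", object])(X)

print("categorical:", categorical_columns)
print("numerical:", numerical_columns)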

@@ -92,21 +93,24 @@
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

##############################################################################
# Looking closely at the WAGE distribution reveals that it has a
# long tail. For this reason, we should take its logarithm
# to turn it approximately into a normal distribution (linear models such
# as ridge or lasso work best for a normal distribution of error).
#
# The WAGE increases as EDUCATION increases.
# Note that the dependence between WAGE and EDUCATION
# represented here is a marginal dependence, i.e., it describes the behavior
# of a specific variable without keeping the others fixed.
#
# Also, EXPERIENCE and AGE are strongly linearly correlated.
#
# .. _the-pipeline:
#
# The machine-learning pipeline
# -----------------------------
#
# To design our machine-learning pipeline, we first manually
# check the type of data that we are dealing with:

survey.data.info()
@@ -137,7 +141,7 @@
)

##############################################################################
# To describe the dataset as a linear model we use a ridge regressor
# with a very small regularization and we model the logarithm of the WAGE.
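##############################################################################
# A sketch of what such a pipeline could look like (recent scikit-learn
# assumed). The preprocessing options (one-hot encoding of the categorical
# columns, dense output, unprefixed feature names) and the use of
# ``np.log``/``np.exp`` for the target are assumptions for illustration; the
# exact definition used by this example may differ.

import numpy as np

from sklearn.compose import TransformedTargetRegressor, make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    remainder="passthrough",          # keep the numerical columns unchanged
    sparse_threshold=0,               # force a dense output, easier to inspect
    verbose_feature_names_out=False,  # keep plain feature names such as "AGE"
)

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),  # very small regularization
        func=np.log,                   # model the logarithm of the WAGE
        inverse_func=np.exp,
    ),
)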


@@ -190,11 +194,11 @@
# The model learnt is far from being a good model making accurate predictions:
# this is obvious when looking at the plot above, where good predictions
# should lie on the red line.
#
# In the following section, we will interpret the coefficients of the model.
# While we do so, we should keep in mind that any conclusion we draw is
# about the model that we build, rather than about the true (real-world)
# generative process of the data.
#
# Interpreting coefficients: scale matters
# ---------------------------------------------
@@ -218,7 +222,7 @@
##############################################################################
# The AGE coefficient is expressed in "dollars/hour per living years" while the
# EDUCATION one is expressed in "dollars/hour per years of education". This
# representation of the coefficients has the benefit of making clear the
# practical predictions of the model: an increase of :math:`1` year in AGE
# means a decrease of :math:`0.030867` dollars/hour, while an increase of
# :math:`1` year in EDUCATION means an increase of :math:`0.054699`
@@ -227,7 +231,7 @@
# are expressed in dollars/hour. Then, we cannot compare the magnitude of
# different coefficients since the features have different natural scales, and
# hence value ranges, because of their different unit of measure. This is more
# visible if we plot the coefficients.
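##############################################################################
# A sketch of how the ``coefs`` table used below can be built from the fitted
# model. It assumes a recent scikit-learn providing ``get_feature_names_out``
# (older versions expose a similar ``get_feature_names`` helper) and the step
# names produced by ``make_pipeline`` in the sketch above.

import pandas as pd

feature_names = model.named_steps['columntransformer'].get_feature_names_out()
coefs = pd.DataFrame(
    model.named_steps['transformedtargetregressor'].regressor_.coef_,
    columns=['Coefficients'],
    index=feature_names,
)
coefs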

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
@@ -237,12 +241,15 @@
###############################################################################
# Indeed, from the plot above the most important factor in determining WAGE
# appears to be the
# variable UNION, even if our intuition might tell us that variables
# like EXPERIENCE should have more impact.
#
# Looking at the coefficient plot to gauge feature importance can be
# misleading as some of them vary on a small scale, while others, like AGE,
# vary a lot more, spanning several decades.
#
# This is visible if we compare the standard deviations of different
# features.

X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
@@ -296,8 +303,11 @@
# Checking the variability of the coefficients
# --------------------------------------------
#
# We can check the coefficient variability through cross-validation:
# it is a form of data perturbation (related to
# `resampling <https://en.wikipedia.org/wiki/Resampling_(statistics)>`_).
#
# If coefficients vary significantly when changing the input dataset,
# their robustness is not guaranteed, and they should probably be interpreted
# with caution.
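##############################################################################
# A minimal sketch of such a check: one model is refit per cross-validation
# split and its coefficients are collected (``feature_names`` is assumed to be
# defined as in the sketch above).

from sklearn.model_selection import RepeatedKFold, cross_validate

cv_model = cross_validate(
    model, X, y,
    cv=RepeatedKFold(n_splits=5, n_repeats=5),
    return_estimator=True,
)
coefs_per_fold = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model['estimator']],
    columns=feature_names,
)
coefs_per_fold.std()  # spread of each coefficient across the refitted models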

@@ -330,6 +340,7 @@
# might be due to the collinearity between the 2 features: as AGE and
# EXPERIENCE vary together in the data, their effect is difficult to tease
# apart.
#
# To verify this interpretation we plot the variability of the AGE and
# EXPERIENCE coefficient.
#
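##############################################################################
# For instance, a sketch of such a plot using the per-fold coefficients
# gathered above (the column names are assumed to be plain "AGE" and
# "EXPERIENCE"; depending on the preprocessing they may carry a prefix such as
# "remainder__AGE"):

import matplotlib.pyplot as plt

plt.scatter(coefs_per_fold['AGE'], coefs_per_fold['EXPERIENCE'])
plt.xlabel('AGE coefficient')
plt.ylabel('EXPERIENCE coefficient')
_ = plt.title('Co-variation of coefficients across folds')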
@@ -446,7 +457,7 @@
plt.subplots_adjust(left=.3)

##############################################################################
# We now inspect the coefficients across several cross-validation folds.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -472,11 +483,12 @@
#
# In machine-learning practice, Ridge Regression is more often used with
# non-negligible regularization.
#
# Above, we limited this regularization to a very small amount.
# Regularization improves the conditioning of the problem and reduces the
# variance of the estimates. RidgeCV applies cross validation in order to
# determine which value of the regularization parameter (`alpha`) is best
# suited for prediction.

from sklearn.linear_model import RidgeCV
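##############################################################################
# A sketch of how ``RidgeCV`` could replace the ridge estimator in the
# pipeline defined earlier, reusing ``preprocessor`` and the imports from that
# sketch (the grid of candidate ``alphas`` is an assumption for illustration):

alphas = np.logspace(-10, 10, 21)  # candidate regularization strengths

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=RidgeCV(alphas=alphas),
        func=np.log,
        inverse_func=np.exp,
    ),
)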

Expand All @@ -492,7 +504,7 @@
_ = model.fit(X_train, y_train)

##############################################################################
# First we check which value of :math:`\alpha` has been selected.

model[-1].regressor_.alpha_

Expand Down Expand Up @@ -533,15 +545,18 @@

##############################################################################
# The coefficients are significantly different.
# AGE and EXPERIENCE coefficients are both positive but they now have less
# influence on the prediction.
#
# The regularization reduces the influence of correlated
# variables on the model because the weight is shared between the two
# predictive variables, so neither alone would have strong weights.
#
# On the other hand, the weights obtained with regularization are more
# stable (see the :ref:`ridge_regression` User Guide section). This
# increased stability is visible in the plot below, obtained from data
# perturbations in a cross-validation loop. This plot can be compared with
# the :ref:`previous one<covariation>`.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -632,14 +647,25 @@
# A Lasso model identifies the correlation between
# AGE and EXPERIENCE and suppresses one of them for the sake of the prediction.
#
# It is important to keep in mind that the coefficients that have been
# dropped may still be related to the outcome by themselves: the model
# chose to suppress them because they bring little or no additional
# information on top of the other features. Additionally, this selection
# is unstable for correlated features, and should be interpreted with
# caution.
#
# Lessons learned
# ---------------
#
# * Coefficients must be scaled to the same unit of measure to retrieve
#   feature importance. Scaling them with the standard-deviation of the
#   feature is a useful proxy, as sketched at the end of this example.
# * Coefficients in multivariate linear models represent the dependency
#   between a given feature and the target, **conditional** on the other
#   features.
# * Correlated features induce instabilities in the coefficients of linear
#   models and their effects cannot be well teased apart.
# * Different linear models respond differently to feature correlation and
#   coefficients could significantly vary from one another.
# * Inspecting coefficients across the folds of a cross-validation loop
#   gives an idea of their stability.
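##############################################################################
# As an illustration of the first lesson above, here is a sketch of rescaling
# the coefficients by the standard deviation of each preprocessed feature
# (``model``, ``feature_names`` and ``X_train`` are assumed to be defined as
# in the sections above; the rescaling works the same for any of the fitted
# pipelines):

X_train_preprocessed = pd.DataFrame(
    model.named_steps['columntransformer'].transform(X_train),
    columns=feature_names,
)
importance = (
    model.named_steps['transformedtargetregressor'].regressor_.coef_
    * X_train_preprocessed.std(axis=0)
)
importance.plot(kind='barh', figsize=(9, 7))
plt.title('Coefficients scaled by the standard deviation of their feature')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)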