DOC: wording in linear model interpretation by GaelVaroquaux · Pull Request #16680 · scikit-learn/scikit-learn · GitHub
108 changes: 67 additions & 41 deletions examples/inspection/plot_linear_model_coefficient_interpretation.py
@@ -3,9 +3,9 @@
Common pitfalls in interpretation of coefficients of linear models
==================================================================

In linear models, the target value is modeled as
a linear combination of the features (see the :ref:`linear_model` User Guide
section for a description of a set of linear models available in
scikit-learn).
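
For instance, for features :math:`X_1, \dots, X_p` such a model predicts

.. math:: \hat{y} = w_0 + w_1 X_1 + \dots + w_p X_p

where the :math:`w_i` are the coefficients discussed throughout this example
(this generic notation is only illustrative).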
Coefficients in multiple linear models represent the relationship between the
given feature, :math:`X_i`, and the target, :math:`y`, assuming that all the
@@ -56,8 +56,9 @@
X.describe(include="all")

##############################################################################
# Note that the dataset contains categorical and numerical variables.
# We will need to take this into account when preprocessing the dataset
# thereafter.

X.head()

@@ -92,21 +93,24 @@
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

##############################################################################
# Looking closely at the WAGE distribution reveals that it has a
# long tail. For this reason, we should take its logarithm
# to turn it approximately into a normal distribution (linear models such
# as ridge or lasso work best for a normal distribution of error).
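#
# As a quick, illustrative sketch (assuming ``train_dataset`` holds the WAGE
# column shown in the pairplot above), the effect of the transform can be
# checked by comparing the two histograms:

import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 3))
ax1.hist(train_dataset['WAGE'], bins=30)
ax1.set_title('WAGE: long right tail')
ax2.hist(np.log(train_dataset['WAGE']), bins=30)
ax2.set_title('log(WAGE): closer to normal')
plt.tight_layout()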
#
# The WAGE is increasing when EDUCATION is increasing.
# Note that the dependence between WAGE and EDUCATION
# represented here is a marginal dependence, i.e., it describes the behavior
# of a specific variable without keeping the others fixed.
#
# Also, the EXPERIENCE and AGE are strongly linearly correlated.
#
# .. _the-pipeline:
#
# The machine-learning pipeline
# -----------------------------
#
# To design our machine-learning pipeline, we first manually
# check the type of data that we are dealing with:

survey.data.info()
@@ -137,7 +141,7 @@
)

##############################################################################
# To describe the dataset as a linear model we use a ridge regressor
# with a very small regularization, and we model the logarithm of the WAGE.
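#
# A minimal sketch of such a model (here ``preprocessor`` stands for the
# column transformer assembled above; the log/exp transform pair is one
# possible choice, not necessarily the one used in the full example):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),  # very small regularization
        func=np.log,                   # fit on log(WAGE) ...
        inverse_func=np.exp,           # ... and predict back in dollars/hour
    ),
)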


@@ -190,11 +194,11 @@
# The model learnt is far from being a good model that makes accurate
# predictions: this is obvious when looking at the plot above, where good
# predictions should lie on the red line.
#
# In the following section, we will interpret the coefficients of the model.
# While we do so, we should keep in mind that any conclusion we draw is
# about the model that we build, rather than about the true (real-world)
# generative process of the data.
#
# Interpreting coefficients: scale matters
# ---------------------------------------------
@@ -218,7 +222,7 @@
##############################################################################
# The AGE coefficient is expressed in "dollars/hour per living years" while the
# EDUCATION one is expressed in "dollars/hour per years of education". This
# representation of the coefficients has the benefit of making clear the
# practical predictions of the model: an increase of :math:`1` year in AGE
# means a decrease of :math:`0.030867` dollars/hour, while an increase of
# :math:`1` year in EDUCATION means an increase of :math:`0.054699`
@@ -227,7 +231,7 @@
# are expressed in dollars/hour. Then, we cannot compare the magnitude of
# different coefficients since the features have different natural scales, and
# hence value ranges, because of their different unit of measure. This is more
# visible if we plot the coefficients.

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
@@ -237,12 +241,15 @@
###############################################################################
# Indeed, from the plot above the most important factor in determining WAGE
# appears to be the
# variable UNION, even if our intuition might tell us that variables
# like EXPERIENCE should have more impact.
#
# Looking at the coefficient plot to gauge feature importance can be
# misleading as some of them vary on a small scale, while others, like AGE,
# vary a lot more, over several decades.
#
# This is visible if we compare the standard deviations of different
# features.

X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
@@ -296,8 +303,11 @@
# Checking the variability of the coefficients
# --------------------------------------------
#
# We can check the coefficient variability through cross-validation:
# it is a form of data perturbation (related to
# `resampling <https://en.wikipedia.org/wiki/Resampling_(statistics)>`_).
#
# If coefficients vary significantly when changing the input dataset
# their robustness is not guaranteed, and they should probably be interpreted
# with caution.
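#
# As a rough sketch of such a check (``feature_names`` is assumed to hold the
# names of the preprocessed features), we can refit the pipeline on several
# resampled folds and collect the coefficients of each fit:

import pandas as pd
from sklearn.model_selection import RepeatedKFold, cross_validate

cv_model = cross_validate(
    model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
    return_estimator=True,
)
cv_coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model['estimator']],
    columns=feature_names,
)
cv_coefs.plot.box(vert=False, figsize=(9, 7))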

@@ -330,6 +340,7 @@
# might be due to the collinearity between the 2 features: as AGE and
# EXPERIENCE vary together in the data, their effect is difficult to tease
# apart.
#
# To verify this interpretation we plot the variability of the AGE and
# EXPERIENCE coefficients.
#
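# One possible way to do this (reusing the per-fold ``cv_coefs`` gathered in
# the sketch above, and assuming AGE and EXPERIENCE are among the feature
# names) is a scatter plot of the two coefficients across folds:

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(cv_coefs['AGE'], cv_coefs['EXPERIENCE'])
plt.xlabel('AGE coefficient')
plt.ylabel('EXPERIENCE coefficient')
_ = plt.title('Co-variation of AGE and EXPERIENCE coefficients across folds')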
@@ -446,7 +457,7 @@
plt.subplots_adjust(left=.3)

##############################################################################
# We now inspect the coefficients across several cross-validation folds.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -472,11 +483,12 @@
#
# In machine-learning practice, Ridge Regression is more often used with
# non-negligible regularization.
#
# Above, we limited this regularization to a very small amount.
# Regularization improves the conditioning of the problem and reduces the
# variance of the estimates. RidgeCV applies cross validation in order to
# determine which value of the regularization parameter (`alpha`) is best
# suited for prediction.

from sklearn.linear_model import RidgeCV

@@ -492,7 +504,7 @@
_ = model.fit(X_train, y_train)

##############################################################################
# First we check which value of :math:`\alpha` has been selected.

model[-1].regressor_.alpha_

@@ -533,15 +545,18 @@

##############################################################################
# The coefficients are significantly different.
# AGE and EXPERIENCE coefficients are both positive but they now have less
# influence on the prediction.
#
# The regularization reduces the influence of correlated
# variables on the model because the weight is shared between the two
# predictive variables, so neither alone would have strong weights.
#
# On the other hand, the weights obtained with regularization are more
# stable (see the :ref:`ridge_regression` User Guide section). This
# increased stability is visible in the plot below, obtained from data
# perturbations in a cross-validation. This plot can be compared with
# the :ref:`previous one<covariation>`.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -632,14 +647,25 @@
# A Lasso model identifies the correlation between
# AGE and EXPERIENCE and suppresses one of them for the sake of the prediction.
#
# It is important to keep in mind that the coefficients that have been
# dropped may still be related to the outcome by themselves: the model
# chose to suppress them because they bring little or no additional
# information on top of the other features. Additionally, this selection
# is unstable for correlated features, and should be interpreted with
# caution.
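#
# A minimal, illustrative version of such a model (reusing the
# ``preprocessor`` and the log transform of WAGE from above; the alpha grid
# and iteration budget are illustrative, not the exact settings of the full
# example):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

lasso_model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=LassoCV(alphas=np.logspace(-10, 10, 21), max_iter=100000),
        func=np.log,
        inverse_func=np.exp,
    ),
)
_ = lasso_model.fit(X_train, y_train)
lasso_model[-1].regressor_.alpha_  # regularization strength selected by CV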
#
# Lessons learned
# ---------------
#
# * Coefficients must be scaled to the same unit of measure to retrieve
# feature importance. Scaling them with the standard-deviation of the
# feature is a useful proxy.
# * Coefficients in multivariate linear models represent the dependency
# between a given feature and the target, **conditional** on the other
# features.
# * Correlated features induce instabilities in the coefficients of linear
# models and their effects cannot be well teased apart.
# * Different linear models respond differently to feature correlation and
# coefficients could significantly vary from one another.
# * Inspecting coefficients across the folds of a cross-validation loop
# gives an idea of their stability.