DOC: wording in linear model interpretation by GaelVaroquaux · Pull Request #16680 · scikit-learn/scikit-learn · GitHub
108 changes: 67 additions & 41 deletions examples/inspection/plot_linear_model_coefficient_interpretation.py
@@ -3,9 +3,9 @@
Common pitfalls in interpretation of coefficients of linear models
==================================================================

In linear models, the target value is modeled as
a linear combination of the features (see the :ref:`linear_model` User Guide
section for a description of a set of linear models available in
scikit-learn).
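
For instance, for features :math:`X_1, \dots, X_p` such a model predicts

.. math:: \hat{y} = w_0 + w_1 X_1 + \dots + w_p X_p

where the :math:`w_i` are the coefficients discussed throughout this example
(this generic notation is only illustrative).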
Coefficients in multiple linear models represent the relationship between the
given feature, :math:`X_i`, and the target, :math:`y`, assuming that all the
@@ -56,8 +56,9 @@
X.describe(include="all")

##############################################################################
# Note that the dataset contains categorical and numerical variables.
# We will need to take this into account when preprocessing the dataset
# thereafter.

X.head()

@@ -92,21 +93,24 @@
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

##############################################################################
# Looking closely at the WAGE distribution reveals that it has a
# long tail. For this reason, we should take its logarithm
# to turn it approximately into a normal distribution (linear models such
# as ridge or lasso work best for a normal distribution of error).
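#
# As a quick, illustrative sketch (assuming ``train_dataset`` holds the WAGE
# column shown in the pairplot above), the effect of the transform can be
# checked by comparing the two histograms:

import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(8, 3))
ax1.hist(train_dataset['WAGE'], bins=30)
ax1.set_title('WAGE: long right tail')
ax2.hist(np.log(train_dataset['WAGE']), bins=30)
ax2.set_title('log(WAGE): closer to normal')
plt.tight_layout()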
#
# The WAGE is increasing when EDUCATION is increasing.
# Note that the dependence between WAGE and EDUCATION
# represented here is a marginal dependence, i.e., it describes the behavior
# of a specific variable without keeping the others fixed.
#
# Also, the EXPERIENCE and AGE are strongly linearly correlated.
#
# .. _the-pipeline:
#
# The machine-learning pipeline
# -----------------------------
#
# To design our machine-learning pipeline, we first manually
# check the type of data that we are dealing with:

survey.data.info()
@@ -137,7 +141,7 @@
)

##############################################################################
# To describe the dataset as a linear model we use a ridge regressor
# with a very small regularization, and we model the logarithm of the WAGE.
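#
# A minimal sketch of such a model (here ``preprocessor`` stands for the
# column transformer assembled above; the log/exp transform pair is one
# possible choice, not necessarily the one used in the full example):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),  # very small regularization
        func=np.log,                   # fit on log(WAGE) ...
        inverse_func=np.exp,           # ... and predict back in dollars/hour
    ),
)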


@@ -190,11 +194,11 @@
# The model learnt is far from being a good model that makes accurate
# predictions: this is obvious when looking at the plot above, where good
# predictions should lie on the red line.
#
# In the following section, we will interpret the coefficients of the model.
# While we do so, we should keep in mind that any conclusion we draw is
# about the model that we build, rather than about the true (real-world)
# generative process of the data.
#
# Interpreting coefficients: scale matters
# ---------------------------------------------
@@ -218,7 +222,7 @@
##############################################################################
# The AGE coefficient is expressed in "dollars/hour per living years" while the
# EDUCATION one is expressed in "dollars/hour per years of education". This
# representation of the coefficients has the benefit of making clear the
# practical predictions of the model: an increase of :math:`1` year in AGE
# means a decrease of :math:`0.030867` dollars/hour, while an increase of
# :math:`1` year in EDUCATION means an increase of :math:`0.054699`
@@ -227,7 +231,7 @@
# are expressed in dollars/hour. Then, we cannot compare the magnitude of
# different coefficients since the features have different natural scales, and
# hence value ranges, because of their different unit of measure. This is more
# visible if we plot the coefficients.

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
@@ -237,12 +241,15 @@
###############################################################################
# Indeed, from the plot above the most important factor in determining WAGE
# appears to be the
# variable UNION, even if our intuition might tell us that variables
# like EXPERIENCE should have more impact.
#
# Looking at the coefficient plot to gauge feature importance can be
# misleading as some of them vary on a small scale, while others, like AGE,
# vary a lot more, over several decades.
#
# This is visible if we compare the standard deviations of different
# features.

X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
@@ -296,8 +303,11 @@
# Checking the variability of the coefficients
# --------------------------------------------
#
# We can check the coefficient variability through cross-validation:
# it is a form of data perturbation (related to
# `resampling <https://en.wikipedia.org/wiki/Resampling_(statistics)>`_).
#
# If coefficients vary significantly when changing the input dataset
# their robustness is not guaranteed, and they should probably be interpreted
# with caution.
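#
# As a rough sketch of such a check (``feature_names`` is assumed to hold the
# names of the preprocessed features), we can refit the pipeline on several
# resampled folds and collect the coefficients of each fit:

import pandas as pd
from sklearn.model_selection import RepeatedKFold, cross_validate

cv_model = cross_validate(
    model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
    return_estimator=True,
)
cv_coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model['estimator']],
    columns=feature_names,
)
cv_coefs.plot.box(vert=False, figsize=(9, 7))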

@@ -330,6 +340,7 @@
# might be due to the collinearity between the 2 features: as AGE and
# EXPERIENCE vary together in the data, their effect is difficult to tease
# apart.
#
# To verify this interpretation we plot the variability of the AGE and
# EXPERIENCE coefficients.
#
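# One possible way to do this (reusing the per-fold ``cv_coefs`` gathered in
# the sketch above, and assuming AGE and EXPERIENCE are among the feature
# names) is a scatter plot of the two coefficients across folds:

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(cv_coefs['AGE'], cv_coefs['EXPERIENCE'])
plt.xlabel('AGE coefficient')
plt.ylabel('EXPERIENCE coefficient')
_ = plt.title('Co-variation of AGE and EXPERIENCE coefficients across folds')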
@@ -446,7 +457,7 @@
plt.subplots_adjust(left=.3)

##############################################################################
# We now inspect the coefficients across several cross-validation folds.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -472,11 +483,12 @@
#
# In machine-learning practice, Ridge Regression is more often used with
# non-negligible regularization.
#
# Above, we limited this regularization to a very small amount.
# Regularization improves the conditioning of the problem and reduces the
# variance of the estimates. RidgeCV applies cross validation in order to
# determine which value of the regularization parameter (`alpha`) is best
# suited for prediction.

from sklearn.linear_model import RidgeCV

@@ -492,7 +504,7 @@
_ = model.fit(X_train, y_train)

##############################################################################
# First we check which value of :math:`\alpha` has been selected.

model[-1].regressor_.alpha_

@@ -533,15 +545,18 @@

##############################################################################
# The coefficients are significantly different.
# AGE and EXPERIENCE coefficients are both positive but they now have less
# influence on the prediction.
#
# The regularization reduces the influence of correlated
# variables on the model because the weight is shared between the two
# predictive variables, so neither alone would have strong weights.
#
# On the other hand, the weights obtained with regularization are more
# stable (see the :ref:`ridge_regression` User Guide section). This
# increased stability is visible in the plot below, obtained from data
# perturbations in a cross-validation. This plot can be compared with
# the :ref:`previous one<covariation>`.

cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -632,14 +647,25 @@
# A Lasso model identifies the correlation between
# AGE and EXPERIENCE and suppresses one of them for the sake of the prediction.
#
# It is important to keep in mind that the coefficients that have been
# dropped may still be related to the outcome by themselves: the model
# chose to suppress them because they bring little or no additional
# information on top of the other features. Additionally, this selection
# is unstable for correlated features, and should be interpreted with
# caution.
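#
# A minimal, illustrative version of such a model (reusing the
# ``preprocessor`` and the log transform of WAGE from above; the alpha grid
# and iteration budget are illustrative, not the exact settings of the full
# example):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

lasso_model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=LassoCV(alphas=np.logspace(-10, 10, 21), max_iter=100000),
        func=np.log,
        inverse_func=np.exp,
    ),
)
_ = lasso_model.fit(X_train, y_train)
lasso_model[-1].regressor_.alpha_  # regularization strength selected by CV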
#
# Lessons learned
# ---------------
#
# * Coefficients must be scaled to the same unit of measure to retrieve
# feature importance. Scaling them with the standard-deviation of the
# feature is a useful proxy.
# * Coefficients in multivariate linear models represent the dependency
# between a given feature and the target, **conditional** on the other
# features.
# * Correlated features induce instabilities in the coefficients of linear
# models and their effects cannot be well teased apart.
# * Different linear models respond differently to feature correlation and
# coefficients could significantly vary from one another.
# * Inspecting coefficients across the folds of a cross-validation loop
# gives an idea of their stability.