diff --git a/examples/inspection/plot_linear_model_coefficient_interpretation.py b/examples/inspection/plot_linear_model_coefficient_interpretation.py
index 9dc8f823ae8bd..7583bfa0a052f 100644
--- a/examples/inspection/plot_linear_model_coefficient_interpretation.py
+++ b/examples/inspection/plot_linear_model_coefficient_interpretation.py
@@ -3,9 +3,9 @@
 Common pitfalls in interpretation of coefficients of linear models
 ==================================================================

-Linear models describe situations in which the target value is expected to be
+In linear models, the target value is modeled as
 a linear combination of the features (see the :ref:`linear_model` User Guide
-section for a description of a set of linear model methods available in
+section for a description of a set of linear models available in
 scikit-learn).
 Coefficients in multiple linear models represent the relationship between the
 given feature, :math:`X_i` and the target, :math:`y`, assuming that all the
@@ -56,8 +56,9 @@
 X.describe(include="all")

 ##############################################################################
-# Notice that the dataset contains categorical and numerical variables.
-# This will give us directions on how to preprocess the data thereafter.
+# Note that the dataset contains categorical and numerical variables.
+# We will need to take this into account when preprocessing the dataset
+# thereafter.

 X.head()

@@ -92,21 +93,24 @@
 _ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

 ##############################################################################
-# Looking closely at the WAGE distribution it could be noticed that it has a
-# long tail and we could take its logarithm
-# to simplify our problem and approximate a normal distribution.
+# Looking closely at the WAGE distribution reveals that it has a
+# long tail. For this reason, we should take its logarithm
+# to turn it approximately into a normal distribution (linear models such
+# as ridge or lasso work best for a normal distribution of error).
+#
 # The WAGE is increasing when EDUCATION is increasing.
-# It should be noted that the dependence between WAGE and EDUCATION
-# represented here is a marginal dependence, i.e., it describe the behavior
-# of a specific variable without fixing the others.
-# Also, the EXPERIENCE and AGE are linearly correlated.
+# Note that the dependence between WAGE and EDUCATION
+# represented here is a marginal dependence, i.e., it describes the behavior
+# of a specific variable without keeping the others fixed.
+#
+# Also, the EXPERIENCE and AGE are strongly linearly correlated.
 #
 # .. _the-pipeline:
 #
 # The machine-learning pipeline
 # -----------------------------
 #
-# To design our machine-learning pipeline, we manually
+# To design our machine-learning pipeline, we first manually
 # check the type of data that we are dealing with:

 survey.data.info()

@@ -137,7 +141,7 @@
 )

 ##############################################################################
-# To describe the dataset as a linear model we choose to use a ridge regressor
+# To describe the dataset as a linear model we use a ridge regressor
 # with a very small regularization and to model the logarithm of the WAGE.

@@ -190,11 +194,11 @@
 # The model learnt is far from being a good model making accurate predictions:
 # this is obvious when looking at the plot above, where good predictions
 # should lie on the red line.
+#
 # In the following section, we will interpret the coefficients of the model.
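
For readers who want to reproduce the kind of pipeline described above outside of this example, a minimal self-contained sketch could look as follows; the toy columns and target below are illustrative stand-ins for the survey data, not the actual dataset:

    import numpy as np
    import pandas as pd
    from sklearn.compose import TransformedTargetRegressor, make_column_transformer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Toy stand-in for the survey data: one categorical and two numerical columns.
    rng = np.random.RandomState(42)
    n_samples = 300
    toy_X = pd.DataFrame({
        "EDUCATION": rng.uniform(6, 18, n_samples),
        "AGE": rng.uniform(18, 65, n_samples),
        "UNION": rng.choice(["member", "not_member"], n_samples),
    })
    # A strictly positive, right-skewed target, mimicking WAGE.
    toy_y = np.exp(0.05 * toy_X["EDUCATION"] + rng.normal(0, 0.2, n_samples))

    # One-hot encode the categorical column, pass the numerical ones through.
    preprocessor = make_column_transformer(
        (OneHotEncoder(), ["UNION"]), remainder="passthrough"
    )
    # Ridge with a very small regularization, fitted on log(target) and
    # predicting back on the original scale.
    toy_model = make_pipeline(
        preprocessor,
        TransformedTargetRegressor(
            regressor=Ridge(alpha=1e-10), func=np.log, inverse_func=np.exp
        ),
    )
    toy_model.fit(toy_X, toy_y)
    print(toy_model.predict(toy_X.head()))  # predictions on the original scale

The TransformedTargetRegressor applies the logarithm at fit time and the exponential at predict time, so predictions stay in the original units.
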
-# While we do so, we should keep in mind that any conclusion we way draw will
-# be about
-# the model that we build, rather than about the true (real-world) generative
-# process of the data.
+# While we do so, we should keep in mind that any conclusion we draw is
+# about the model that we build, rather than about the true (real-world)
+# generative process of the data.
 #
 # Interpreting coefficients: scale matters
 # ---------------------------------------------
@@ -218,7 +222,7 @@
 ##############################################################################
 # The AGE coefficient is expressed in "dollars/hour per living years" while the
 # EDUCATION one is expressed in "dollars/hour per years of education". This
-# representation of the coefficients has the advantage of making clear the
+# representation of the coefficients has the benefit of making clear the
 # practical predictions of the model: an increase of :math:`1` year in AGE
 # means a decrease of :math:`0.030867` dollars/hour, while an increase of
 # :math:`1` year in EDUCATION means an increase of :math:`0.054699`
@@ -227,7 +231,7 @@
 # are expressed in dollars/hour. Then, we cannot compare the magnitude of
 # different coefficients since the features have different natural scales, and
 # hence value ranges, because of their different unit of measure. This is more
-# evident if we plot the coefficients.
+# visible if we plot the coefficients.

 coefs.plot(kind='barh', figsize=(9, 7))
 plt.title('Ridge model, small regularization')
@@ -237,12 +241,15 @@
 ##############################################################################
 # Indeed, from the plot above the most important factor in determining WAGE
 # appears to be the
-# variable UNION, even if it is plausible that variables like EXPERIENCE
-# should have more impact.
-# Looking at the coefficient plot to extrapolate feature importance could be
+# variable UNION, even if our intuition might tell us that variables
+# like EXPERIENCE should have more impact.
+#
+# Looking at the coefficient plot to gauge feature importance can be
 # misleading as some of them vary on a small scale, while others, like AGE,
 # varies a lot more, several decades.
-# This is evident if we compare feature standard deviations.
+#
+# This is visible if we compare the standard deviations of different
+# features.

 X_train_preprocessed = pd.DataFrame(
     model.named_steps['columntransformer'].transform(X_train),
@@ -296,8 +303,11 @@
 # Checking the variability of the coefficients
 # --------------------------------------------
 #
-# We can check the coefficient variability through cross-validation.
-# If coefficients vary in a significant way changing the input dataset
+# We can check the coefficient variability through cross-validation:
+# it is a form of data perturbation (related to
+# `resampling `_).
+#
+# If coefficients vary significantly when changing the input dataset
 # their robustness is not guaranteed, and they should probably be interpreted
 # with caution.
@@ -330,6 +340,7 @@
 # might be due to the collinearity between the 2 features: as AGE and
 # EXPERIENCE vary together in the data, their effect is difficult to tease
 # apart.
+#
 # To verify this interpretation we plot the variability of the AGE and
 # EXPERIENCE coefficient.
 #
@@ -446,7 +457,7 @@
 plt.subplots_adjust(left=.3)

 ##############################################################################
-# We cross validate the coefficients.
+# We now inspect the coefficients across several cross-validation folds.
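
The cross-validation code follows below; before that, the earlier point about feature scale can be made concrete with a tiny synthetic illustration (the feature names and data here are made up and unrelated to the survey):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge

    rng = np.random.RandomState(0)
    n_samples = 500
    # Two equally informative features living on very different scales.
    demo_X = pd.DataFrame({
        "small_scale": rng.normal(0, 1, n_samples),    # std close to 1
        "large_scale": rng.normal(0, 100, n_samples),  # std close to 100
    })
    demo_y = (demo_X["small_scale"] + 0.01 * demo_X["large_scale"]
              + rng.normal(0, 0.1, n_samples))

    ridge = Ridge(alpha=1e-6).fit(demo_X, demo_y)
    raw = pd.Series(ridge.coef_, index=demo_X.columns)
    rescaled = raw * demo_X.std(axis=0)
    print(pd.DataFrame({"raw coefficient": raw,
                        "coefficient * feature std": rescaled}))
    # The raw coefficient of "large_scale" looks about 100 times smaller, yet
    # both features contribute equally once their scale is taken into account.
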

 cv_model = cross_validate(
     model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -472,11 +483,12 @@
 #
 # In machine-learning practice, Ridge Regression is more often used with
 # non-negligible regularization.
+#
 # Above, we limited this regularization to a very little amount.
 # Regularization improves the conditioning of the problem and reduces the
 # variance of the estimates. RidgeCV applies cross validation in order to
 # determine which value of the regularization parameter (`alpha`) is best
-# suited for the model estimation.
+# suited for prediction.

 from sklearn.linear_model import RidgeCV

@@ -492,7 +504,7 @@
 _ = model.fit(X_train, y_train)

 ##############################################################################
-# First we verify which value of :math:`\alpha` has been selected.
+# First we check which value of :math:`\alpha` has been selected.

 model[-1].regressor_.alpha_

@@ -533,15 +545,18 @@

 ##############################################################################
 # The coefficients are significantly different.
-# AGE and EXPERIENCE coefficients are both positive but they have less
+# AGE and EXPERIENCE coefficients are both positive but they now have less
 # influence on the prediction.
-# The regularization manages to lower the influence of correlated
+#
+# The regularization reduces the influence of correlated
 # variables on the model because the weight is shared between the two
-# predictive variables, so neither alone would be very strongly weighted.
-# On the other hand, those weights are more robust with respect to
-# cross validation (see the :ref:`ridge_regression` User Guide section),
-# as is shown in the plot below to be compared with the
-# :ref:`previous one`.
+# predictive variables, so neither alone would have strong weights.
+#
+# On the other hand, the weights obtained with regularization are more
+# stable (see the :ref:`ridge_regression` User Guide section). This
+# increased stability is visible in the plot below, obtained from data
+# perturbations in a cross-validation, and can be compared with
+# the :ref:`previous one`.

 cv_model = cross_validate(
     model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
@@ -632,14 +647,25 @@
 # A Lasso model identifies the correlation between
 # AGE and EXPERIENCE and suppresses one of them for the sake of the prediction.
 #
+# It is important to keep in mind that the coefficients that have been
+# dropped may still be related to the outcome by themselves: the model
+# chose to suppress them because they bring little or no additional
+# information on top of the other features. Additionally, this selection
+# is unstable for correlated features, and should be interpreted with
+# caution.
+#
 # Lessons learned
 # ---------------
 #
-# * Feature importance could be extrapolated from the coefficients only after
-#   having scaled them to the same unit of measure.
-# * Coefficients in multiple linear models represent conditional dependencies
-#   between a given feature and the target.
-# * Correlated features induce variability in the coefficients of linear
-#   models.
+# * Coefficients must be scaled to the same unit of measure to retrieve
+#   feature importance. Scaling them with the standard-deviation of the
+#   feature is a useful proxy.
+# * Coefficients in multivariate linear models represent the dependency
+#   between a given feature and the target, **conditional** on the other
+#   features.
+# * Correlated features induce instabilities in the coefficients of linear
+#   models and their effects cannot be well teased apart.
 # * Different linear models respond differently to feature correlation and
 #   coefficients could significantly vary from one another.
+# * Inspecting coefficients across the folds of a cross-validation loop
+#   gives an idea of their stability.
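
To make that last lesson concrete, here is a small self-contained sketch, on synthetic and deliberately correlated features rather than the survey data, of collecting coefficients across cross-validation folds with `cross_validate(..., return_estimator=True)`:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import RepeatedKFold, cross_validate

    rng = np.random.RandomState(0)
    n_samples = 400
    # Two strongly correlated features, mimicking AGE and EXPERIENCE.
    age = rng.uniform(18, 65, n_samples)
    experience = age - 18 + rng.normal(0, 2, n_samples)
    demo_X = pd.DataFrame({"AGE": age, "EXPERIENCE": experience})
    demo_y = 0.5 * experience + rng.normal(0, 5, n_samples)

    cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
    cv_result = cross_validate(
        Ridge(alpha=1e-10), demo_X, demo_y, cv=cv, return_estimator=True
    )
    # One row of coefficients per fold: their spread reflects (in)stability.
    coefs_per_fold = pd.DataFrame(
        [est.coef_ for est in cv_result["estimator"]], columns=demo_X.columns
    )
    print(coefs_per_fold.describe().loc[["mean", "std"]])

With almost no regularization and strongly correlated features, the per-fold coefficients can vary widely; increasing `alpha` shrinks that spread, in line with the lessons listed above.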