diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index f3f9b77152ab2..762a2a9ef74be 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -50,7 +50,7 @@ and will store the coefficients :math:`w` of the linear model in its The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the -columns of the design matrix :math:`X` have an approximate linear +columns of the design matrix :math:`X` have an approximately linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large @@ -68,7 +68,7 @@ It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods). :class:`LinearRegression` accepts a boolean ``positive`` -parameter: when set to `True` `Non Negative Least Squares +parameter: when set to `True` `Non-Negative Least Squares `_ are then applied. .. topic:: Examples: @@ -140,15 +140,15 @@ the output with the highest value. It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge -losses. However in practice all those models can lead to similar +losses. However, in practice, all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the :class:`RidgeClassifier` allows for a very different choice of the numerical solvers with distinct computational performance profiles. The :class:`RidgeClassifier` can be significantly faster than e.g. -:class:`LogisticRegression` with a high number of classes, because it is -able to compute the projection matrix :math:`(X^T X)^{-1} X^T` only once. +:class:`LogisticRegression` with a high number of classes because it can +compute the projection matrix :math:`(X^T X)^{-1} X^T` only once. This classifier is sometimes referred to as a `Least Squares Support Vector Machines @@ -210,7 +210,7 @@ Lasso The :class:`Lasso` is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of -features upon which the given solution is dependent. For this reason +features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients (see @@ -309,7 +309,7 @@ as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. However, such criteria needs a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model -is correct, i.e. that the data are actually generated by this model. +is correct, i.e. that the data are generated by this model. They also tend to break when the problem is badly conditioned (more features than samples). @@ -393,7 +393,7 @@ the regularization properties of :class:`Ridge`. We control the convex combination of :math:`\ell_1` and :math:`\ell_2` using the ``l1_ratio`` parameter. 
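For instance, an equal mix of the two penalties could be requested as follows (a minimal sketch, not part of the patched documentation; it assumes training data ``X`` and ``y`` are already available and the value ``alpha=0.1`` is arbitrary)::

    >>> from sklearn.linear_model import ElasticNet
    >>> regr = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio=0.5 weights l1 and l2 equally
    >>> regr.fit(X, y)  # doctest: +SKIP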
-Elastic-net is useful when there are multiple features which are +Elastic-net is useful when there are multiple features that are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both. @@ -500,7 +500,7 @@ The disadvantages of the LARS method include: in the discussion section of the Efron et al. (2004) Annals of Statistics article. -The LARS model can be used using estimator :class:`Lars`, or its +The LARS model can be used via the estimator :class:`Lars`, or its low-level implementation :func:`lars_path` or :func:`lars_path_gram`. @@ -546,7 +546,7 @@ the residual. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the :math:`\ell_1` norm of the parameter vector. The full coefficients path is stored in the array -``coef_path_``, which has size (n_features, max_features+1). The first +``coef_path_`` of shape `(n_features, max_features + 1)`. The first column is always zero. .. topic:: References: diff --git a/examples/inspection/plot_linear_model_coefficient_interpretation.py b/examples/inspection/plot_linear_model_coefficient_interpretation.py index 459b180f00e36..aca0369cdbb03 100644 --- a/examples/inspection/plot_linear_model_coefficient_interpretation.py +++ b/examples/inspection/plot_linear_model_coefficient_interpretation.py @@ -1,7 +1,7 @@ """ -================================================================== -Common pitfalls in interpretation of coefficients of linear models -================================================================== +====================================================================== +Common pitfalls in the interpretation of coefficients of linear models +====================================================================== In linear models, the target value is modeled as a linear combination of the features (see the :ref:`linear_model` User Guide @@ -77,9 +77,7 @@ from sklearn.model_selection import train_test_split -X_train, X_test, y_train, y_test = train_test_split( - X, y, random_state=42 -) +X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) # %% # First, let's get some insights by looking at the variable distributions and @@ -90,7 +88,7 @@ train_dataset = X_train.copy() train_dataset.insert(0, "WAGE", y_train) -_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde') +_ = sns.pairplot(train_dataset, kind="reg", diag_kind="kde") # %% # Looking closely at the WAGE distribution reveals that it has a @@ -131,13 +129,11 @@ from sklearn.compose import make_column_transformer from sklearn.preprocessing import OneHotEncoder -categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR', - 'MARR', 'UNION', 'SEX', 'SOUTH'] -numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE'] +categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"] +numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"] preprocessor = make_column_transformer( - (OneHotEncoder(drop='if_binary'), categorical_columns), - remainder='passthrough' + (OneHotEncoder(drop="if_binary"), categorical_columns), remainder="passthrough" ) # %% @@ -152,10 +148,8 @@ model = make_pipeline( preprocessor, TransformedTargetRegressor( - regressor=Ridge(alpha=1e-10), - func=np.log10, - inverse_func=sp.special.exp10 - ) + regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=sp.special.exp10 + ), ) # %% @@ -176,17 +170,17 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on 
training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(5, 5)) plt.scatter(y_test, y_pred) ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red") plt.text(3, 20, string_score) -plt.title('Ridge model, small regularization') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, small regularization") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -206,15 +200,17 @@ # First of all, we can take a look to the values of the coefficients of the # regressor we have fitted. -feature_names = (model.named_steps['columntransformer'] - .named_transformers_['onehotencoder'] - .get_feature_names(input_features=categorical_columns)) -feature_names = np.concatenate( - [feature_names, numerical_columns]) +feature_names = ( + model.named_steps["columntransformer"] + .named_transformers_["onehotencoder"] + .get_feature_names(input_features=categorical_columns) +) +feature_names = np.concatenate([feature_names, numerical_columns]) coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) coefs @@ -233,10 +229,10 @@ # hence value ranges, because of their different unit of measure. This is more # visible if we plot the coefficients. -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # Indeed, from the plot above the most important factor in determining WAGE @@ -252,13 +248,12 @@ # features. X_train_preprocessed = pd.DataFrame( - model.named_steps['columntransformer'].transform(X_train), - columns=feature_names + model.named_steps["columntransformer"].transform(X_train), columns=feature_names ) -X_train_preprocessed.std(axis=0).plot(kind='barh', figsize=(9, 7)) -plt.title('Features std. dev.') -plt.subplots_adjust(left=.3) +X_train_preprocessed.std(axis=0).plot(kind="barh", figsize=(9, 7)) +plt.title("Features std. dev.") +plt.subplots_adjust(left=0.3) # %% # Multiplying the coefficients by the standard deviation of the related @@ -273,14 +268,15 @@ # coefficient on the output, all else being equal. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0), - columns=['Coefficient importance'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0), + columns=["Coefficient importance"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # Now that the coefficients have been scaled, we can safely compare them. 
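# %%
# Since the regressor above was fitted on ``np.log10(WAGE)``, a scaled
# coefficient ``c`` (raw coefficient times the feature's standard deviation)
# means that increasing that feature by one standard deviation multiplies the
# predicted wage by roughly ``10**c``, all else being equal. As a quick
# illustration (a sketch reusing the ``coefs`` table computed just above; it
# is not part of the original example):

10 ** coefs.loc["AGE", "Coefficient importance"]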
@@ -315,22 +311,28 @@ from sklearn.model_selection import RepeatedKFold cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0) - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names, ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.xlabel('Coefficient importance') -plt.title('Coefficient importance and its variability') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.xlabel("Coefficient importance") +plt.title("Coefficient importance and its variability") +plt.subplots_adjust(left=0.3) # %% # The problem of correlated variables @@ -346,14 +348,13 @@ # # .. _covariation: -plt.ylabel('Age coefficient') -plt.xlabel('Experience coefficient') +plt.ylabel("Age coefficient") +plt.xlabel("Experience coefficient") plt.grid(True) plt.xlim(-0.4, 0.5) plt.ylim(-0.4, 0.5) plt.scatter(coefs["AGE"], coefs["EXPERIENCE"]) -_ = plt.title('Co-variations of coefficients for AGE and EXPERIENCE ' - 'across folds') +_ = plt.title("Co-variations of coefficients for AGE and EXPERIENCE across folds") # %% # Two regions are populated: when the EXPERIENCE coefficient is @@ -362,26 +363,31 @@ # To go further we remove one of the 2 features and check what is the impact # on the model stability. 
-column_to_drop = ['AGE'] +column_to_drop = ["AGE"] cv_model = cross_validate( - model, X.drop(columns=column_to_drop), y, + model, + X.drop(columns=column_to_drop), + y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.drop(columns=column_to_drop).std(axis=0) - for est in cv_model['estimator']], - columns=feature_names[:-1] + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.drop(columns=column_to_drop).std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names[:-1], ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.title('Coefficient importance and its variability') -plt.xlabel('Coefficient importance') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.title("Coefficient importance and its variability") +plt.xlabel("Coefficient importance") +plt.subplots_adjust(left=0.3) # %% # The estimation of the EXPERIENCE coefficient is now less variable and @@ -402,9 +408,9 @@ from sklearn.preprocessing import StandardScaler preprocessor = make_column_transformer( - (OneHotEncoder(drop='if_binary'), categorical_columns), + (OneHotEncoder(drop="if_binary"), categorical_columns), (StandardScaler(), numerical_columns), - remainder='passthrough' + remainder="passthrough", ) # %% @@ -413,10 +419,8 @@ model = make_pipeline( preprocessor, TransformedTargetRegressor( - regressor=Ridge(alpha=1e-10), - func=np.log10, - inverse_func=sp.special.exp10 - ) + regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=sp.special.exp10 + ), ) _ = model.fit(X_train, y_train) @@ -428,19 +432,19 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red") plt.text(3, 20, string_score) -plt.title('Ridge model, small regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, small regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -448,32 +452,39 @@ # For the coefficient analysis, scaling is not needed this time. 
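# %%
# That is because the numerical features now have unit variance after the
# ``StandardScaler`` step, so their coefficients are directly comparable.
# A quick way to verify this (a sketch reusing the refitted ``model`` pipeline
# and ``feature_names`` from above; the ``X_train_scaled`` name is introduced
# only for this illustration and is not part of the original example):

X_train_scaled = pd.DataFrame(
    model.named_steps["columntransformer"].transform(X_train),
    columns=feature_names,
)
X_train_scaled.std(axis=0)  # the numerical columns should be close to 1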
coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # We now inspect the coefficients across several cross-validation folds. cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + for est in cv_model["estimator"] + ], + columns=feature_names, ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.title('Coefficient variability') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.title("Coefficient variability") +plt.subplots_adjust(left=0.3) # %% # The result is quite similar to the non-normalized case. @@ -497,8 +508,8 @@ TransformedTargetRegressor( regressor=RidgeCV(alphas=np.logspace(-10, 10, 21)), func=np.log10, - inverse_func=sp.special.exp10 - ) + inverse_func=sp.special.exp10, + ), ) _ = model.fit(X_train, y_train) @@ -513,10 +524,10 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) @@ -524,9 +535,9 @@ plt.text(3, 20, string_score) -plt.title('Ridge model, regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -535,13 +546,14 @@ # the one of the non-regularized model. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # The coefficients are significantly different. @@ -559,24 +571,29 @@ # the :ref:`previous one`. 
cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0) - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names, ) -plt.ylabel('Age coefficient') -plt.xlabel('Experience coefficient') +plt.ylabel("Age coefficient") +plt.xlabel("Experience coefficient") plt.grid(True) plt.xlim(-0.4, 0.5) plt.ylim(-0.4, 0.5) plt.scatter(coefs["AGE"], coefs["EXPERIENCE"]) -_ = plt.title('Co-variations of coefficients for AGE and EXPERIENCE ' - 'across folds') +_ = plt.title("Co-variations of coefficients for AGE and EXPERIENCE across folds") # %% # Linear models with sparse coefficients @@ -598,8 +615,8 @@ TransformedTargetRegressor( regressor=LassoCV(alphas=np.logspace(-10, 10, 21), max_iter=100000), func=np.log10, - inverse_func=sp.special.exp10 - ) + inverse_func=sp.special.exp10, + ), ) _ = model.fit(X_train, y_train) @@ -614,10 +631,10 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) @@ -625,9 +642,9 @@ plt.text(3, 20, string_score) -plt.title('Lasso model, regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Lasso model, regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -635,13 +652,14 @@ # For our dataset, again the model is not very predictive. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Lasso model, regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Lasso model, regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # A Lasso model identifies the correlation between