From 63a97b2837ca358297f21deb2d28212b903ab4e8 Mon Sep 17 00:00:00 2001 From: Mohamed Khoualed Date: Thu, 25 Feb 2021 10:44:07 +0100 Subject: [PATCH 1/2] Fixing documentaion typos --- doc/modules/linear_model.rst | 36 +++++++++---------- ...linear_model_coefficient_interpretation.py | 2 +- 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index f1f376dc641c9..152e2bc024068 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -50,8 +50,8 @@ and will store the coefficients :math:`w` of the linear model in its The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the -columns of the design matrix :math:`X` have an approximate linear -dependence, the design matrix becomes close to singular +columns of the design matrix :math:`X` have an approximately linear +dependence, the design matrix becomes close to a singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of *multicollinearity* can arise, for @@ -68,7 +68,7 @@ It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods). :class:`LinearRegression` accepts a boolean ``positive`` -parameter: when set to `True` `Non Negative Least Squares +parameter: when set to `True` `Non-Negative Least Squares `_ are then applied. .. topic:: Examples: @@ -140,15 +140,15 @@ the output with the highest value. It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge -losses. However in practice all those models can lead to similar +losses. However, in practice, all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the :class:`RidgeClassifier` allows for a very different choice of the numerical solvers with distinct computational performance profiles. The :class:`RidgeClassifier` can be significantly faster than e.g. -:class:`LogisticRegression` with a high number of classes, because it is -able to compute the projection matrix :math:`(X^T X)^{-1} X^T` only once. +:class:`LogisticRegression` with a high number of classes because it can +compute the projection matrix :math:`(X^T X)^{-1} X^T` only once. This classifier is sometimes referred to as a `Least Squares Support Vector Machines @@ -210,7 +210,7 @@ Lasso The :class:`Lasso` is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of -features upon which the given solution is dependent. For this reason +features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients (see @@ -306,10 +306,10 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the Akaike information criterion (AIC) and the Bayes Information criterion (BIC). It is a computationally cheaper alternative to find the optimal value of alpha as the regularization path is computed only once instead of k+1 times -when using k-fold cross-validation. However, such criteria needs a +when using k-fold cross-validation. However, such criteria need a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model -is correct, i.e. that the data are actually generated by this model. +is correct, i.e. that the data are generated by this model. They also tend to break when the problem is badly conditioned (more features than samples). @@ -393,7 +393,7 @@ the regularization properties of :class:`Ridge`. We control the convex combination of :math:`\ell_1` and :math:`\ell_2` using the ``l1_ratio`` parameter. -Elastic-net is useful when there are multiple features which are +Elastic-net is useful when there are multiple features that are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both. @@ -469,8 +469,8 @@ Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it finds the feature most correlated with the -target. When there are multiple features having equal correlation, instead -of continuing along the same feature, it proceeds in a direction equiangular +target. When multiple features having equal correlation, instead +of continuing along with the same feature, it proceeds in a direction equiangular between the features. The advantages of LARS are: @@ -478,7 +478,7 @@ The advantages of LARS are: - It is numerically efficient in contexts where the number of features is significantly greater than the number of samples. - - It is computationally just as fast as forward selection and has + - It is computationally just as fast as forwarding selection and has the same order of complexity as ordinary least squares. - It produces a full piecewise linear solution path, which is @@ -500,7 +500,7 @@ The disadvantages of the LARS method include: in the discussion section of the Efron et al. (2004) Annals of Statistics article. -The LARS model can be used using estimator :class:`Lars`, or its +The LARS model can be used using the estimator :class:`Lars`, or its low-level implementation :func:`lars_path` or :func:`lars_path_gram`. @@ -530,7 +530,7 @@ function of the norm of its coefficients. * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars.py` -The Lars algorithm provides the full path of the coefficients along +The Lars algorithm provides the full path of the coefficients along with the regularization parameter almost for free, thus a common operation is to retrieve the path with one of the functions :func:`lars_path` or :func:`lars_path_gram`. @@ -546,7 +546,7 @@ the residual. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the :math:`\ell_1` norm of the parameter vector. The full coefficients path is stored in the array -``coef_path_``, which has size (n_features, max_features+1). The first +``coef_path_``, which has a size (n_features, max_features+1). The first column is always zero. .. topic:: References: @@ -565,13 +565,13 @@ algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (ie. the :math:`\ell_0` pseudo-norm). Being a forward feature selection method like :ref:`least_angle_regression`, -orthogonal matching pursuit can approximate the optimum solution vector with a +the orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements: .. math:: \underset{w}{\operatorname{arg\,min\,}} ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}} -Alternatively, orthogonal matching pursuit can target a specific error instead +Alternatively,the orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as: .. math:: diff --git a/examples/inspection/plot_linear_model_coefficient_interpretation.py b/examples/inspection/plot_linear_model_coefficient_interpretation.py index 459b180f00e36..8f7d3d26e001e 100644 --- a/examples/inspection/plot_linear_model_coefficient_interpretation.py +++ b/examples/inspection/plot_linear_model_coefficient_interpretation.py @@ -1,6 +1,6 @@ """ ================================================================== -Common pitfalls in interpretation of coefficients of linear models +Common pitfalls in the interpretation of coefficients of linear models ================================================================== In linear models, the target value is modeled as From d9267c64da962d62295ca3fe2ea5f8701e8d9c5c Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre Date: Thu, 29 Jul 2021 17:38:57 +0200 Subject: [PATCH 2/2] Apply ogrisel suggestions --- doc/modules/linear_model.rst | 20 +- ...linear_model_coefficient_interpretation.py | 288 ++++++++++-------- 2 files changed, 163 insertions(+), 145 deletions(-) diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index 576fe74be03f5..762a2a9ef74be 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -51,7 +51,7 @@ and will store the coefficients :math:`w` of the linear model in its The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix :math:`X` have an approximately linear -dependence, the design matrix becomes close to a singular +dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of *multicollinearity* can arise, for @@ -306,7 +306,7 @@ Alternatively, the estimator :class:`LassoLarsIC` proposes to use the Akaike information criterion (AIC) and the Bayes Information criterion (BIC). It is a computationally cheaper alternative to find the optimal value of alpha as the regularization path is computed only once instead of k+1 times -when using k-fold cross-validation. However, such criteria need a +when using k-fold cross-validation. However, such criteria needs a proper estimation of the degrees of freedom of the solution, are derived for large samples (asymptotic results) and assume the model is correct, i.e. that the data are generated by this model. @@ -469,8 +469,8 @@ Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it finds the feature most correlated with the -target. When multiple features having equal correlation, instead -of continuing along with the same feature, it proceeds in a direction equiangular +target. When there are multiple features having equal correlation, instead +of continuing along the same feature, it proceeds in a direction equiangular between the features. The advantages of LARS are: @@ -478,7 +478,7 @@ The advantages of LARS are: - It is numerically efficient in contexts where the number of features is significantly greater than the number of samples. - - It is computationally just as fast as forwarding selection and has + - It is computationally just as fast as forward selection and has the same order of complexity as ordinary least squares. - It produces a full piecewise linear solution path, which is @@ -500,7 +500,7 @@ The disadvantages of the LARS method include: in the discussion section of the Efron et al. (2004) Annals of Statistics article. -The LARS model can be used using the estimator :class:`Lars`, or its +The LARS model can be used using via the estimator :class:`Lars`, or its low-level implementation :func:`lars_path` or :func:`lars_path_gram`. @@ -530,7 +530,7 @@ function of the norm of its coefficients. * :ref:`sphx_glr_auto_examples_linear_model_plot_lasso_lars.py` -The Lars algorithm provides the full path of the coefficients along with +The Lars algorithm provides the full path of the coefficients along the regularization parameter almost for free, thus a common operation is to retrieve the path with one of the functions :func:`lars_path` or :func:`lars_path_gram`. @@ -546,7 +546,7 @@ the residual. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the :math:`\ell_1` norm of the parameter vector. The full coefficients path is stored in the array -``coef_path_``, which has a size (n_features, max_features+1). The first +``coef_path_`` of shape `(n_features, max_features + 1)`. The first column is always zero. .. topic:: References: @@ -565,13 +565,13 @@ algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (ie. the :math:`\ell_0` pseudo-norm). Being a forward feature selection method like :ref:`least_angle_regression`, -the orthogonal matching pursuit can approximate the optimum solution vector with a +orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements: .. math:: \underset{w}{\operatorname{arg\,min\,}} ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}} -Alternatively,the orthogonal matching pursuit can target a specific error instead +Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. This can be expressed as: .. math:: diff --git a/examples/inspection/plot_linear_model_coefficient_interpretation.py b/examples/inspection/plot_linear_model_coefficient_interpretation.py index 8f7d3d26e001e..aca0369cdbb03 100644 --- a/examples/inspection/plot_linear_model_coefficient_interpretation.py +++ b/examples/inspection/plot_linear_model_coefficient_interpretation.py @@ -1,7 +1,7 @@ """ -================================================================== +====================================================================== Common pitfalls in the interpretation of coefficients of linear models -================================================================== +====================================================================== In linear models, the target value is modeled as a linear combination of the features (see the :ref:`linear_model` User Guide @@ -77,9 +77,7 @@ from sklearn.model_selection import train_test_split -X_train, X_test, y_train, y_test = train_test_split( - X, y, random_state=42 -) +X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) # %% # First, let's get some insights by looking at the variable distributions and @@ -90,7 +88,7 @@ train_dataset = X_train.copy() train_dataset.insert(0, "WAGE", y_train) -_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde') +_ = sns.pairplot(train_dataset, kind="reg", diag_kind="kde") # %% # Looking closely at the WAGE distribution reveals that it has a @@ -131,13 +129,11 @@ from sklearn.compose import make_column_transformer from sklearn.preprocessing import OneHotEncoder -categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR', - 'MARR', 'UNION', 'SEX', 'SOUTH'] -numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE'] +categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"] +numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"] preprocessor = make_column_transformer( - (OneHotEncoder(drop='if_binary'), categorical_columns), - remainder='passthrough' + (OneHotEncoder(drop="if_binary"), categorical_columns), remainder="passthrough" ) # %% @@ -152,10 +148,8 @@ model = make_pipeline( preprocessor, TransformedTargetRegressor( - regressor=Ridge(alpha=1e-10), - func=np.log10, - inverse_func=sp.special.exp10 - ) + regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=sp.special.exp10 + ), ) # %% @@ -176,17 +170,17 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(5, 5)) plt.scatter(y_test, y_pred) ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red") plt.text(3, 20, string_score) -plt.title('Ridge model, small regularization') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, small regularization") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -206,15 +200,17 @@ # First of all, we can take a look to the values of the coefficients of the # regressor we have fitted. -feature_names = (model.named_steps['columntransformer'] - .named_transformers_['onehotencoder'] - .get_feature_names(input_features=categorical_columns)) -feature_names = np.concatenate( - [feature_names, numerical_columns]) +feature_names = ( + model.named_steps["columntransformer"] + .named_transformers_["onehotencoder"] + .get_feature_names(input_features=categorical_columns) +) +feature_names = np.concatenate([feature_names, numerical_columns]) coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) coefs @@ -233,10 +229,10 @@ # hence value ranges, because of their different unit of measure. This is more # visible if we plot the coefficients. -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # Indeed, from the plot above the most important factor in determining WAGE @@ -252,13 +248,12 @@ # features. X_train_preprocessed = pd.DataFrame( - model.named_steps['columntransformer'].transform(X_train), - columns=feature_names + model.named_steps["columntransformer"].transform(X_train), columns=feature_names ) -X_train_preprocessed.std(axis=0).plot(kind='barh', figsize=(9, 7)) -plt.title('Features std. dev.') -plt.subplots_adjust(left=.3) +X_train_preprocessed.std(axis=0).plot(kind="barh", figsize=(9, 7)) +plt.title("Features std. dev.") +plt.subplots_adjust(left=0.3) # %% # Multiplying the coefficients by the standard deviation of the related @@ -273,14 +268,15 @@ # coefficient on the output, all else being equal. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0), - columns=['Coefficient importance'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0), + columns=["Coefficient importance"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # Now that the coefficients have been scaled, we can safely compare them. @@ -315,22 +311,28 @@ from sklearn.model_selection import RepeatedKFold cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0) - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names, ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.xlabel('Coefficient importance') -plt.title('Coefficient importance and its variability') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.xlabel("Coefficient importance") +plt.title("Coefficient importance and its variability") +plt.subplots_adjust(left=0.3) # %% # The problem of correlated variables @@ -346,14 +348,13 @@ # # .. _covariation: -plt.ylabel('Age coefficient') -plt.xlabel('Experience coefficient') +plt.ylabel("Age coefficient") +plt.xlabel("Experience coefficient") plt.grid(True) plt.xlim(-0.4, 0.5) plt.ylim(-0.4, 0.5) plt.scatter(coefs["AGE"], coefs["EXPERIENCE"]) -_ = plt.title('Co-variations of coefficients for AGE and EXPERIENCE ' - 'across folds') +_ = plt.title("Co-variations of coefficients for AGE and EXPERIENCE across folds") # %% # Two regions are populated: when the EXPERIENCE coefficient is @@ -362,26 +363,31 @@ # To go further we remove one of the 2 features and check what is the impact # on the model stability. -column_to_drop = ['AGE'] +column_to_drop = ["AGE"] cv_model = cross_validate( - model, X.drop(columns=column_to_drop), y, + model, + X.drop(columns=column_to_drop), + y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.drop(columns=column_to_drop).std(axis=0) - for est in cv_model['estimator']], - columns=feature_names[:-1] + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.drop(columns=column_to_drop).std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names[:-1], ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.title('Coefficient importance and its variability') -plt.xlabel('Coefficient importance') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.title("Coefficient importance and its variability") +plt.xlabel("Coefficient importance") +plt.subplots_adjust(left=0.3) # %% # The estimation of the EXPERIENCE coefficient is now less variable and @@ -402,9 +408,9 @@ from sklearn.preprocessing import StandardScaler preprocessor = make_column_transformer( - (OneHotEncoder(drop='if_binary'), categorical_columns), + (OneHotEncoder(drop="if_binary"), categorical_columns), (StandardScaler(), numerical_columns), - remainder='passthrough' + remainder="passthrough", ) # %% @@ -413,10 +419,8 @@ model = make_pipeline( preprocessor, TransformedTargetRegressor( - regressor=Ridge(alpha=1e-10), - func=np.log10, - inverse_func=sp.special.exp10 - ) + regressor=Ridge(alpha=1e-10), func=np.log10, inverse_func=sp.special.exp10 + ), ) _ = model.fit(X_train, y_train) @@ -428,19 +432,19 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red") plt.text(3, 20, string_score) -plt.title('Ridge model, small regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, small regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -448,32 +452,39 @@ # For the coefficient analysis, scaling is not needed this time. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, small regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, small regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # We now inspect the coefficients across several cross-validation folds. cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + for est in cv_model["estimator"] + ], + columns=feature_names, ) plt.figure(figsize=(9, 7)) -sns.stripplot(data=coefs, orient='h', color='k', alpha=0.5) -sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5) -plt.axvline(x=0, color='.5') -plt.title('Coefficient variability') -plt.subplots_adjust(left=.3) +sns.stripplot(data=coefs, orient="h", color="k", alpha=0.5) +sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5) +plt.axvline(x=0, color=".5") +plt.title("Coefficient variability") +plt.subplots_adjust(left=0.3) # %% # The result is quite similar to the non-normalized case. @@ -497,8 +508,8 @@ TransformedTargetRegressor( regressor=RidgeCV(alphas=np.logspace(-10, 10, 21)), func=np.log10, - inverse_func=sp.special.exp10 - ) + inverse_func=sp.special.exp10, + ), ) _ = model.fit(X_train, y_train) @@ -513,10 +524,10 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) @@ -524,9 +535,9 @@ plt.text(3, 20, string_score) -plt.title('Ridge model, regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Ridge model, regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -535,13 +546,14 @@ # the one of the non-regularized model. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Ridge model, regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Ridge model, regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # The coefficients are significantly different. @@ -559,24 +571,29 @@ # the :ref:`previous one`. cv_model = cross_validate( - model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5), - return_estimator=True, n_jobs=-1 + model, + X, + y, + cv=RepeatedKFold(n_splits=5, n_repeats=5), + return_estimator=True, + n_jobs=-1, ) coefs = pd.DataFrame( - [est.named_steps['transformedtargetregressor'].regressor_.coef_ * - X_train_preprocessed.std(axis=0) - for est in cv_model['estimator']], - columns=feature_names + [ + est.named_steps["transformedtargetregressor"].regressor_.coef_ + * X_train_preprocessed.std(axis=0) + for est in cv_model["estimator"] + ], + columns=feature_names, ) -plt.ylabel('Age coefficient') -plt.xlabel('Experience coefficient') +plt.ylabel("Age coefficient") +plt.xlabel("Experience coefficient") plt.grid(True) plt.xlim(-0.4, 0.5) plt.ylim(-0.4, 0.5) plt.scatter(coefs["AGE"], coefs["EXPERIENCE"]) -_ = plt.title('Co-variations of coefficients for AGE and EXPERIENCE ' - 'across folds') +_ = plt.title("Co-variations of coefficients for AGE and EXPERIENCE across folds") # %% # Linear models with sparse coefficients @@ -598,8 +615,8 @@ TransformedTargetRegressor( regressor=LassoCV(alphas=np.logspace(-10, 10, 21), max_iter=100000), func=np.log10, - inverse_func=sp.special.exp10 - ) + inverse_func=sp.special.exp10, + ), ) _ = model.fit(X_train, y_train) @@ -614,10 +631,10 @@ y_pred = model.predict(X_train) mae = median_absolute_error(y_train, y_pred) -string_score = f'MAE on training set: {mae:.2f} $/hour' +string_score = f"MAE on training set: {mae:.2f} $/hour" y_pred = model.predict(X_test) mae = median_absolute_error(y_test, y_pred) -string_score += f'\nMAE on testing set: {mae:.2f} $/hour' +string_score += f"\nMAE on testing set: {mae:.2f} $/hour" fig, ax = plt.subplots(figsize=(6, 6)) plt.scatter(y_test, y_pred) @@ -625,9 +642,9 @@ plt.text(3, 20, string_score) -plt.title('Lasso model, regularization, normalized variables') -plt.ylabel('Model predictions') -plt.xlabel('Truths') +plt.title("Lasso model, regularization, normalized variables") +plt.ylabel("Model predictions") +plt.xlabel("Truths") plt.xlim([0, 27]) _ = plt.ylim([0, 27]) @@ -635,13 +652,14 @@ # For our dataset, again the model is not very predictive. coefs = pd.DataFrame( - model.named_steps['transformedtargetregressor'].regressor_.coef_, - columns=['Coefficients'], index=feature_names + model.named_steps["transformedtargetregressor"].regressor_.coef_, + columns=["Coefficients"], + index=feature_names, ) -coefs.plot(kind='barh', figsize=(9, 7)) -plt.title('Lasso model, regularization, normalized variables') -plt.axvline(x=0, color='.5') -plt.subplots_adjust(left=.3) +coefs.plot(kind="barh", figsize=(9, 7)) +plt.title("Lasso model, regularization, normalized variables") +plt.axvline(x=0, color=".5") +plt.subplots_adjust(left=0.3) # %% # A Lasso model identifies the correlation between