DOC Rework Importance of Feature Scaling example by ArturoAmorQ · Pull Request #25012 · scikit-learn/scikit-learn · GitHub

DOC Rework Importance of Feature Scaling example #25012


Merged
merged 35 commits into from Jan 19, 2023

Commits (35)
cf0280e
First step to improve notebook style
Nov 21, 2022
f2636d4
Add KNeighbors section and refactor narrative
Nov 22, 2022
43e87c3
Tweak
Nov 22, 2022
9dea8cc
Merge branch 'main' into scaling_importance
glemaitre Nov 23, 2022
9e373a2
Tweak
Nov 24, 2022
3ec7569
Merge branch 'scaling_importance' of github.com:ArturoAmorQ/scikit-le…
Nov 24, 2022
14eb647
Tweak
Nov 24, 2022
b15210b
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
Nov 24, 2022
4b6d685
Add clarifying text on using subset of features
Dec 1, 2022
edc65b8
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
Dec 5, 2022
e3afbaf
Tweak
Dec 6, 2022
6d14bb6
Merge branch 'main' into scaling_importance
jjerphan Dec 6, 2022
0491e92
Update examples/preprocessing/plot_scaling_importance.py
ArturoAmorQ Dec 9, 2022
41d98de
Merge branch 'scaling_importance' of github.com:ArturoAmorQ/scikit-le…
Dec 9, 2022
803a0a9
Apply wording suggestion from Christian
Dec 9, 2022
a2d5085
Apply suggestions from code review
ArturoAmorQ Dec 9, 2022
25e198e
Merge branch 'scaling_importance' of github.com:ArturoAmorQ/scikit-le…
Dec 9, 2022
d21224e
Use plt.show only for last plot
Dec 9, 2022
575c2ec
Use set_output to retain pandas frames
Dec 9, 2022
fb6ec53
Use pandas for displaying feature names
Dec 9, 2022
78c025c
Make plot more squared
Dec 9, 2022
0ea3afd
Add interpretation to plot
Dec 9, 2022
3da42f7
Add discussion on regularization parameter
Dec 12, 2022
407c69a
Add discussion on log-loss
Dec 12, 2022
da8f4ad
Apply suggestions from code review
ArturoAmorQ Dec 15, 2022
1cfbcfa
Fix format
Dec 15, 2022
7c40145
Avoid repetitive print
Jan 16, 2023
8ed2ad0
Merge log-loss and accuracy discussions
Jan 16, 2023
5702586
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
Jan 16, 2023
ff8ba3b
Apply suggestions from code review
ArturoAmorQ Jan 17, 2023
babd485
Fix format
Jan 17, 2023
da49dc5
Add barplot for easier visualization
Jan 17, 2023
de3dad1
Apply suggestions from code review
ArturoAmorQ Jan 19, 2023
62c495a
Avoid possible command line interruption created by plt.show
Jan 19, 2023
08fa0a3
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
Jan 19, 2023
284 changes: 200 additions & 84 deletions examples/preprocessing/plot_scaling_importance.py
@@ -1,106 +1,154 @@
# -*- coding: utf-8 -*-
"""
=========================================================
=============================
Importance of Feature Scaling
=========================================================

Feature scaling through standardization (or Z-score normalization)
can be an important preprocessing step for many machine learning
algorithms. Standardization involves rescaling the features such
that they have the properties of a standard normal distribution
with a mean of zero and a standard deviation of one.

While many algorithms (such as SVM, K-nearest neighbors, and logistic
regression) require features to be normalized, intuitively we can
think of Principal Component Analysis (PCA) as being a prime example
of when normalization is important. In PCA we are interested in the
components that maximize the variance. If one component (e.g. human
height) varies less than another (e.g. weight) because of their
respective scales (meters vs. kilos), PCA might determine that the
direction of maximal variance more closely corresponds with the
'weight' axis, if those features are not scaled. As a change in
height of one meter can be considered much more important than the
change in weight of one kilogram, this is clearly incorrect.

To illustrate this, :class:`PCA <sklearn.decomposition.PCA>`
is performed comparing the use of data with
:class:`StandardScaler <sklearn.preprocessing.StandardScaler>` applied,
to unscaled data. The results are visualized and a clear difference noted.
Looking at the 1st principal component in the unscaled set, it can be seen
that feature #13 dominates the direction, being a whole two orders of
magnitude above the other features. This is contrasted when observing
the principal component for the scaled version of the data. In the scaled
version, the orders of magnitude are roughly the same across all the features.

The dataset used is the Wine Dataset available at UCI. This dataset
has continuous features that are heterogeneous in scale due to differing
properties that they measure (i.e. alcohol content and malic acid).

The transformed data is then used to train a naive Bayes classifier, and a
clear difference in prediction accuracies is observed wherein the dataset
which is scaled before PCA vastly outperforms the unscaled version.
=============================

"""
import matplotlib.pyplot as plt
Feature scaling through standardization, also called Z-score normalization, is
an important preprocessing step for many machine learning algorithms. It
involves rescaling each feature such that it has a standard deviation of 1 and a
mean of 0.
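As a quick, hedged illustration of what standardization does (a minimal sketch, not part of the committed diff; it simply reloads the same wine data used later in the example):

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Standardizing each column of the wine data: mean ~0, standard deviation 1.
X_demo, _ = load_wine(return_X_y=True)
X_demo_std = StandardScaler().fit_transform(X_demo)
print(X_demo_std.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_demo_std.std(axis=0).round(6))   # exactly 1 (population std, ddof=0)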

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
Even if tree-based models are (almost) not affected by scaling, many other
algorithms require features to be normalized, often for different reasons: to
ease convergence (as for a non-penalized logistic regression), or to create a
completely different model fit compared to the fit with unscaled data (as for
KNeighbors models). The latter is demonstrated in the first part of the
present example.
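To make the convergence argument concrete, here is a rough sketch (not part of this diff; the default L2-penalized LogisticRegression is used instead of a non-penalized one to keep it short, and iteration counts vary with solver and scikit-learn version):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
lr = LogisticRegression(max_iter=10_000)  # large budget so both fits converge

lr.fit(X_demo, y_demo)
print("lbfgs iterations on raw data:   ", lr.n_iter_[0])

lr.fit(StandardScaler().fit_transform(X_demo), y_demo)
print("lbfgs iterations on scaled data:", lr.n_iter_[0])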

# Code source: Tyler Lanigan <tylerlanigan@gmail.com>
# Sebastian Raschka <mail@sebastianraschka.com>
In the second part of the example we show how Principal Component Analysis (PCA)
is impacted by normalization of features. To illustrate this, we compare the
principal components found using :class:`~sklearn.decomposition.PCA` on unscaled
data with those obtained when using a
:class:`~sklearn.preprocessing.StandardScaler` to scale data first.

In the last part of the example we show the effect of the normalization on the
accuracy of a model trained on PCA-reduced data.

"""

# Author: Tyler Lanigan <tylerlanigan@gmail.com>
# Sebastian Raschka <mail@sebastianraschka.com>
# Arturo Amor <david-arturo.amor-quiroz@inria.fr>
# License: BSD 3 clause

RANDOM_STATE = 42
FIG_SIZE = (10, 7)
# %%
# Load and prepare data
# =====================
#
# The dataset used is the :ref:`wine_dataset` available at UCI. This dataset has
# continuous features that are heterogeneous in scale due to differing
# properties that they measure (e.g. alcohol content and malic acid).

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features, target = load_wine(return_X_y=True)
X, y = load_wine(return_X_y=True, as_frame=True)
scaler = StandardScaler().set_output(transform="pandas")

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(
features, target, test_size=0.30, random_state=RANDOM_STATE
X, y, test_size=0.30, random_state=42
)
scaled_X_train = scaler.fit_transform(X_train)

Review comment (Member):

Optional: We could show the mean value of each feature, or min and max.
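One hedged sketch of what the reviewer's suggestion could look like (illustration only, not part of the committed diff):

from sklearn.datasets import load_wine

X_demo, _ = load_wine(return_X_y=True, as_frame=True)
# Mean, min and max per feature make the heterogeneous scales obvious.
print(X_demo.agg(["mean", "min", "max"]).T)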

# %%
# Effect of rescaling on a k-neighbors model
# ===========================================
#
# For the sake of visualizing the decision boundary of a
# :class:`~sklearn.neighbors.KNeighborsClassifier`, in this section we select a
# subset of 2 features that have values with different orders of magnitude.
#
# Keep in mind that using a subset of the features to train the model is likely
# to leave out features with high predictive impact, resulting in a decision
# boundary that is much worse in comparison to a model trained on the full set
# of features.

# Fit to data and predict using pipelined GNB and PCA
unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())
unscaled_clf.fit(X_train, y_train)
pred_test = unscaled_clf.predict(X_test)

# Fit to data and predict using pipelined scaling, GNB and PCA
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import KNeighborsClassifier

# Show prediction accuracies in scaled and unscaled data.
print("\nPrediction accuracy for the normal test dataset with PCA")
print(f"{accuracy_score(y_test, pred_test):.2%}\n")

print("\nPrediction accuracy for the standardized test dataset with PCA")
print(f"{accuracy_score(y_test, pred_test_std):.2%}\n")
X_plot = X[["proline", "hue"]]
X_plot_scaled = scaler.fit_transform(X_plot)
clf = KNeighborsClassifier(n_neighbors=20)

# Extract PCA from pipeline
pca = unscaled_clf.named_steps["pca"]
pca_std = std_clf.named_steps["pca"]

# Show first principal components
print(f"\nPC 1 without scaling:\n{pca.components_[0]}")
print(f"\nPC 1 with scaling:\n{pca_std.components_[0]}")
def fit_and_plot_model(X_plot, y, clf, ax):
    clf.fit(X_plot, y)
    disp = DecisionBoundaryDisplay.from_estimator(
        clf,
        X_plot,
        response_method="predict",
        alpha=0.5,
        ax=ax,
    )
    disp.ax_.scatter(X_plot["proline"], X_plot["hue"], c=y, s=20, edgecolor="k")
    disp.ax_.set_xlim((X_plot["proline"].min(), X_plot["proline"].max()))
    disp.ax_.set_ylim((X_plot["hue"].min(), X_plot["hue"].max()))
    return disp.ax_


fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6))

fit_and_plot_model(X_plot, y, clf, ax1)
ax1.set_title("KNN without scaling")

fit_and_plot_model(X_plot_scaled, y, clf, ax2)
ax2.set_xlabel("scaled proline")
ax2.set_ylabel("scaled hue")
_ = ax2.set_title("KNN with scaling")

# %%
# Here the desicion boundary shows that fitting scaled or non-scaled data lead
# to completely different models. The reason is that the variable "proline" has
# values which vary between 0 and 1,000; whereas the variable "hue" varies
# between 1 and 10. Because of this, distances between samples are mostly
# impacted by differences in values of "proline", while values of the "hue" will
# be comparatively ignored. If one uses
# :class:`~sklearn.preprocessing.StandardScaler` to normalize this database,
# both scaled values lay approximately between -3 and 3 and the neighbors
# structure will be impacted more or less equivalently by both variables.
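A small self-contained sketch of this effect (not part of the diff), comparing the Euclidean distance between the first two wine samples before and after standardization of the two selected features:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_wine(return_X_y=True, as_frame=True)
pair = X_demo[["proline", "hue"]].iloc[:2].to_numpy()
pair_scaled = StandardScaler().fit_transform(X_demo[["proline", "hue"]])[:2]

# Before scaling the distance is essentially the difference in "proline";
# after scaling both features contribute comparably.
print(f"unscaled distance: {np.linalg.norm(pair[0] - pair[1]):.3f}")
print(f"scaled distance:   {np.linalg.norm(pair_scaled[0] - pair_scaled[1]):.3f}")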
#
# Effect of rescaling on a PCA dimensional reduction
# ==================================================
#
# Dimensional reduction using :class:`~sklearn.decomposition.PCA` consists of
# finding the directions that maximize the variance. If one feature varies more
# than the others only because of their respective scales,
# :class:`~sklearn.decomposition.PCA` would determine that such a feature
# dominates the direction of the principal components.
#
# We can inspect the first principal components using all the original features:

import pandas as pd
from sklearn.decomposition import PCA

# Use PCA without and with scale on X_train data for visualization.
pca = PCA(n_components=2).fit(X_train)
scaled_pca = PCA(n_components=2).fit(scaled_X_train)
X_train_transformed = pca.transform(X_train)
X_train_std_transformed = scaled_pca.transform(scaled_X_train)

first_pca_component = pd.DataFrame(
    pca.components_[0], index=X.columns, columns=["without scaling"]
)
first_pca_component["with scaling"] = scaled_pca.components_[0]
first_pca_component.plot.bar(
    title="Weights of the first principal component", figsize=(6, 8)
)

_ = plt.tight_layout()

scaler = std_clf.named_steps["standardscaler"]
scaled_X_train = scaler.transform(X_train)
X_train_std_transformed = pca_std.transform(scaled_X_train)
# %%
# Indeed we find that the "proline" feature dominates the direction of the first
# principal component without scaling, being about two orders of magnitude above
# the other features. This contrasts with the first principal component of the
# scaled version of the data, where the orders of magnitude are roughly the same
# across all the features.
#
# We can visualize the distribution of the principal components in both cases:

# visualize standardized vs. untouched dataset with PCA performed
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

target_classes = range(0, 3)
colors = ("blue", "red", "green")
@@ -125,7 +173,7 @@
marker=marker,
)

ax1.set_title("Training dataset after PCA")
ax1.set_title("Unscaled training dataset after PCA")
ax2.set_title("Standardized training dataset after PCA")

for ax in (ax1, ax2):
@@ -134,6 +182,74 @@
ax.legend(loc="upper right")
ax.grid()

plt.tight_layout()
_ = plt.tight_layout()

# %%
# From the plot above we observe that scaling the features before reducing the
# dimensionality results in components with the same order of magnitude. In this
# case it also improves the separability of the clases. Indeed, in the next
# section we confirm that a better separability has a good repercussion on the
# overall model's performance.
#
# Effect of rescaling on model's performance
# ==========================================
#
# First we show how the optimal regularization of a
# :class:`~sklearn.linear_model.LogisticRegressionCV` depends on the scaling or
# non-scaling of the data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV

Cs = np.logspace(-5, 5, 20)

unscaled_clf = make_pipeline(pca, LogisticRegressionCV(Cs=Cs))
unscaled_clf.fit(X_train, y_train)

plt.show()
scaled_clf = make_pipeline(scaler, pca, LogisticRegressionCV(Cs=Cs))
scaled_clf.fit(X_train, y_train)

print(f"Optimal C for the unscaled PCA: {unscaled_clf[-1].C_[0]:.4f}\n")
print(f"Optimal C for the standardized data with PCA: {scaled_clf[-1].C_[0]:.2f}")

# %%
# The need for regularization is higher (lower values of `C`) for the data that
# was not scaled before applying PCA. We now evaluate the effect of scaling on
# the accuracy and the mean log-loss of the optimal models:

from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

y_pred = unscaled_clf.predict(X_test)
y_pred_scaled = scaled_clf.predict(X_test)
y_proba = unscaled_clf.predict_proba(X_test)
y_proba_scaled = scaled_clf.predict_proba(X_test)

print("Test accuracy for the unscaled PCA")
print(f"{accuracy_score(y_test, y_pred):.2%}\n")
print("Test accuracy for the standardized data with PCA")
print(f"{accuracy_score(y_test, y_pred_scaled):.2%}\n")
print("Log-loss for the unscaled PCA")
print(f"{log_loss(y_test, y_proba):.3}\n")
print("Log-loss for the standardized data with PCA")
print(f"{log_loss(y_test, y_proba_scaled):.3}")

# %%
# A clear difference in prediction accuracies is observed when the data is
# scaled before :class:`~sklearn.decomposition.PCA`, as it vastly outperforms
# the unscaled version. This corresponds to the intuition obtained from the plot
# in the previous section, where the components become linearly separable when
# scaling before using :class:`~sklearn.decomposition.PCA`.
#
# Notice that in this case the models with scaled features perform better than
# the models with non-scaled features because all the variables are expected to
# be predictive and scaling avoids some of them being comparatively ignored.
#
# If the variables with lower scales were not predictive, one may experience a
# decrease in performance after scaling the features: noisy features would
# contribute more to the prediction after scaling and therefore scaling would
# increase overfitting, as sketched below.
#
# Last but not least, we observe that one achieves a lower log-loss by means of
# the scaling step.
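A hedged sketch of that overfitting caveat (not part of this diff): append a few pure-noise features with a tiny scale and compare cross-validated scores of a k-NN classifier with and without standardization. Without scaling the noise is effectively ignored in the distances; after scaling it contributes like any other feature and may degrade the score.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)
rng = np.random.RandomState(0)
# Five uninformative features with a very small scale (std = 1e-3).
X_noisy = np.hstack([X_demo, rng.normal(scale=1e-3, size=(X_demo.shape[0], 5))])

unscaled_knn = KNeighborsClassifier(n_neighbors=20)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))

for model_name, model in [("unscaled k-NN", unscaled_knn), ("scaled k-NN", scaled_knn)]:
    for data_name, data in [("original features", X_demo), ("with noise features", X_noisy)]:
        score = cross_val_score(model, data, y_demo).mean()
        print(f"{model_name}, {data_name}: {score:.3f}")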