DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py by svenstehle · Pull Request #23184 · scikit-learn/scikit-learn
Merged
Commits (22)
23251ad
fix example plot_adaboost_hastie_10_2 to notebook-style
svenstehle Apr 22, 2022
fbe5421
move authors to bottom and add concluding remarks
svenstehle Apr 22, 2022
9686f2f
change authors formatting
svenstehle Apr 22, 2022
8e42548
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 23, 2022
c2b08bc
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 23, 2022
c5b5f30
make headings consistent
svenstehle Apr 23, 2022
3977ed1
move imports into cells of first use
svenstehle Apr 23, 2022
1142b97
move authors to top cell as comments
svenstehle Apr 23, 2022
ede0d30
remove '.' from citation so it does not concert 'J.' to '10'; add jou…
svenstehle Apr 23, 2022
7703a6d
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
svenstehle Apr 26, 2022
6831eaa
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 26, 2022
8e5d069
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 26, 2022
eee33be
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 26, 2022
2400509
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 26, 2022
0374455
Update examples/ensemble/plot_adaboost_hastie_10_2.py
svenstehle Apr 26, 2022
016b862
merging main changes into branch
svenstehle Apr 26, 2022
35ad84e
Merge branch 'doc_ensemble_plot_adaboost_hastie_10_2' of https://gith…
svenstehle Apr 26, 2022
e1c3418
fix authors indentation
svenstehle Apr 26, 2022
8a11fbb
remove blank line
svenstehle Apr 26, 2022
712753d
update example for colorblindness and with train_test_split
svenstehle Apr 27, 2022
8e6f491
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…
svenstehle Apr 27, 2022
3b5ec4d
fix blank line
svenstehle Apr 27, 2022
94 changes: 71 additions & 23 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -3,7 +3,7 @@
Discrete versus Real AdaBoost
=============================

This example is based on Figure 10.2 from Hastie et al 2009 [1]_ and
This notebook is based on Figure 10.2 from Hastie et al 2009 [1]_ and
illustrates the difference in performance between the discrete SAMME [2]_
boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are
evaluated on a binary classification task where the target Y is a non-linear
@@ -15,32 +15,44 @@
.. [1] T. Hastie, R. Tibshirani and J. Friedman, "Elements of Statistical
Learning Ed. 2", Springer, 2009.

.. [2] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost", 2009.
.. [2] J Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost",
Statistics and Its Interface, 2009.

"""

# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
# Noel Dawe <noel.dawe@gmail.com>
# %%
# Preparing the data and baseline models
# --------------------------------------
# We start by generating the binary classification dataset
# used in Hastie et al. 2009, Example 10.2.

# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
# Noel Dawe <noel.dawe@gmail.com>
#
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss
from sklearn.ensemble import AdaBoostClassifier

X, y = datasets.make_hastie_10_2(n_samples=12_000, random_state=1)

# %%
# Now, we set the hyperparameters for our AdaBoost classifiers.
# Be aware, a learning rate of 1.0 may not be optimal for both SAMME and SAMME.R

n_estimators = 400
# A learning rate of 1. may not be optimal for both SAMME and SAMME.R
learning_rate = 1.0

X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
# %%
# We split the data into a training and a test set.
# Then, we train our baseline classifiers, a `DecisionTreeClassifier` with `depth=9`
# and a "stump" `DecisionTreeClassifier` with `depth=1` and compute the test error.

X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=2_000, shuffle=False
)

dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(X_train, y_train)
@@ -50,6 +62,14 @@
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

# %%
# Adaboost with discrete SAMME and real SAMME.R
# ---------------------------------------------
# We now define the discrete and real AdaBoost classifiers
# and fit them to the training set.

from sklearn.ensemble import AdaBoostClassifier

Member: Can you add a marker # %% in l.78 to get a diagram for both models?

Contributor Author: Good catch. Yes
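A minimal sketch of what this request amounts to (not part of the diff; line placement is hypothetical): in sphinx-gallery notebook-style examples, a # %% marker starts a new cell, and the estimator returned by the last expression of a cell is rendered as an HTML diagram, so splitting the two fits into separate cells yields one diagram per model.

ada_discrete.fit(X_train, y_train)  # last expression of this cell -> diagram for the discrete model

# %%
ada_real.fit(X_train, y_train)  # new cell -> separate diagram for the real model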

ada_discrete = AdaBoostClassifier(
base_estimator=dt_stump,
learning_rate=learning_rate,
@@ -58,6 +78,8 @@
)
ada_discrete.fit(X_train, y_train)

# %%

ada_real = AdaBoostClassifier(
base_estimator=dt_stump,
learning_rate=learning_rate,
@@ -66,11 +88,13 @@
)
ada_real.fit(X_train, y_train)

fig = plt.figure()
ax = fig.add_subplot(111)
# %%
# Now, let's compute the test error of the discrete and
# real AdaBoost classifiers for each new stump in `n_estimators`
# added to the ensemble.

ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error")
ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error")
import numpy as np
from sklearn.metrics import zero_one_loss

ada_discrete_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(ada_discrete.staged_predict(X_test)):
@@ -88,36 +112,60 @@
for i, y_pred in enumerate(ada_real.staged_predict(X_train)):
ada_real_err_train[i] = zero_one_loss(y_pred, y_train)

# %%
# Plotting the results
# --------------------
# Finally, we plot the train and test errors of our baselines
# and of the discrete and real AdaBoost classifiers

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error")
ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error")

Member: In the figure below, we can remove the color argument each time. We can use the default colour that should be more suited to colourblindness.

Contributor Author: Good point!

Contributor Author: I have to improvise a bit here since the default colors include both red and green, which is the most frequent type of color-blindness according to matplotlib.

Member: Thanks for looking at this.
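A quick, self-contained sketch (not part of the diff, assuming seaborn is installed) of the palette picked below: it is a list of RGB tuples, and the indices used in the example (0, 1, 2, 4) simply select entries from that list.

import seaborn as sns

colors = sns.color_palette("colorblind")  # qualitative palette designed for color-vision deficiency
print(len(colors))   # 10 entries by default
print(colors[0])     # an RGB tuple, roughly (0.004, 0.451, 0.698)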

colors = sns.color_palette("colorblind")

ax.plot(
np.arange(n_estimators) + 1,
ada_discrete_err,
label="Discrete AdaBoost Test Error",
color="red",
color=colors[0],
)
ax.plot(
np.arange(n_estimators) + 1,
ada_discrete_err_train,
label="Discrete AdaBoost Train Error",
color="blue",
color=colors[1],
)
ax.plot(
np.arange(n_estimators) + 1,
ada_real_err,
label="Real AdaBoost Test Error",
color="orange",
color=colors[2],
)
ax.plot(
np.arange(n_estimators) + 1,
ada_real_err_train,
label="Real AdaBoost Train Error",
color="green",
color=colors[4],
)

ax.set_ylim((0.0, 0.5))
ax.set_xlabel("n_estimators")
ax.set_xlabel("Number of weak learners")
ax.set_ylabel("error rate")

leg = ax.legend(loc="upper right", fancybox=True)
leg.get_frame().set_alpha(0.7)
Member: In l.151, can you change ax.set_xlabel("n_estimators") by:

ax.set_xlabel("Number of weak learners")

Contributor Author: Sounds good, will do


plt.show()
# %%
#
# Concluding remarks
# ------------------
#
# We observe that the error rate for both train and test sets of real AdaBoost
# is lower than that of discrete AdaBoost.
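A minimal follow-up sketch (not part of the diff, and assuming the example's variables are still in scope) to back the concluding remark with numbers:

print(f"Discrete AdaBoost final test error: {ada_discrete_err[-1]:.3f}")
print(f"Real AdaBoost final test error:     {ada_real_err[-1]:.3f}")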