DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py #23184

svenstehle · 2022-04-22T08:34:53Z

Reference Issues/PRs

Updates ensemble/plot_adaboost_hastie_10_2.py
For Issue #22406 Fix notebook-style examples

What does this implement/fix? Explain your changes.

Updated the example plot_adaboost_hastie_10_2.py to notebook style.

Changed the order of plots and added new text.
I am unhappy with the way the citation is displayed though. For me, both in the original and in my PR, it looks cramped and there appears to be a 10 where there should not be: [2]10 Zhu, H.
I moved the authors to the bottom. Do we remove them altogether; format them differently; move them somewhere else?

Any other comments?

Happy to receive feedback and implement improvements. I think notebook-style is an improvement.

svenstehle · 2022-04-22T13:11:38Z

Any ideas on why scikit-learn.scikit-learn is failing? Following the details and the link results in:

Windows py38_conda_forge_mkl

##[error]The job running on agent Azure Pipelines 13 ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134

I could not find more information on this when I googled. Is this a common issue?

jsilke

Thank you for the PR! I have a few suggestions that I believe may help. Please let me know what you think.

Any ideas on why scikit-learn.scikit-learn is failing?

I believe this has now been addressed (see #23185)

jsilke · 2022-04-22T16:53:09Z

examples/ensemble/plot_adaboost_hastie_10_2.py

-# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-#         Noel Dawe <noel.dawe@gmail.com>
-#
-# License: BSD 3 clause


Most of the reworked notebook examples I have seen tend to keep the authors in this location as their own cell (i.e. keep the authors here and add # %% on line 21). I am not strongly opinionated on this personally, but it may be easier to follow that convention for consistency/simplicity.

To clarify: I think what you have done here is fine, but perhaps someone else can weigh in on this point if they feel strongly one way or the other.

Edit: I may be misremembering about adding # %% on the previous line. I think many simply leave the commented authors near the top unaltered.

jsilke · 2022-04-22T17:05:25Z

examples/ensemble/plot_adaboost_hastie_10_2.py

+# Hastie et al. (2009) example 10.2
+# ---------------------------------------------------
+# We start by generating the binary classification dataset
+# used in Hastie et al. 2009, Example 10.2.

 import numpy as np


It may be nicer to move each import to the top of the cell in which it is first used. This is also in keeping with many of the other reworked notebook examples.
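
As a rough illustration of this convention, the sketch below lays out a tiny notebook-style script in which each import sits at the top of the cell that first needs it. It uses only standard-library modules and placeholder data; the variable names and values are made up for illustration and are not taken from the example under review.

```python
# %%
# Generating the data
# -------------------
# `random` is first needed here, so it is imported at the top of this cell.
import random

rng = random.Random(1)  # fixed seed so the run is reproducible
samples = [rng.gauss(0.0, 1.0) for _ in range(100)]

# %%
# Summarising the data
# --------------------
# `statistics` is first needed here, so its import lives in this cell
# rather than in a single import block at the top of the file.
import statistics

mean = statistics.fmean(samples)
spread = statistics.stdev(samples)
print(f"mean={mean:.3f}, stdev={spread:.3f}")
```

The trade-off is that imports are no longer collected in one place, but a reader going through the rendered page sees each dependency next to the code that uses it.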

svenstehle · 2022-04-23T10:23:15Z

Hi @jsilke and thank you for the great and really quick feedback. I think your recommendations make sense and I will implement most of them verbatim.

On the topic of the authors I am still not quite sure what the best way to handle this is. I think this needs further discussion. Maybe other contributors/reviewers want to chime in here:

I think you are right on the position. When they are featured in an example, which is the case in roughly half of the examples in /ensemble/ that I skimmed through just now, they are put into the import cell at the top of the notebook-style example.
However, in the other half of the examples, usually the notebook-style ones, they have been dropped altogether. It is rare to see them in a notebook-style example.
My personal opinion is that we should keep this consistent. Either feature authors/contributors in every example or drop them. After all, these PRs are about making things consistent.
Authors/contributors of a certain line can be looked up at any time in the code using e.g. git blame and VSCode in-line annotations on mouse-over.
Maybe we should raise this topic in the general thread... or open a separate issue about the author topic and not discuss it here, to reduce complexity? Whether authors are featured or not is not relevant for the quality of the example after all... So I am good with it either way.

What do you think?

Separate topic about the citation in the heading "docstring":
See this example for the bad formatting that is also happening in our case here. What can we do about it?

make hyphens under heading consistent Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

change results plotting into its own section Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

svenstehle · 2022-04-23T11:10:04Z

I investigated the citation problem. We have four choices that I found so far:

1. Leave as is --> J. Zhu gets converted to 10 Zhu
2. Escape the . --> J\. Zhu gets converted to J. Zhu, which looks correct; but flake8 says we are not allowed to escape a ., so I would have to add # noqa to the line to ignore that flake8 error
3. Insert a space after J --> J . Zhu
4. Remove the . altogether --> J Zhu

I am in favor of presenting it in the correct way and disabling flake8 for that citation line. What do you think?

EDIT: scratch option 2, I can disable flake8 but that # noqa just shows up in the citation which is even worse :D
Therefore I vote for removing the ., which hurts at least my eyes the least.
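
One plausible explanation for the stray 10, which is not confirmed anywhere in this thread, is that reStructuredText treats a line starting with a single letter followed by a period as an alphabetic enumerated-list item, so a leading `J.` would be read as item number 10. The sketch below only illustrates that letter-to-ordinal arithmetic; `enumerator_ordinal` is a hypothetical helper, not part of docutils or Sphinx, and it ignores other enumerator styles such as Roman numerals.

```python
import re
from typing import Optional


def enumerator_ordinal(line: str) -> Optional[int]:
    """Return the ordinal a leading 'X. ' enumerator would stand for, if any."""
    match = re.match(r"([A-Za-z])\. ", line)
    if match is None:
        return None
    # 'A' maps to 1, 'B' to 2, ..., 'J' to 10.
    return ord(match.group(1).upper()) - ord("A") + 1


print(enumerator_ordinal("J. Zhu"))  # 10
print(enumerator_ordinal("J Zhu"))   # None
```

Under that assumption, dropping the period (option 4) works because the text no longer looks like a list enumerator.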

…rnal name to citation

jsilke

Therefore I vote for removing the ., which hurts at least my eyes the least

I think this is a sensible solution and I agree that it looks better than the original. Thank you for taking the time to explore some options here!

I realize now that my comment regarding the consistency of underline length was not written clearly, my apologies. What I meant to say was that the number of - characters should match the number of characters in the heading for each heading. Please see my suggested changes.

I have also added a couple of other comments. Please take a look when you get the chance and let me know what you think. Apart from that I think the example looks great, and thank you again for your effort here!
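
The underline rule described above can be checked mechanically. The sketch below is a minimal illustration; `underline_for` is a hypothetical helper name, and the heading string is only an example.

```python
def underline_for(heading: str, char: str = "-") -> str:
    """Return an underline with exactly as many characters as the heading."""
    return char * len(heading)


heading = "Preparing the data and baseline models"
# Print the heading and its underline as they would appear in a comment cell.
print(f"# {heading}")
print(f"# {underline_for(heading)}")
```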

jsilke · 2022-04-25T19:14:50Z

examples/ensemble/plot_adaboost_hastie_10_2.py

+# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+# Noel Dawe <noel.dawe@gmail.com>
 #
 # License: BSD 3 clause


This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:

-29 # Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-30 # Noel Dawe <noel.dawe@gmail.com>
-31 #
-32 # License: BSD 3 clause
+23 # Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+24 # Noel Dawe <noel.dawe@gmail.com>
+25 #
+26 # License: BSD 3 clause
+27
+28 # %%
+29 # Preparing the data and baseline models

If you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.

Hi @jsilke and thanks for your feedback, I appreciate it :)

Now I understand what you meant with your comment about the headings! I interpreted it as a common length instead. Updated accordingly. Your point about lower error rate is a good one. Yes indeed, we should be more literal here. Even though the test error looks fine, the training error is almost 0 and we can state it as it is.

I will change the authors as you suggested; consistency is more important here than actual placement. If we all go forward with this kind of placement, then that is fine.

This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:

-29 # Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-30 # Noel Dawe <noel.dawe@gmail.com>
-31 #
-32 # License: BSD 3 clause
+23 # Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+24 # Noel Dawe <noel.dawe@gmail.com>
+25 #
+26 # License: BSD 3 clause
+27
+28 # %%
+29 # Preparing the data and baseline models

If you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.

Followed up on this.

These examples have authors as comments in the same cell above the first code:

https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html#sphx-glr-auto-examples-ensemble-plot-stack-predictors-py

https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-early-stopping-py

https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py

https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py

At least in the notebook-style examples in /ensemble/ I have never seen the author cell being completely separate. I think we should not diverge from that style choice, for the sake of consistency. See my current commit on this. What do you think?

…nto doc_ensemble_plot_adaboost_hastie_10_2

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

…ub.com/svenstehle/scikit-learn into doc_ensemble_plot_adaboost_hastie_10_2

jsilke

LGTM!

glemaitre

Thanks for fixing this example.
I propose a couple of enhancements before merging this example.

Otherwise LGTM.

glemaitre · 2022-04-26T21:09:54Z

examples/ensemble/plot_adaboost_hastie_10_2.py

+# and fit them to the training set.
+
+from sklearn.ensemble import AdaBoostClassifier
+


Can you add a marker # %% in l.78 to get a diagram for both models?

Good catch. Yes
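
A minimal sketch of the idea behind this request, assuming the gallery renders the value of the last expression in each cell (for a fitted scikit-learn estimator, its diagram). The two decision trees, their depths, and the sample size are stand-ins chosen for brevity, not the models from the example under review.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.tree import DecisionTreeClassifier

X, y = make_hastie_10_2(n_samples=200, random_state=1)

# %%
# Without a marker between the two `fit` calls, only the second call would be
# the last expression of the cell. Ending this cell here keeps the first
# model as the cell's final expression.
shallow_tree = DecisionTreeClassifier(max_depth=1)
shallow_tree.fit(X, y)

# %%
# The second model gets its own cell, so it is again the last expression.
deeper_tree = DecisionTreeClassifier(max_depth=3)
deeper_tree.fit(X, y)
```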

glemaitre · 2022-04-26T21:12:44Z

examples/ensemble/plot_adaboost_hastie_10_2.py


+X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)


Suggested change

-X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
+from sklearn.model_selection import train_test_split
+X, y = datasets.make_hastie_10_2(n_samples=12_000, random_state=1)
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=2_000, shuffle=False
+)

I moved the train_test_split to the actual split part down below though. The loading of the dataset remains in its own cell. I actually like the flow of that a bit better for the example.

glemaitre · 2022-04-26T21:13:09Z

examples/ensemble/plot_adaboost_hastie_10_2.py

 X_test, y_test = X[2000:], y[2000:]
 X_train, y_train = X[:2000], y[:2000]


We can remove these 2 lines and use the train_test_split function as suggested above.

I like it, it's a good update to the example code.
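
To make the effect of this change concrete, the sketch below compares the two approaches on a stand-in array of 12,000 row indices (a placeholder, not the Hastie dataset). With `shuffle=False`, `train_test_split` keeps the rows in order and puts the last `test_size` rows in the test set, so the resulting sizes differ from the slices shown above; the thread does not say whether that difference was intended.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: each row holds its own index so the split is easy to read.
X = np.arange(12_000).reshape(-1, 1)
y = np.zeros(12_000)

# Manual slicing, as in the two lines proposed for removal.
X_train_old, X_test_old = X[:2000], X[2000:]

# The suggested replacement: ordered split with the last 2,000 rows as test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2_000, shuffle=False
)

print(X_train_old.shape, X_test_old.shape)  # (2000, 1) (10000, 1)
print(X_train.shape, X_test.shape)          # (10000, 1) (2000, 1)
print(X_test[0, 0])                         # 10000
```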

glemaitre · 2022-04-26T21:15:13Z

examples/ensemble/plot_adaboost_hastie_10_2.py

+
+ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error")
+ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error")
+


In the figure below, we can remove the color argument each time.
We can use the default colour that should be more suited to colourblindness.

Good point!

I have to improvise a bit here, since the default colors include both red and green, and red-green is the most frequent type of color-blindness according to matplotlib.

Thanks for looking at this.
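
One way to approach this, sketched below under stated assumptions, is to avoid relying on color alone: pair each color with a distinct line style so the curves stay distinguishable even if the colors are not. The blue/orange pairing and the two curves are placeholders invented for this sketch; they are not the colors or results of the merged example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

n_estimators = 400
xs = list(range(1, n_estimators + 1))
# Placeholder curves, made up purely so there is something to draw.
curve_a = [0.5 / (1 + 0.02 * i) for i in xs]
curve_b = [0.5 / (1 + 0.01 * i) for i in xs]

fig, ax = plt.subplots()
# Each line differs in both color and line style.
ax.plot(xs, curve_a, color="blue", linestyle="-", label="Model A (placeholder)")
ax.plot(xs, curve_b, color="orange", linestyle="--", label="Model B (placeholder)")
ax.set_ylabel("Error rate (placeholder values)")
ax.legend()
```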

glemaitre · 2022-04-26T21:15:58Z

examples/ensemble/plot_adaboost_hastie_10_2.py

@@ -121,3 +155,11 @@
 leg.get_frame().set_alpha(0.7)


In l.151, can you replace ax.set_xlabel("n_estimators") with:

ax.set_xlabel("Number of weak learners")

Sounds good, will do

svenstehle · 2022-04-27T19:29:40Z

Many thanks for your quick and valuable review and feedback @jsilke and @glemaitre :)

I think this not only updates the layout of the example but also improves the presentation of the content. Tell me what you think of the current version.

…nto doc_ensemble_plot_adaboost_hastie_10_2

glemaitre · 2022-04-29T09:46:04Z

LGTM. Thanks @svenstehle @jsilke for the changes and reviews.
Merging.

…it-learn#23184) Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

fix example plot_adaboost_hastie_10_2 to notebook-style

23251ad

github-actions bot added the Documentation label Apr 22, 2022

svenstehle added 2 commits April 22, 2022 11:38

move authors to bottom and add concluding remarks

fbe5421

change authors formatting

9686f2f

jsilke reviewed Apr 22, 2022

svenstehle and others added 5 commits April 23, 2022 12:27

Update examples/ensemble/plot_adaboost_hastie_10_2.py

8e42548

make hyphens under heading consistent Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Update examples/ensemble/plot_adaboost_hastie_10_2.py

c2b08bc

change results plotting into its own section Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

make headings consistent

c5b5f30

move imports into cells of first use

3977ed1

move authors to top cell as comments

1142b97

remove '.' from citation so it does not concert 'J.' to '10'; add jou…

ede0d30

…rnal name to citation

jsilke reviewed Apr 25, 2022

svenstehle and others added 9 commits April 26, 2022 18:42

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

7703a6d

…nto doc_ensemble_plot_adaboost_hastie_10_2

Update examples/ensemble/plot_adaboost_hastie_10_2.py

6831eaa

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Update examples/ensemble/plot_adaboost_hastie_10_2.py

8e5d069

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Update examples/ensemble/plot_adaboost_hastie_10_2.py

eee33be

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Update examples/ensemble/plot_adaboost_hastie_10_2.py

2400509

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

Update examples/ensemble/plot_adaboost_hastie_10_2.py

0374455

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

merging main changes into branch

016b862

Merge branch 'doc_ensemble_plot_adaboost_hastie_10_2' of https://gith…

35ad84e

…ub.com/svenstehle/scikit-learn into doc_ensemble_plot_adaboost_hastie_10_2

fix authors indentation

e1c3418

jsilke approved these changes Apr 26, 2022

remove blank line

8a11fbb

glemaitre self-requested a review April 26, 2022 21:04

glemaitre reviewed Apr 26, 2022

lesteve added the Quick Review For PRs that are quick to review label Apr 27, 2022

svenstehle added 3 commits April 27, 2022 21:31

update example for colorblindness and with train_test_split

712753d

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

8e6f491

…nto doc_ensemble_plot_adaboost_hastie_10_2

fix blank line

3b5ec4d

glemaitre merged commit 0a07517 into scikit-learn:main Apr 29, 2022

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022

DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py (scik…

dff3601

…it-learn#23184) Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

svenstehle deleted the doc_ensemble_plot_adaboost_hastie_10_2 branch May 16, 2022 19:02

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request May 19, 2022

DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py (scik…

7f65b74

…it-learn#23184) Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>

glemaitre pushed a commit that referenced this pull request May 19, 2022

DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py (#23184)

138ae00

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
