8000 DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py by svenstehle · Pull Request #23184 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py #23184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

svenstehle
Copy link
Contributor
@svenstehle svenstehle commented Apr 22, 2022

Reference Issues/PRs

Updates ensemble/plot_adaboost_hastie_10_2.py
For Issue #22406 Fix notebook-style examples

What does this implement/fix? Explain your changes.

Updated the example plot_adaboost_hastie_10_2.py to notebook style.

  • Changed the order of plots and added new text.
  • I am unhappy with the way the citation is displayed though. For me, both in the original and in my PR, it looks cramped and there appears to be a 10 where there should not be: [2]10 Zhu, H.
  • I moved the authors to the bottom. Do we remove them altogether; format them differently; move them somewhere else?

Any other comments?

Happy to receive feedback and implement improvements. I think notebook-style is an improvement.

@svenstehle
Copy link
Contributor Author

Any ideas on why scikit-learn.scikit-learn is failing? Following the details and the link results in:

Windows py38_conda_forge_mkl

View raw log

##[error]The job running on agent Azure Pipelines 13 ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134

I could not find more information on this when I googled, is this a common issue?

Copy link
Contributor
@jsilke jsilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! I have a few suggestions that I believe may help. Please let me know what you think.

Any ideas on why scikit-learn.scikit-learn is failing?

I believe this has now been addressed (see #23185)

Comment on lines 22 to 25
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
# Noel Dawe <noel.dawe@gmail.com>
#
# License: BSD 3 clause
Copy link
Contributor
@jsilke jsilke Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the reworked notebook examples I have seen tend to keep the authors in this location as their own cell (i.e. keep the authors here and add # %% on line 21). I am not strongly opinionated on this personally, but it may be easier to follow that convention for consistency/simplicity.

To clarify: I think what you have done here is fine, but perhaps someone else can weigh in on this point if they feel strongly one way or the other.

Edit: I may be misremembering about adding # %% on the previous line. I think many simply leave the commented authors near the top unaltered.

# Hastie et al. (2009) example 10.2
# ---------------------------------------------------
# We start by generating the binary classification dataset
# used in Hastie et al. 2009, Example 10.2.

import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be nicer to move each import to the top of the cell in which it is first used. This is also in keeping with many of the other reworked notebook examples.

@svenstehle
Copy link
Contributor Author

Hi @jsilke and thank you for the great and really quick feedback. I think your recommendations make sense and I will implement most of them verbatim.

On the topic of the authors I am still not quite sure what the best way to handle this is. I think this needs further discussion. Maybe other contributors/reviewers want to chime in here:

  • I think you are right on the position. When they are featured in an example, which is the case in roughly half of the examples in /ensemble/ that I skimmed through right now, they are put into the import cell on the top of the notebook-style example
  • However, in the other half of examples, usually in notebook-style examples, they have been dropped altogether. It is rare to see them in a notebook-style example
    -My personal opinion is that we should keep this consistent. Either feature authors/contributors in every example or drop them. After all, these PRs are about making things consistent.
  • Authors/contributors of a certain line can be looked up at any time in the code using e.g. git blame and VSCode in-line annotations on mouse-over.
  • Maybe we should raise this topic in the general thread... or open a separate issue about the author-topic and not discuss this here to reduce complexity? If authors are feature or not is not relevant for the quality of the example after all... So I am good with it either way

What do you think?

Separate topic about the citation in the heading "docstring":
See this example for the bad formatting that is also happening in our case here. What can we do about it?
image

svenstehle and others added 5 commits April 23, 2022 12:27
make hyphens under heading consistent

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
change results plotting into its own section

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
@svenstehle
Copy link
Contributor Author
svenstehle commented Apr 23, 2022

I investigated the citation problem. We have four choices that I found so far:

  • Leave as is --> J. Zhu gets converted to 10 Zhu
  • Escape the . --> J\. Zhu gets converted to J. Zhu, which looks correct; but: flake8 is saying we are not allowed to escape a ., thus I have to add # noqa to the line to ignore that flake8 error
  • Insert a space after J --> J . Zhu
  • remove the . altogether --> J Zhu

I am in favor of presenting it in the correct way and disable flake8 for that citation line. What do you think?

EDIT: scratch option 2, I can disable flake8 but that # noqa just shows up in the citation which is even worse :D
Therefore I vote for removing the ., which hurts at least my eyes the least:

image

Copy link
Contributor
@jsilke jsilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore I vote for removing the ., which hurts at least my eyes the least

I think this is a sensible solution and I agree that it looks better than the original. Thank you for taking the time to explore some options here!

I realize now that my comment regarding the consistency of underline length was not written clearly, my apologies. What I meant to say was that the number of - characters should match the number of characters in the heading for each heading. Please see my suggested changes.

I have also added a couple of other comments. Please take a look when you get the chance and let me know what you think. Apart from that I think the example looks great, and thank you again for your effort here!

Comment on lines 29 to 32
# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
# Noel Dawe <noel.dawe@gmail.com>
#
# License: BSD 3 clause
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:

-29# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-30# Noel Dawe <noel.dawe@gmail.com>
-31#
-32# License: BSD 3 clause

+23# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+24# Noel Dawe <noel.dawe@gmail.com>
+25#
+26# License: BSD 3 clause
+27
+28# %%
+29# Preparing the data and baseline models

If you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jsilke and thanks for your feedback, I appreciate it :)

Now I understand what you meant with your comment about the headings! I interpreted it as a common length instead. Updated accordingly. Your point about lower error rate is a good one. Yes indeed, we should be more literal here. Even though the test error looks fine, the training error is almost 0 and we can state it as it is.

I will change the authors like you suggested, consistency is more important here than actual placement. If we all go forward with this kind of placement, then that is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:

-29# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-30# Noel Dawe <noel.dawe@gmail.com>
-31#
-32# License: BSD 3 clause

+23# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+24# Noel Dawe <noel.dawe@gmail.com>
+25#
+26# License: BSD 3 clause
+27
+28# %%
+29# Preparing the data and baseline models

If you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.

Followed up on this.

These examples have authors as comments in the same cell above the first code:

At least in the notebook-style examples in /ensemble/ I have never seen the author cell being completely separate. I think we we should not diverge from that style-choice for the sake of consistency. See my current commit on this. What do you think?

svenstehle and others added 9 commits April 26, 2022 18:42
…nto doc_ensemble_plot_adaboost_hastie_10_2
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Copy link
Contributor
@jsilke jsilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@glemaitre glemaitre self-requested a review April 26, 2022 21:04
Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this example.
I propose a couple of enhancements before merging this example.

Otherwise LGTM.

# and fit them to the training set.

from sklearn.ensemble import AdaBoostClassifier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a marker # %% in l.78 to get a diagram for both models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Yes

Comment on lines 35 to 36

X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1)
from sklearn.model_selection import train_test_split
X, y = datasets.make_hastie_10_2(n_samples=12_000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=2_000, shuffle=False
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the train_test_split to the actual split part down below though. The loading of the dataset remains in its own cell. I actually like the flow of that a bit better for the example.

Comment on lines 52 to 53
X_test, y_test = X[2000:], y[2000:]
X_train, y_train = X[:2000], y[:2000]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove these 2 lines and use the train_test_split function as suggested above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, it's a good update to the example code.


ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error")
ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the figure below, we can remove the color argument each time.
We can use the default colour that should be more suited to colourblindness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to improvise a bit here since the default colors include both red and green, which is the most frequent type of color-blindness according to matplotlib

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this.

@@ -121,3 +155,11 @@
leg.get_frame().set_alpha(0.7)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In l.151, can you change ax.set_xlabel("n_estimators") by:

ax.set_xlabel("Number of weak learners")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will do

@lesteve lesteve added the Quick Review For PRs that are quick to review label Apr 27, 2022
@svenstehle
Copy link
Contributor Author

So lots of thanks for your quick and valuable review and feeback @jsilke and @glemaitre :)

I think that updates not only the layout of the example but also improves the presentation of the content. Tell me what you think of the current version.

@glemaitre glemaitre merged commit 0a07517 into scikit-learn:main Apr 29, 2022
@glemaitre
Copy link
Member

LGTM. Thanks @svenstehle @jsilke for the changes and reviews.
Merging.

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…it-learn#23184)

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
@svenstehle svenstehle deleted the doc_ensemble_plot_adaboost_hastie_10_2 branch May 16, 2022 19:02
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request May 19, 2022
…it-learn#23184)

Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
glemaitre pushed a commit that referenced this pull request May 19, 2022
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Quick Review For PRs that are quick to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0