-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py #23184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC use notebook-style in ensemble/plot_adaboost_hastie_10_2.py #23184
Conversation
Any ideas on why Windows py38_conda_forge_mkl ##[error]The job running on agent Azure Pipelines 13 ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134 I could not find more information on this when I googled, is this a common issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR! I have a few suggestions that I believe may help. Please let me know what you think.
Any ideas on why scikit-learn.scikit-learn is failing?
I believe this has now been addressed (see #23185)
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>, | ||
# Noel Dawe <noel.dawe@gmail.com> | ||
# | ||
# License: BSD 3 clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the reworked notebook examples I have seen tend to keep the authors in this location as their own cell (i.e. keep the authors here and add # %%
on line 21). I am not strongly opinionated on this personally, but it may be easier to follow that convention for consistency/simplicity.
To clarify: I think what you have done here is fine, but perhaps someone else can weigh in on this point if they feel strongly one way or the other.
Edit: I may be misremembering about adding # %%
on the previous line. I think many simply leave the commented authors near the top unaltered.
# Hastie et al. (2009) example 10.2 | ||
# --------------------------------------------------- | ||
# We start by generating the binary classification dataset | ||
# used in Hastie et al. 2009, Example 10.2. | ||
|
||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be nicer to move each import to the top of the cell in which it is first used. This is also in keeping with many of the other reworked notebook examples.
Hi @jsilke and thank you for the great and really quick feedback. I think your recommendations make sense and I will implement most of them verbatim. On the topic of the authors I am still not quite sure what the best way to handle this is. I think this needs further discussion. Maybe other contributors/reviewers want to chime in here:
What do you think? Separate topic about the citation in the heading "docstring": |
make hyphens under heading consistent Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
change results plotting into its own section Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
I investigated the citation problem. We have four choices that I found so far:
I am in favor of presenting it in the correct way and disable flake8 for that citation line. What do you think? EDIT: scratch option 2, I can disable flake8 but that |
…rnal name to citation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore I vote for removing the
.
, which hurts at least my eyes the least
I think this is a sensible solution and I agree that it looks better than the original. Thank you for taking the time to explore some options here!
I realize now that my comment regarding the consistency of underline length was not written clearly, my apologies. What I meant to say was that the number of -
characters should match the number of characters in the heading for each heading. Please see my suggested changes.
I have also added a couple of other comments. Please take a look when you get the chance and let me know what you think. Apart from that I think the example looks great, and thank you again for your effort here!
# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>, | ||
# Noel Dawe <noel.dawe@gmail.com> | ||
# | ||
# License: BSD 3 clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:
-29# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
-30# Noel Dawe <noel.dawe@gmail.com>
-31#
-32# License: BSD 3 clause
+23# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>,
+24# Noel Dawe <noel.dawe@gmail.com>
+25#
+26# License: BSD 3 clause
+27
+28# %%
+29# Preparing the data and baseline models
If you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jsilke and thanks for your feedback, I appreciate it :)
Now I understand what you meant with your comment about the headings! I interpreted it as a common length instead. Updated accordingly. Your point about lower error rate is a good one. Yes indeed, we should be more literal here. Even though the test error looks fine, the training error is almost 0 and we can state it as it is.
I will change the authors like you suggested, consistency is more important here than actual placement. If we all go forward with this kind of placement, then that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be ultimately a minor detail but, if I recall correctly, to be consistent this block should be moved above this cell. Something to the effect of:
-29# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>, -30# Noel Dawe <noel.dawe@gmail.com> -31# -32# License: BSD 3 clause +23# Authors: Peter Prettenhofer <peter.prettenhofer@gmail.com>, +24# Noel Dawe <noel.dawe@gmail.com> +25# +26# License: BSD 3 clause +27 +28# %% +29# Preparing the data and baseline modelsIf you did notice many notebooks that were rendered in the same manner as what you have here, then I think the point is fine to disregard.
Followed up on this.
These examples have authors as comments in the same cell above the first code:
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html#sphx-glr-auto-examples-ensemble-plot-stack-predictors-py
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-early-stopping-py
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#sphx-glr-auto-examples-ensemble-plot-feature-transformation-py
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py
At least in the notebook-style examples in /ensemble/
I have never seen the author cell being completely separate. I think we we should not diverge from that style-choice for the sake of consistency. See my current commit on this. What do you think?
…nto doc_ensemble_plot_adaboost_hastie_10_2
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
…ub.com/svenstehle/scikit-learn into doc_ensemble_plot_adaboost_hastie_10_2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this example.
I propose a couple of enhancements before merging this example.
Otherwise LGTM.
# and fit them to the training set. | ||
|
||
from sklearn.ensemble import AdaBoostClassifier | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a marker # %%
in l.78 to get a diagram for both models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Yes
|
||
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X, y = datasets.make_hastie_10_2(n_samples=12000, random_state=1) | |
from sklearn.model_selection import train_test_split | |
X, y = datasets.make_hastie_10_2(n_samples=12_000, random_state=1) | |
X_train, X_test, y_train, y_test = train_test_split( | |
X, y, test_size=2_000, shuffle=False | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the train_test_split to the actual split part down below though. The loading of the dataset remains in its own cell. I actually like the flow of that a bit better for the example.
X_test, y_test = X[2000:], y[2000:] | ||
X_train, y_train = X[:2000], y[:2000] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove these 2 lines and use the train_test_split
function as suggested above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, it's a good update to the example code.
|
||
ax.plot([1, n_estimators], [dt_stump_err] * 2, "k-", label="Decision Stump Error") | ||
ax.plot([1, n_estimators], [dt_err] * 2, "k--", label="Decision Tree Error") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the figure below, we can remove the color
argument each time.
We can use the default colour that should be more suited to colourblindness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to improvise a bit here since the default colors include both red and green, which is the most frequent type of color-blindness according to matplotlib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at this.
@@ -121,3 +155,11 @@ | |||
leg.get_frame().set_alpha(0.7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In l.151, can you change ax.set_xlabel("n_estimators")
by:
ax.set_xlabel("Number of weak learners")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, will do
So lots of thanks for your quick and valuable review and feeback @jsilke and @glemaitre :) I think that updates not only the layout of the example but also improves the presentation of the content. Tell me what you think of the current version. |
…nto doc_ensemble_plot_adaboost_hastie_10_2
LGTM. Thanks @svenstehle @jsilke for the changes and reviews. |
…it-learn#23184) Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
…it-learn#23184) Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Co-authored-by: Jordan Silke <51223540+jsilke@users.noreply.github.com>
Reference Issues/PRs
Updates
ensemble/plot_adaboost_hastie_10_2.py
For Issue #22406 Fix notebook-style examples
What does this implement/fix? Explain your changes.
Updated the example plot_adaboost_hastie_10_2.py to notebook style.
10
where there should not be:[2]10 Zhu, H.
Any other comments?
Happy to receive feedback and implement improvements. I think notebook-style is an improvement.