DOC Add example showcasing HGBT regression #26991
Conversation
A first iteration.
# %%
# Notice energy transfer increases systematically during weekends.
#
# Effect of number of trees in HistGradientBoostingRegressor
Maybe we should change this section to highlight early stopping and finding a good pair of (max_iter, learning_rate).
I consider this important, as it is shown nowhere else and is usually the first step in training an HGBT.
If you need help here, just say so.
> Maybe we should change this section to highlight early stopping and finding a good pair of (max_iter, learning_rate).

I guess you mean doing a grid search over those two parameters and exploring the resulting n_iter_ attribute, right? Or what do you have in mind?
Grid search is overkill, and not even that helpful. Just choose one learning rate, fit the HGBT on the training set, and determine the best max_iter via early stopping with an explicit validation_fraction. From there on, use this max_iter and don't use early stopping anymore.
Once this PR and #27124 are merged, we can add cross-validation for this step.
I ran into the issue reported in #25460: early_stopping uses shuffle=True for the internal validation split, which is not the right thing to do when dealing with time series, as in this example.
Exactly. But for the time being, it is better to "learn" max_iter with the current implementation and early stopping than to rely on the defaults.
After all, max_iter and the learning rate are the most important parameters.
I would just add a comment that this is not optimal for time series at the moment.
Addressed in 9a486b8.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
I think I have addressed all the comments. Please let me know otherwise.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
This already looks good to me. I'm just wondering why we don't show the automatic way of dealing with categorical data. The variable day is actually a "category" variable. I wonder if we could simply modify the code and mention that we now have this way of detecting and encoding categorical variables internally.
_ = ax.legend()

# %%
# With just a few iterations, HGBT models can achieve convergence (see
I think we need a sentence describing what we see in the figure: 5 iterations are not enough to predict well, but at 50 we already do a good job.
> The variable day is actually a "category" variable.

The thing is that day is already ordinally encoded, so I don't think we can demo categorical variable support here.
It does not really matter: since this is a category column, the tree will treat it as such.
A potential thing we could do is replace the integers with the day names as strings at the beginning. I don't know if that eventually removes some boilerplate when plotting, because we would already have the names instead of remapping the integers to the days.
That said, I would be happy to merge this PR as-is, because it is a net improvement, and see if we can further improve it in a later increment.
Should I push my own green button?
No. @glemaitre, you did the latest review. Fine to merge?
Yes, fine to merge. @lorentzenchr, do you have further changes? (Just asking because the status is not approved on your side.)
I approved a month ago, and then you requested a review from me.
Whoops, my bad, I probably misclicked. Sorry.
(Sent by email on 22 Feb 2024, in reply to the notification from Christian Lorentzen:)
Merged #26991 into main.
Very happy about this! Congratulations to everyone involved. Nitpick: I would remove "advanced" from the title, as I think that many people should look at this example, not only advanced people. If y'all agree, I'll send a PR that does nothing else than remove this word 😀 😀
@GaelVaroquaux Please go on, and ping me for a quick review if you'd like.
Here it is: #28508. I was not true to my word :$ I removed a tiny bit more than "advanced" :) Thanks!!
Reference Issues/PRs
Fixes #26826. See also #21967 and #23746 on missing values documentation.
What does this implement/fix? Explain your changes.
This PR adds an example to:
Any other comments?
The original issue also suggests demoing support for categorical values, but we already have the Categorical Feature Support in Gradient Boosting example, which is only linked from the present example, as it is a very good example in itself.
We also have a Monotonic Constraints example, but it can be merged with the example from this PR.