ENH partial_dependence plot for HistGradientBoosting estimator fitted with sample_weight
#25210
Comments
It is not clear to me why it would be hard to keep track of the `weighted_n_node_samples`.
@adrinjalali, would you mind explaining why you find it non-trivial to do? I would love to try to solve this issue if I can get a little push. :D |
IIRC it's because it'd increase the memory footprint of the model, as well as slow the model down. I'm not opposed to adding them, but adding them in #14696 would have meant requiring more benchmarks, and more tests, which wasn't ideal since that PR took quite a while and quite an effort to finish in the first place. It's certainly doable as a separate PR, and happy if you're up for pushing a PR for it, but be warned that it might be a non-trivial PR 😁 |
@vitaliset, have you started working on this? I would like to take this on as well. |
Feel free to jump in and take it @Andrew-Wang-IB45. I will not be able to start it in the next few weeks and would be super happy if you could come up with something in the meantime! Please take a look at the issues I mentioned in the description of this one (especially #24872) for discussions related to the solution. From my point of view, the solution involves two parts (maybe we should have two sequential PRs for easier reviews):
1 - Update the training function of the base estimator of HistGradientBoosting so it keeps track of the `weighted_n_node_samples`.
2 - Update the recursive partial dependence code path so it uses this information instead of raising the error.
What do you think? Does this approach make sense to you? Please reach out and ask for help if you need it! :) |
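For readers less familiar with the 'recursion' method, here is a minimal sketch of the idea being described above, using made-up node fields (`weighted_n_samples`, `left`, `right`, etc.), not scikit-learn's actual internal structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None       # None marks a leaf
    threshold: float = 0.0
    value: float = 0.0                  # prediction stored at a leaf
    weighted_n_samples: float = 0.0     # sum of training sample_weight reaching this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def partial_dependence_recursion(node, grid_point, pd_features):
    """Average the tree prediction at `grid_point` over the training
    distribution of the non-grid features, weighting by sample_weight mass."""
    if node.feature is None:
        return node.value
    if node.feature in pd_features:
        # The grid fixes this feature: follow the matching branch only.
        child = node.left if grid_point[node.feature] <= node.threshold else node.right
        return partial_dependence_recursion(child, grid_point, pd_features)
    # Feature not in the grid: average both children, weighted by how much
    # sample_weight reached each of them during training.
    total = node.weighted_n_samples
    return (
        node.left.weighted_n_samples / total
        * partial_dependence_recursion(node.left, grid_point, pd_features)
        + node.right.weighted_n_samples / total
        * partial_dependence_recursion(node.right, grid_point, pd_features)
    )

# Tiny example: the root splits on feature 1, which is not a grid feature,
# so its children are averaged with weights 3/4 and 1/4.
leaf_a = Node(value=1.0, weighted_n_samples=3.0)
leaf_b = Node(value=5.0, weighted_n_samples=1.0)
root = Node(feature=1, threshold=0.0, weighted_n_samples=4.0, left=leaf_a, right=leaf_b)
print(partial_dependence_recursion(root, {0: 2.5}, pd_features={0}))  # 2.0
```

This is why storing the per-node sums of `sample_weight` at fit time is the missing ingredient for part 2.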
Sounds good, @vitaliset. I will review the issues and pull requests that you have brought up. Your approach makes sense and I will follow that when implementing the solution. I'll let you know how things are coming along and when I need help. Thanks! |
I edited the last comment as the test I suggested does not make sense when training with `sample_weight`. Nonetheless, some small tests should be adapted once the fix is done. For instance:
|
Hi @vitaliset, I thought about an approach to implement your solution and would like some clarifications. I noticed how the BaseDecisionTree class handles this for the exact trees (scikit-learn/sklearn/tree/_classes.py, line 379 in 9e08ed2). My understanding is that for the BaseHistGradientBoosting class, the corresponding logic lives here: scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py, lines 673 to 689 in 0a36bd8. Within the TreeGrower, we should be using the same kind of calculation as in scikit-learn/sklearn/tree/_criterion.pyx, lines 286 to 316 in 0a36bd8.
The above calculation should be the one used to compute the `weighted_n_node_samples`. Once this is done, we can simply change all references to the node counts used in the recursive partial dependence (scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx, lines 238 to 240 in 0a36bd8) and remove the NotImplementedError from the recursive version of compute_partial_dependence since this is now supported.
What are your thoughts on my approach? If this works, I was thinking we should split this into two pull requests, where the first one changes the underlying TreeNode to incorporate the `weighted_n_node_samples` and the second one updates the partial dependence computation to use it. Thanks for your time! |
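To make the proposed bookkeeping concrete, here is a toy sketch of the quantity that would have to be tracked at each split; the helper and its arguments are illustrative and not part of the TreeGrower API:

```python
import numpy as np

def weighted_counts_after_split(X_col, threshold, sample_weight, sample_indices):
    """Toy illustration: the weighted 'count' of a node and of its two
    children once the node is split on one feature at `threshold`."""
    sw = sample_weight[sample_indices]
    node_weight = sw.sum()                    # would become weighted_n_node_samples of the parent
    goes_left = X_col[sample_indices] <= threshold
    left_weight = sw[goes_left].sum()         # ... of the left child
    right_weight = node_weight - left_weight  # ... of the right child
    return node_weight, left_weight, right_weight

rng = np.random.RandomState(0)
X_col = rng.normal(size=20)
sample_weight = rng.exponential(size=20)
print(weighted_counts_after_split(X_col, 0.0, sample_weight, np.arange(20)))
```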
From my shallow understanding of the BaseHistGradientBoosting implementation, this looks about right! Awesome job unraveling the details! :D For benchmarking, scikit-learn uses airspeed velocity to monitor performance, and I think this PR should run the benchmarks for the HistGradientBoosting from there. You can follow this tutorial on how to use it to compare your branch versus the main branch. But I also think that some quick `timeit` measurements can already give a rough idea:

```python
from timeit import timeit

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

def fit_predict_hist(n_samples):
    # Fit and predict so both code paths are included in the timing.
    X, y = make_classification(n_samples=n_samples)
    hgbc = HistGradientBoostingClassifier(random_state=42).fit(X, y)
    probs = hgbc.predict_proba(X)

n_samples_list = [100, 1000, 5000]
times = [timeit(lambda: fit_predict_hist(n_samples), number=10) for n_samples in n_samples_list]
print(times)
>>> [2.026070300000015, 12.3248289, 15.59315509999999]
```

This is just a draft, and you can certainly adjust it (for instance, the values of `n_samples` or the number of repetitions).
You can save it and then plot the main branch versus your branch to compare the change:

```python
from sklearn import __version__
import joblib

joblib.dump(times, f"time_list_{__version__}")
```

Update: actually, using `repeat` instead of `timeit` also gives you an idea of the variance between runs:

```python
from timeit import repeat
import numpy as np

...
times = [repeat(lambda: fit_predict_hist(n_samples), repeat=5, number=10) for n_samples in n_samples_list]
np.array(times).mean(axis=1)
>>> array([ 1.92954972, 11.46854726, 13.89775882])
np.array(times).std(axis=1)
>>> array([0.01262523, 0.30100737, 0.16548444])
```
|
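As a usage note on the saving idea above, here is one possible way to compare the two dumped lists afterwards, assuming both were saved from the `repeat` variant; the file names are hypothetical and depend on whatever `__version__` was in each environment:

```python
import joblib
import numpy as np

# Hypothetical file names: one dump made on main, one on the feature branch.
times_main = np.array(joblib.load("time_list_1.2.0"))
times_branch = np.array(joblib.load("time_list_1.3.dev0"))

# Ratios > 1 mean the branch is slower than main for that n_samples.
print(times_branch.mean(axis=1) / times_main.mean(axis=1))
```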
Hi, thanks for your feedback and suggestions! I will get to it and keep you updated. |
Hi @vitaliset, I have updated the BaseHistGradientBoosting fit function and the Grower class to keep track of the sums of the sample weights per node, following how the exact trees do it:
- scikit-learn/sklearn/tree/_classes.py, line 143 in bf03a63
- scikit-learn/sklearn/tree/_classes.py, lines 316 to 320 in bf03a63
- scikit-learn/sklearn/tree/_tree.pyx, lines 221 to 224 in bf03a63

While I think that it is good to include this in the tree for the Grower, doing so would require changing the signature of the init function for BaseHistGradientBoosting. I am wondering if we should change it now for completeness or leave it as is to avoid changing the public interface too much. Regardless, I will do some tests to check for its correctness and performance. Do you have a small example of the expected outputs of the BaseHistGradientBoosting with and without sample weights? |
Hello @Andrew-Wang-IB45! If `sample_weight` is not passed, the stored weighted counts should simply match the number of samples in each node, so the two cases are easy to compare. Regarding tests, I think you can follow the logic here to see if it is doing what it is supposed to do:

```python
import pandas as pd
import numpy as np

from sklearn import __version__
print("sklearn version:", __version__)
>>> sklearn version: 1.1.3

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1000, weights=(0.9,), random_state=42)
sample_weight = np.random.RandomState(42).exponential(scale=1 + 4*y)

hgbc = (
    HistGradientBoostingClassifier(random_state=42, max_depth=1, max_iter=1)
    .fit(X, y, sample_weight)
)
pd.DataFrame(hgbc._predictors[0][0].nodes)
```

```python
aux = pd.DataFrame(hgbc._predictors[0][0].nodes).loc[0]
feat_idx = int(aux.feature_idx)
num_tresh = aux.num_threshold

sample_weight[X[:, feat_idx] >= num_tresh].sum(), sample_weight[X[:, feat_idx] < num_tresh].sum()
>>> (456.6039224516335, 897.213567789653)
```

These numbers should be the `weighted_n_node_samples` of the two leaves (the two children of the root) of this depth-1 tree. |
Hi @vitaliset, as you can see from my draft pull request, I have added support for using the user-inputted sample weights to compute and keep track of the `weighted_n_node_samples`. |
Hello @Andrew-Wang-IB45, the first test I would implement would be similar to the code I gave here. I don't know if I was clear, but you can, for instance, assert that the numbers you get by summing the sample weights on each side of the root split match the values stored in the nodes:

```python
sample_weight[X[:, feat_idx] >= num_tresh].sum(), sample_weight[X[:, feat_idx] < num_tresh].sum()
>>> (456.6039224516335, 897.213567789653)
```

In this example, one leaf should have a weighted count of about 456.60 and the other of about 897.21. Another interesting test might be having only two unique values for `sample_weight` (for instance, zeros and ones) and checking that the weighted counts match the unweighted counts you get by simply dropping the zero-weight samples. Once you are comfortable with the code, even if you think you still need extra tests, ask a core contributor to review the PR, and they will suggest additional tests! :D |
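A rough sketch of how that suggestion could look as an actual test, assuming the new per-node quantity ends up exposed on the predictor nodes under a field such as `weighted_n_node_samples` (the field name is hypothetical here, pending the PR):

```python
import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

def test_root_children_weighted_counts():
    # A depth-1, single-iteration model: the root split defines exactly two leaves.
    X, y = make_classification(n_samples=1000, weights=(0.9,), random_state=42)
    sample_weight = np.random.RandomState(42).exponential(scale=1 + 4 * y)
    hgbc = HistGradientBoostingClassifier(
        random_state=42, max_depth=1, max_iter=1
    ).fit(X, y, sample_weight)

    nodes = hgbc._predictors[0][0].nodes
    root = nodes[0]
    feat_idx = int(root["feature_idx"])
    threshold = root["num_threshold"]

    left_sum = sample_weight[X[:, feat_idx] <= threshold].sum()
    right_sum = sample_weight[X[:, feat_idx] > threshold].sum()

    # "weighted_n_node_samples" is the hypothetical field the fix would add to the nodes.
    leaf_weights = sorted(nodes[1:]["weighted_n_node_samples"])
    assert leaf_weights == pytest.approx(sorted([left_sum, right_sum]))
```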
Hi @vitaliset, thanks for your suggestions! I have incorporated them into my branch, cleaned it up, and made it available for review. Out of curiosity, how do I request a reviewer for a pull request? |
Hi @vitaliset. For the pull request to incorporate the `weighted_n_node_samples`, do you have any suggestions on how I should proceed? |
Hi @Andrew-Wang-IB45. Simplifying and taking a smaller step might be a way. I suggested on your PR not to do extra calculations beyond the essential ones to store the `weighted_n_node_samples`. |
Hi @vitaliset, thanks for your suggestions. I am currently removing the extraneous code so that it only computes the `weighted_n_node_samples`. |
Describe the workflow you want to enable
As partial dependence of a model at a point is defined as an expectation, it should respect `sample_weight` if someone wishes to use it (for instance, when you know your `X` does not follow the distribution you are interested in). #25209 tries to solve this for `method='brute'` when you have new `X`. For older tree-based models trained with sample weights, `method='recursion'` keeps track of the training `sample_weight` and calculates the partial dependence with that taken into consideration.

But, as discussed during the implementation of `sample_weight` on the HistGradientBoosting estimators (#14696 (comment)), these models store an attribute `_fitted_with_sw`, and when partial dependence with recursion is requested, they throw an error: scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py, lines 1142 to 1148 in 205f3b7.
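For context, this is roughly how the limitation surfaces on releases where the check is in place (arbitrary toy data; only the raised NotImplementedError matters):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
sample_weight = np.random.RandomState(0).exponential(size=200)

est = HistGradientBoostingRegressor(random_state=0).fit(X, y, sample_weight=sample_weight)

# The estimator remembers it was fitted with sample weights (_fitted_with_sw)
# and refuses the fast 'recursion' method instead of computing a weighted result.
try:
    partial_dependence(est, X, features=[0], method="recursion")
except NotImplementedError as exc:
    print(exc)
```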
Describe your proposed solution
As discussed in #24872 (comment), the big difference between other tree-based algorithms and HistGradientBoosting is that HistGradientBoosting does not save the `weighted_n_node_samples` when building the tree.
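For comparison, the exact-split trees already expose this quantity through the public `tree_.weighted_n_node_samples` attribute, which is what `method='recursion'` relies on:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
sample_weight = np.random.RandomState(0).exponential(size=100)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    X, y, sample_weight=sample_weight
)

# Sum of sample_weight reaching each node; the root entry equals the total weight.
print(tree.tree_.weighted_n_node_samples)
print(sample_weight.sum())
```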
Describe alternatives you've considered, if relevant
No response
Additional context
No response