-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Tree Optimal split - from LightGBM #9960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
cc @glemaitre who has looked recently at XGBoost and LightGBM. |
If I understand the method proposed, it should be theoretically independent of the actual split criterion. Following the release of #10325 a solution for categorical splitting should be integrated in a split criterion independent way when implemented. |
Guys, Ty |
@arnonbruno please refer to the discussions and the related referenced literature on #12866 and #4899 |
Closing, this is tracked in #15550 now |
implement split logic, like LightGBM
Optimal split: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf
lightgbm commits: microsoft/LightGBM#762
Optimal Split for Categorical Features
We often convert the categorical features into one-hot coding. However, it is not a good solution in tree learner. The reason is, for the high cardinality categorical features, it will grow the very unbalance tree, and needs to grow very deep to achieve the good accuracy.
Actually, the optimal solution is partitioning the categorical feature into 2 subsets, and there are 2^(k-1) - 1 possible partitions. But there is a efficient solution for regression tree[7]. It needs about k * log(k) to find the optimal partition.
The basic idea is reordering the categories according to the relevance of training target. More specifically, reordering the histogram (of categorical feature) according to it's accumulate values (sum_gradient / sum_hessian), then find the best split on the sorted histogram.
[7] Walter D. Fisher. "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
The text was updated successfully, but these errors were encountered: