Tree Optimal split - from LightGBM · Issue #9960 · scikit-learn/scikit-learn · GitHub
Tree Optimal split - from LightGBM #9960


Closed

rspadim opened this issue Oct 19, 2017 · 6 comments

Comments

rspadim commented Oct 19, 2017

Implement split logic like LightGBM's optimal split: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf

lightgbm commits: microsoft/LightGBM#762


Optimal Split for Categorical Features

We often convert categorical features into one-hot encoding. However, this is not a good solution in a tree learner, because for high-cardinality categorical features it grows a very unbalanced tree, which needs to become very deep to achieve good accuracy.

Actually, the optimal solution is to partition the categorical feature into 2 subsets, and there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees [7]: it needs only about O(k log k) work to find the optimal partition.
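To make the 2^(k-1) - 1 count concrete, here is a small brute-force sketch (function name and structure are my own, purely for illustration) that enumerates every way to split a category set into two non-empty subsets:

```python
from itertools import combinations

def two_subset_partitions(categories):
    """Brute-force enumeration of all splits of a category set into two
    non-empty subsets (illustration only; name is hypothetical)."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    parts = []
    # Fix `first` on the left side so mirror-image partitions are not
    # counted twice; that leaves 2**(k-1) - 1 distinct partitions.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:  # skip the trivial split with an empty right side
                parts.append((left, right))
    return parts

# k = 4 categories -> 2**(4 - 1) - 1 = 7 possible partitions
print(len(two_subset_partitions([0, 1, 2, 3])))  # -> 7
```

The exponential growth in k is exactly why the sorted-histogram shortcut described next matters.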

The basic idea is to reorder the categories according to their relevance to the training target. More specifically, reorder the histogram (of the categorical feature) according to its accumulated values (sum_gradient / sum_hessian), then find the best split on the sorted histogram.
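A minimal sketch of that idea, assuming the usual GBDT split gain G²/H with an L2 regularizer `lam` (function name, signature, and the regularizer are my own illustration, not LightGBM's actual API):

```python
from collections import defaultdict

def best_categorical_split(categories, gradients, hessians, lam=1.0):
    """Sketch of the sorted-histogram categorical split.

    categories: per-sample category id; gradients/hessians: per-sample
    first/second-order loss derivatives. Returns (left_categories, gain).
    """
    # 1. Accumulate gradient/hessian sums per category (the "histogram").
    G, H = defaultdict(float), defaultdict(float)
    for c, g, h in zip(categories, gradients, hessians):
        G[c] += g
        H[c] += h
    # 2. Sort categories by sum_gradient / sum_hessian.
    order = sorted(G, key=lambda c: G[c] / (H[c] + lam))
    G_tot, H_tot = sum(G.values()), sum(H.values())
    score = lambda g, h: g * g / (h + lam)
    # 3. Scan prefix splits of the sorted order: per Fisher [7], the optimal
    #    2-subset partition is one of these k - 1 prefixes.
    best_gain, best_k = float("-inf"), 0
    G_l = H_l = 0.0
    for k, c in enumerate(order[:-1], start=1):
        G_l += G[c]
        H_l += H[c]
        gain = (score(G_l, H_l)
                + score(G_tot - G_l, H_tot - H_l)
                - score(G_tot, H_tot))
        if gain > best_gain:
            best_gain, best_k = gain, k
    return set(order[:best_k]), best_gain
```

For example, with categories {0, 1} carrying negative gradients and {2, 3} positive ones, the scan separates the two groups without ever enumerating all 7 partitions.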


[7] Walter D. Fisher. "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.

http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479

Member
TomDLT commented Oct 20, 2017

Categorical split in tree-based methods is definitely a feature we want.

There has been a lot of work on it in #4899 (and also #8030).
Is the implementation in LightGBM different?

Member
lesteve commented Oct 20, 2017

cc @glemaitre who has looked recently at XGBoost and LightGBM.

@Gitman-code

If I understand the proposed method, it should be theoretically independent of the actual split criterion. Following the release of #10325, a solution for categorical splitting should be integrated in a criterion-independent way when implemented.

@arnonbruno

Guys,
Would you please be able to explain how this split happens, with an example?
I'm not sure I understood how categorical variables are handled.

Ty

@adrinjalali
Member

@arnonbruno please refer to the discussions and the related referenced literature in #12866 and #4899.

@NicolasHug
Member

Closing, this is tracked in #15550 now.
