8000 Document the advantage of our new GBDT · Issue #15392 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Document the advantage of our new GBDT #15392

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
qinhanmin2014 opened this issue Oct 29, 2019 · 10 comments
Closed

Document the advantage of our new GBDT #15392

qinhanmin2014 opened this issue Oct 29, 2019 · 10 comments

Comments

@qinhanmin2014
Copy link
Member

I think it will be useful to document the advantage of our new GBDT (perhaps along with some benchmarks), so that users know when to use it.

Some insights from Nicolas:

  • as guillaume said we can 100% control how they interact with the scikit-learn ecosystem (cross-validation, grid search etc.). This isn't the case for the other libraries which may only support part of it. Typically I'm not sure the LightGBM estimators pass our checks
  • scikit-learn is arguably more popular than LightGBM or XGBoost alone, so the estimators have more exposure by being included here
  • The APIs are significantly different. For example I really doubt our API for categorical variables will be similar to that of LightGBM.
  • not everybody has a GPU, and the CPU implem is still order of magnitude faster than the other GBDT estimators that we have

Though personally I'm still not persuaded, e.g.,

  • For the first reason, I think if we really care about interact with sklearn, perhaps a better way is to collaborate with existing GBDT, instead of writing a new one. Another possible way is to be more tolerant, e.g., flatten the prediction in voting.
  • For the second reason, if we only consider GBDT, then I think xgboost, lightgbm, catboost is more famous than scikit-learn.
@StrikerRUS
Copy link
Contributor

Typically I'm not sure the LightGBM estimators pass our checks

Just want to let you know that they actually do (except check_estimators_nan_inf check).

https://github.com/microsoft/LightGBM/blob/11f9682bb72dea34313758f054ccf37886dc7e6e/tests/python_package_test/test_sklearn.py#L220-L236

@qinhanmin2014
Copy link
Member Author

Just want to let you know that they actually do (except check_estimators_nan_inf check).

cc @NicolasHug

Perhaps it's worthwhile to investigate the scikit-learn API of some famous libraries after the release (e.g., xgboost, lightgbm, catboost, keras)

@NicolasHug
Copy link
Member

I don't think it's up to us to tell users why they should be using our implementation rather than some other. We're quite partial ;)

For a user browsing the scikit-learn docs, the advantage is obvious: it's here. No need to install a new package.

They can just use the scikit-learn implementation, and they know it will always be fully compatible with the rest of our API.

By having our own version, we have control over the way we implement new features, to keep them in line with our current ecosystem of tools. LightGBM typically isn't fully compatible, see #13679 or #15127 (comment)

But none of these points are IMO relevant for the UG

@qinhanmin2014
Copy link
Member Author

Let's leave it after the release, I'll take some time to read the code ASAP.

@StrikerRUS
Copy link
Contributor

LightGBM typically isn't fully compatible ...

I guess, for third-party packages, "fully compatible" means that they successfully pass check_estimator test like it's said in https://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator. Am I wrong?

@NicolasHug
Copy link
Member

Unfortunately our check_estimator suite is far from perfect / exhaustive, so it does not guarantee full compatibility. Typically, #13679 isn't something that should happen, yet the estimator still passes the checks according to #15392 (comment)

@StrikerRUS
Copy link
Contributor

Ah, I see! Then I think there will be good to create a meta-issue with list of all known non-covered by check_estimator cases and possible enhancements of the test suit and provide a link to it on that page. Seems that it will greatly help developers who want make their estimators "fully compatible" with scikit-learn.

@glemaitre
Copy link
Member

@NicolasHug I think that we could close this issue probably?

@NicolasHug
Copy link
Member

I would agree but I'd leave the decision up to @qinhanmin2014 ;)

@thomasjpfan
Copy link
Member

Overall, I agree with #15392 (comment) and the previous two comments to close this issue.

As noted in #15392 (comment), the advantage is that it comes packaged with scikit-learn. Any documentation with benchmarks comparing our implementation to others can will become out of date and needs to be re-run. This adds more maintenance on our side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
0