[WIP] GradientBoostingClassifierCV without early stopping #8226


Closed · wants to merge 7 commits

Conversation

@raghavrv (Member) commented Jan 24, 2017

Spin-off from #7071, without the complications of the early-stopping API.

This PR implements just GradientBoostingClassifierCV (and I intend to restrict it to GBCCV / GBRCV alone, without early-stopping support). It takes advantage of the incremental boosting stages and, for the same performance, is much faster than GridSearchCV.
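(For context, here is a minimal sketch of the warm-start pattern the PR exploits; the dataset, split, and candidate values are illustrative, not the PR's internals. With warm_start=True, each fit call with a larger n_estimators only trains the additional boosting stages, so a single model can be scored at every candidate size instead of being refit from scratch per candidate as GridSearchCV does.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# With warm_start=True, refitting with a larger n_estimators reuses the
# already-trained stages and only adds the new ones.
gbc = GradientBoostingClassifier(warm_start=True, random_state=0)

scores = {}
for n in [25, 50, 100, 200]:           # candidate n_estimators, ascending
    gbc.set_params(n_estimators=n)
    gbc.fit(X_train, y_train)          # incremental, not from scratch
    scores[n] = gbc.score(X_val, y_val)

print(max(scores, key=scores.get), scores)
```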

Results

[benchmark plot omitted] Code for the plot: https://gist.github.com/raghavrv/21d59453de5c6890c89e9f907bcd4044

Thanks @agramfort for the IRL discussions leading to this simpler PR!!

Also ping @amueller, @jnothman, @vighneshbirodkar, @ogrisel and @pprett

TODO

  • Add tests
  • Remove the example (originally: polish the example's docs)
  • GradientBoostingRegressorCV
  • Tests for GBRCV
  • Add links to this at relevant places in the docs
  • Post memory comparison with GridSearchCV

```python
        Sample groups for the cross-validation splitter.
        """
        if isinstance(self.cv_n_estimators, (numbers.Integral, np.integer)):
            print('heee')
```
@jnothman (Member) commented Jan 24, 2017

!

@raghavrv (Member, Author) replied

Arghh sorry, forgot to remove the scaffold :@

@jnothman (Member) left a comment
I think this is probably useful in practice. However, I think adding a use_warm_start parameter to GridSearchCV would automatically handle this case, RandomForests, SGD, etc. without defining a new API. WDYT?

Otherwise, please add this to the list in doc/modules/grid_search.rst, and to appropriate "see also"s.
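(Note: use_warm_start is a hypothetical parameter here; GridSearchCV has no such option. Below is a minimal sketch of the per-estimator pattern such a flag would generalize, using RandomForestClassifier, whose warm_start grows the existing forest the same way.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Same incremental trick as in the gradient boosting case: each fit()
# call only trains the trees added since the previous call.
rf = RandomForestClassifier(warm_start=True, random_state=0)
for n in [10, 50, 100]:                # ascending n_estimators
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)
    print(n, rf.score(X_val, y_val))
```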

@jnothman (Member) commented on the new example file (diff @@ -0,0 +1,77 @@):
I'm not convinced this example is worth having. It's a nice benchmark, but users don't gain from playing with it; as much can be said ("will always improve performance over GridSearchCV for searching over n_estimators") in the narrative docs and what's new.

@raghavrv (Member, Author) replied
Ok, I'll add it as a gist snippet in the PR description...

"""
if isinstance(self.cv_n_estimators, (numbers.Integral, np.integer)):
print('heee')
cv_n_estimators = np.array([self.cv_n_estimators, ], dtype=np.int)
@jnothman (Member) commented:
perhaps this case should be interpreted as range(1, cv_n_estimators + 1)

@raghavrv (Member, Author) replied
Ah yeah, that would be more useful...
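(A small sketch of the suggested interpretation; the helper name _check_n_estimators_range is made up for illustration and is not in the PR.)

```python
import numbers

import numpy as np


def _check_n_estimators_range(cv_n_estimators):
    """Hypothetical input check: an int means "every stage up to that
    value"; an array-like is taken as the explicit grid."""
    if isinstance(cv_n_estimators, (numbers.Integral, np.integer)):
        return np.arange(1, cv_n_estimators + 1)
    return np.asarray(cv_n_estimators, dtype=int)


print(_check_n_estimators_range(5))         # [1 2 3 4 5]
print(_check_n_estimators_range([10, 50]))  # [10 50]
```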

```python
        learning rate shrinks the contribution of each tree by `learning_rate`.
        There is a trade-off between learning_rate and n_estimators.

    cv_n_estimators : int or array-like of shape (n_cv_stages,), (default=100)
```
@jnothman (Member) commented:
Do we use this cv_ prefix elsewhere for *CV objects? We use Cs, alphas, etc. That convention is hard to adopt here. learning_curve uses param_range, and I think n_estimators_range would be okay here.

@raghavrv (Member, Author) replied
+1 for n_estimators_range... Thx!

```python
            estimator.set_params(n_estimators=n_estimators)
            estimator.fit(X_train, y_train, sample_weight=weight_train)
            all_stage_scores[i] = scorer(estimator, X_test, y_test,
                                         sample_weight=weight_test)
```
@jnothman (Member) commented:
This use of sample_weight differs from GridSearchCV. Might be best to leave it out for now, else to note it prominently.

@raghavrv (Member, Author) replied
But eventually we would want to support sample_weight in GridSearchCV, no? I'll push a commit which documents this... Let me know if you still want it removed... (Moreover, GradientBoostingClassifier.fit supports sample_weight, and users might expect a similar interface here.)
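(A sketch of the fold loop under discussion, not the PR's code: the weights are split alongside X and y and used both when fitting and when scoring the test fold. The dataset and metric are illustrative.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)
sample_weight = np.random.RandomState(0).rand(len(y))

for train_idx, test_idx in KFold(n_splits=3).split(X):
    est = GradientBoostingClassifier(random_state=0)
    # Weights follow the indices of each fold, for fitting...
    est.fit(X[train_idx], y[train_idx],
            sample_weight=sample_weight[train_idx])
    # ...and for the test-fold score (this is what differs from
    # GridSearchCV, which does not weight its test scores).
    print(accuracy_score(y[test_idx], est.predict(X[test_idx]),
                         sample_weight=sample_weight[test_idx]))
```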

@raghavrv (Member, Author) commented
Closing in favor of #8230

@raghavrv closed this on Jan 24, 2017
@raghavrv (Member, Author) commented
@jnothman From recent discussions, I think this is simpler and also easier to use...

@raghavrv (Member, Author) commented
I'm reopening this now. I'll address your comments on this soon...

@raghavrv reopened this on Jan 25, 2017
@jnothman (Member) commented Jan 26, 2017 via email

@raghavrv (Member, Author) commented Jan 30, 2017

Have removed the example and addressed your comments. Will add tests, clean up the docs and ping you back!

@raghavrv (Member, Author) commented
Closing again in favor of updated #8230 ;)

@raghavrv closed this on Feb 13, 2017
@jnothman (Member) commented
Hahaha you can reopen to implement early stopping...?

@raghavrv (Member, Author) commented
:p I was wondering about it... What would be the API? Do we perform cv or specify a validation set?

@raghavrv (Member, Author) commented
Also ping @agramfort! :) We discussed this and decided to split it into two different problems: the GradientBoostingCV part and the early-stopping part... Now, with #8230, the warm_start usage is solved... What do you think is the way to go for early stopping?
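(One possible shape of the validation-set variant, purely as a sketch; the patience value and loop bounds are arbitrary, and none of this is settled API.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(warm_start=True, random_state=0)
best_score, stall, patience = -float("inf"), 0, 5

for n in range(1, 201):
    gbc.set_params(n_estimators=n)
    gbc.fit(X_train, y_train)          # adds one stage per iteration
    score = gbc.score(X_val, y_val)
    if score > best_score:
        best_score, stall = score, 0
    else:
        stall += 1
    if stall >= patience:              # no improvement for `patience` stages
        print("stopping at", n, "estimators; best score", best_score)
        break
```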
