-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Add GroupTimeSeriesSplit with params - gap and test_size #19996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
And behavior change with max_train_size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold
. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit
over them, removing the need for a lot of the code here (both implementation and test). I think @scikit-learn/core-devs need to decide if this kind of "contiguous group in time" case is sufficiently general to be included in the scikit-learn library.
Note that unlike standard cross-validation methods, successive | ||
training sets are supersets of those that come before them. | ||
|
||
The groups should be continuous. For Example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The groups should be continuous. For Example: | |
The groups should be contiguous. For Example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in
GroupKFold
. With this definition, you could effectively just treat the groups as samples and runTimeSeriesSplit
over them, removing the need for a lot of the code here (both implementation and test)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the assumption of contiguous groups is a bit special relative to our handling of groups in splitters. OTOH, TimeSeriesSplit is really the only one so far that assumes a lot about the order of samples (except insofar as kfold without shuffling attempts to respect input ordering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok . I'll try make it work for non-contiguous groups then . I had originally started on that path and this was the reason for creating a dictionary with group names as keys.
I agree with what @jnothman said |
Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @kylsong? |
… TimeSeriesSplit.
According to your description, I reimplemented |
@KylSong - you can catch that error from
|
To @getgaurav2:
.rst is from @getgaurav2 's branch d6c7f0f (Because it seems to have been merged by parent Issue #14257 and #16236). Please contact me about recovering your contribution in this pull request.
What I Accomplished:
I rewrote the GroupTimeSeriesSplit (resolve #14257), added the
gap
andtest_size
, and changed the behavior ofmax_train_size
(fixes #19072) .Because #19072 is in response to a comment by @albertvillanova, he mentioned that
test_size
should specify the number of groups in the test set.So in this pull request,
gap
,test_size
, andmax_train_size
will be representing the number of groups instead of the number of samples.Design Document
More test details and ideas: https://drive.google.com/file/d/1umQrTNrNs8U2pIACKbCk3gL77tdqoFfL/view?usp=sharing