Add GroupTimeSeriesSplit with params - gap and test_size#19996
Add GroupTimeSeriesSplit with params - gap and test_size#19996soso-song wants to merge 4 commits intoscikit-learn:mainfrom
Conversation
And behavior change with max_train_size.
There was a problem hiding this comment.
I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test). I think @scikit-learn/core-devs need to decide if this kind of "contiguous group in time" case is sufficiently general to be included in the scikit-learn library.
| Note that unlike standard cross-validation methods, successive | ||
| training sets are supersets of those that come before them. | ||
|
|
||
| The groups should be continuous. For Example: |
There was a problem hiding this comment.
| The groups should be continuous. For Example: | |
| The groups should be contiguous. For Example: |
There was a problem hiding this comment.
"I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in
GroupKFold. With this definition, you could effectively just treat the groups as samples and runTimeSeriesSplitover them, removing the need for a lot of the code here (both implementation and test)."
There was a problem hiding this comment.
Yes, the assumption of contiguous groups is a bit special relative to our handling of groups in splitters. OTOH, TimeSeriesSplit is really the only one so far that assumes a lot about the order of samples (except insofar as kfold without shuffling attempts to respect input ordering)
There was a problem hiding this comment.
Ok . I'll try make it work for non-contiguous groups then . I had originally started on that path and this was the reason for creating a dictionary with group names as keys.
|
I agree with what @jnothman said |
|
Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @kylsong? |
… TimeSeriesSplit.
According to your description, I reimplemented |
@KylSong - you can catch that error from |
To @getgaurav2:
.rst is from @getgaurav2 's branch d6c7f0f (Because it seems to have been merged by parent Issue #14257 and #16236). Please contact me about recovering your contribution in this pull request.
What I Accomplished:
I rewrote the GroupTimeSeriesSplit (resolve #14257), added the
gapandtest_size, and changed the behavior ofmax_train_size(fixes #19072) .Because #19072 is in response to a comment by @albertvillanova, he mentioned that
test_sizeshould specify the number of groups in the test set.So in this pull request,
gap,test_size, andmax_train_sizewill be representing the number of groups instead of the number of samples.Design Document
More test details and ideas: https://drive.google.com/file/d/1umQrTNrNs8U2pIACKbCk3gL77tdqoFfL/view?usp=sharing