8000 Add GroupTimeSeriesSplit with params - gap and test_size by soso-song · Pull Request #19996 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Conversation

soso-song
Copy link

To @getgaurav2:

.rst is from @getgaurav2 's branch d6c7f0f (Because it seems to have been merged by parent Issue #14257 and #16236). Please contact me about recovering your contribution in this pull request.

What I Accomplished:

I rewrote the GroupTimeSeriesSplit (resolve #14257), added the gap and test_size, and changed the behavior of max_train_size (fixes #19072) .
Because #19072 is in response to a comment by @albertvillanova, he mentioned that test_size should specify the number of groups in the test set.
So in this pull request, gap, test_size, and max_train_size will be representing the number of groups instead of the number of samples.

Design Document

More test details and ideas: https://drive.google.com/file/d/1umQrTNrNs8U2pIACKbCk3gL77tdqoFfL/view?usp=sharing

And behavior change with max_train_size.
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test). I think @scikit-learn/core-devs need to decide if this kind of "contiguous group in time" case is sufficiently general to be included in the scikit-learn library.

Note that unlike standard cross-validation methods, successive
training sets are supersets of those that come before them.

The groups should be continuous. For Example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The groups should be continuous. For Example:
The groups should be contiguous. For Example:

Copy link
Contributor
@getgaurav2 getgaurav2 Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test)."

@jnothman - Is this a critique for #16236 also ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the assumption of contiguous groups is a bit special relative to our handling of groups in splitters. OTOH, TimeSeriesSplit is really the only one so far that assumes a lot about the order of samples (except insofar as kfold without shuffling attempts to respect input ordering)

Copy link
Contributor
@getgaurav2 getgaurav2 Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok . I'll try make it work for non-contiguous groups then . I had originally started on that path and this was the reason for creating a dictionary with group names as keys.

@soso-song soso-song changed the title MRG Add GroupTimeSeriesSplit with params - gap and test_size WIP Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021
@soso-song
Copy link
Author

I agree with what @jnothman said

8000

@soso-song soso-song closed this Apr 28, 2021
@soso-song soso-song changed the title WIP Add GroupTimeSeriesSplit with params - gap and test_size Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021
@jnothman
Copy link
Member

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @kylsong?

@soso-song soso-song reopened this Apr 28, 2021
@soso-song
Copy link
Author

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

@getgaurav2
Copy link
Contributor
getgaurav2 commented Apr 29, 2021

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

@KylSong - you can catch that error from TimeSeriesSplit by putting it inside a try block . Raise your own error msg from inside GroupTimeSeriesSplit . Something like ( not exact syntax) :

try :
 TimeSeriesSplit(n_splits=n_splits,
                                    max_train_size=max_train_size,
                                    test_size=test_size, gap=gap)
except ValueError as ve:
     if "Cannot have number of folds" in ve:
           raise ValueError("Cannot have number of folds=4 greater than the number of *groups*=2")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
0