8000 Add GroupTimeSeriesSplit with params - gap and test_size by soso-song · Pull Request #19996 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Add GroupTimeSeriesSplit with params - gap and test_size #19996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

soso-song
Copy link

To @getgaurav2:

.rst is from @getgaurav2 's branch d6c7f0f (Because it seems to have been merged by parent Issue #14257 and #16236). Please contact me about recovering your contribution in this pull request.

What I Accomplished:

I rewrote the GroupTimeSeriesSplit (resolve #14257), added the gap and test_size, and changed the behavior of max_train_size (fixes #19072) .
Because #19072 is in response to a comment by @albertvillanova, he mentioned that test_size should specify the number of groups in the test set.
So in this pull request, gap, test_size, and max_train_size will be representing the number of groups instead of the number of samples.

Design Document

More test details and ideas: https://drive.google.com/file/d/1umQrTNrNs8U2pIACKbCk3gL77tdqoFfL/view?usp=sharing

And behavior change with max_train_size.
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test). I think @scikit-learn/core-devs need to decide if this kind of "contiguous group in time" case is sufficiently general to be included in the scikit-learn library.

Note that unlike standard cross-validation methods, successive
training sets are supersets of those that come before them.

The groups should be continuous. For Example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The groups should be continuous. For Example:
The groups should be contiguous. For Example:

Copy link
Contributor
@getgaurav2 getgaurav2 Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test)."

@jnothman - Is this a critique for #16236 also ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the assumption of contiguous groups is a bit special relative to our handling of groups in splitters. OTOH, TimeSeriesSplit is really the only one so far that assumes a lot about the order of samples (except insofar as kfold without shuffling attempts to respect input ordering)

Copy link
Contributor
@getgaurav2 getgaurav2 Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok . I'll try make it work for non-contiguous groups then . I had originally started on that path and this was the reason for creating a dictionary with group names as keys.

@soso-song soso-song changed the title MRG Add GroupTimeSeriesSplit with params - gap and test_size WIP Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021
@soso-song
Copy link
Author

I agree with what @jnothman said

@soso-song soso-song closed this Apr 28, 2021
@soso-song soso-song changed the title WIP Add GroupTimeSeriesSplit with params - gap and test_size Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021
@jnothman
Copy link
Member

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @kylsong?

@soso-song soso-song reopened this Apr 28, 2021
@soso-song
Copy link
Author

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

@getgaurav2
Copy link
Contributor
getgaurav2 commented Apr 29, 2021

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

@KylSong - you can catch that error from TimeSeriesSplit by putting it inside a try block . Raise your own error msg from inside GroupTimeSeriesSplit . Something like ( not exact syntax) :

try :
 TimeSeriesSplit(n_splits=n_splits,
                                    max_train_size=max_train_size,
                                    test_size=test_size, gap=gap)
except ValueError as ve:
     if "Cannot have number of folds" in ve:
           raise ValueError("Cannot have number of folds=4 greater than the number of *groups*=2")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
0