Add GroupTimeSeriesSplit with params - gap and test_size #19996

soso-song · 2021-04-28T05:37:19Z

To @getgaurav2:

.rst is from @getgaurav2 's branch d6c7f0f (Because it seems to have been merged by parent Issue #14257 and #16236). Please contact me about recovering your contribution in this pull request.

What I Accomplished:

I rewrote the GroupTimeSeriesSplit (resolve #14257), added the gap and test_size, and changed the behavior of max_train_size (fixes #19072) .
Because #19072 is in response to a comment by @albertvillanova, he mentioned that test_size should specify the number of groups in the test set.
So in this pull request, gap, test_size, and max_train_size will be representing the number of groups instead of the number of samples.

Design Document

More test details and ideas: https://drive.google.com/file/d/1umQrTNrNs8U2pIACKbCk3gL77tdqoFfL/view?usp=sharing

And behavior change with max_train_size.

jnothman

I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test). I think @scikit-learn/core-devs need to decide if this kind of "contiguous group in time" case is sufficiently general to be included in the scikit-learn library.

jnothman · 2021-04-28T06:54:15Z

sklearn/model_selection/_split.py

+    Note that unlike standard cross-validation methods, successive
+    training sets are supersets of those that come before them.
+
+    The groups should be continuous. For Example:


Suggested change

The groups should be continuous. For Example:

The groups should be contiguous. For Example:

"I think this meaning of group, as a collection of contiguous samples that make up an individual data point, is a bit different from the meaning in GroupKFold. With this definition, you could effectively just treat the groups as samples and run TimeSeriesSplit over them, removing the need for a lot of the code here (both implementation and test)."

@jnothman - Is this a critique for #16236 also ?

Yes, the assumption of contiguous groups is a bit special relative to our handling of groups in splitters. OTOH, TimeSeriesSplit is really the only one so far that assumes a lot about the order of samples (except insofar as kfold without shuffling attempts to respect input ordering)

Ok . I'll try make it work for non-contiguous groups then . I had originally started on that path and this was the reason for creating a dictionary with group names as keys.

soso-song · 2021-04-28T07:09:55Z

I agree with what @jnothman said

jnothman · 2021-04-28T13:02:20Z

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @kylsong?

… TimeSeriesSplit.

soso-song · 2021-04-28T23:55:45Z

Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

getgaurav2 · 2021-04-29T01:30:58Z

< A6F0 blockquote>
Does that mean you intend to reimplement it and provide a version that builds on TimeSeriesSplit, or that you think it doesn't belong here, @KylSong?

According to your description, I reimplemented GroupTimeSeriesSplit in _split.py.
I haven't deleted any test case for GroupTimeSeriesSplit yet. There is a small problem. TimeSeriesSplit could raise an error during runtime: Cannot have number of folds=4 greater than the number of *samples*=2.
But what we want in GroupTimeSeriesSplit is Cannot have number of folds=4 greater than the number of *groups*=2, so I slightly changed the wording of the test case for a temporary fix.
I'm still in the learning stage. Welcome to leave any comments or suggestions.

@KylSong - you can catch that error from TimeSeriesSplit by putting it inside a try block . Raise your own error msg from inside GroupTimeSeriesSplit . Something like ( not exact syntax) :

try :
 TimeSeriesSplit(n_splits=n_splits,
                                    max_train_size=max_train_size,
                                    test_size=test_size, gap=gap)
except ValueError as ve:
     if "Cannot have number of folds" in ve:
           raise ValueError("Cannot have number of folds=4 greater than the number of *groups*=2")

Add GroupTimeSeriesSplit with gap, test_size

ff79da9

And behavior change with max_train_size.

github-actions bot added the module:model_selection label Apr 28, 2021

soso-song added 2 commits April 28, 2021 01:55

fix linting

2504aca

fix docs linting

3d1ffaf

jnothman reviewed Apr 28, 2021

View reviewed changes

soso-song changed the title ~~MRG Add GroupTimeSeriesSplit with params - gap and test_size~~ WIP Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021

soso-song closed this Apr 28, 2021

soso-song changed the title ~~WIP Add GroupTimeSeriesSplit with params - gap and test_size~~ Add GroupTimeSeriesSplit with params - gap and test_size Apr 28, 2021

Reimplement GroupTimeSeriesSplit and provide a version that builds on…

539d958

… TimeSeriesSplit.

soso-song reopened this Apr 28, 2021

soso-song closed this May 10, 2021

getgaurav2 mentioned this pull request Sep 9, 2021

FEA Group aware Time-based cross validation #16236

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add GroupTimeSeriesSplit with params - gap and test_size #19996

Add GroupTimeSeriesSplit with params - gap and test_size #19996

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

	The groups should be continuous. For Example:
	The groups should be contiguous. For Example:

Uh oh!

Add GroupTimeSeriesSplit with params - gap and test_size #19996

Add GroupTimeSeriesSplit with params - gap and test_size #19996

Uh oh!

Conversation

To @getgaurav2:

What I Accomplished:

Design Document

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!