[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761


Closed · wants to merge 17 commits

Conversation

@WenjieZ commented May 1, 2019

Time series have temporal dependence, which may cause information to leak during cross-validation.
One way to mitigate this risk is to introduce gaps between the training set and the test set.
This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
As for the walk-forward variant, @kykosic is implementing a similar feature (among others) for the TimeSeriesSplit class in #13204. I find his implementation promising, so I refrain from reinventing the wheel.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API. GapCrossValidator replaces BaseCrossValidator and becomes the base class from which GapLeavePOut and GapKFold derive. Although not tested, I believe all the other subclasses could also derive from the new GapCrossValidator. I put quotation marks around the word refactor because I didn't actually touch the original code; instead, my code currently coexists with it.
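
A minimal sketch of that structure (written for this summary, not the PR's actual code; the hook name _iter_test_indices is an assumption modeled on BaseCrossValidator):

    import numpy as np

    class GapCrossValidator:
        def __init__(self, gap_before=0, gap_after=0):
            self.gap_before = gap_before
            self.gap_after = gap_after

        def split(self, X):
            n = len(X)
            for test in self._iter_test_indices(n):
                lo = max(test[0] - self.gap_before, 0)
                hi = min(test[-1] + self.gap_after, n - 1)
                keep = np.ones(n, dtype=bool)
                keep[lo:hi + 1] = False   # carve out the test block plus both gaps
                yield np.flatnonzero(keep), np.asarray(test)

        def _iter_test_indices(self, n):  # subclasses decide the test folds
            raise NotImplementedError

    class GapKFold(GapCrossValidator):
        def __init__(self, n_splits=5, gap_before=0, gap_after=0):
            super().__init__(gap_before, gap_after)
            self.n_splits = n_splits

        def _iter_test_indices(self, n):
            yield from np.array_split(np.arange(n), self.n_splits)

With n_splits=5, gap_before=3 and gap_after=4 on 10 samples, this sketch reproduces the folds quoted in the docstring example further down the thread.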

Classes and functions added (see the usage sketch after the list):

  • GapCrossValidator
    • GapLeavePOut
    • GapKFold
  • gap_train_test_split
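
A hedged usage sketch of these additions (the import path assumes the PR's placement in sklearn.model_selection; the gap_size and test_size arguments of gap_train_test_split are inferred from the PR description, not a confirmed signature):

    import numpy as np
    from sklearn.model_selection import GapKFold, GapLeavePOut, gap_train_test_split

    X, y = np.arange(20).reshape(10, 2), np.arange(10)

    # K-fold with 3 skipped samples before and 4 after each test fold
    kf = GapKFold(n_splits=5, gap_before=3, gap_after=4)
    for train_index, test_index in kf.split(X):
        pass  # fit/score here

    # leave-2-out with a 1-sample gap on each side of the test pair
    lpo = GapLeavePOut(p=2, gap_before=1, gap_after=1)

    # naive split: last 30% for testing, the 2 samples before it in neither set
    X_train, X_test, y_train, y_test = gap_train_test_split(
        X, y, test_size=0.3, gap_size=2)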

Related issues and PRs

#6322, #13204

Related users

@kykosic, @amueller, @jnothman, @cbrummitt

@WenjieZ WenjieZ changed the title Feature: Cross-validation for time series (inserting gaps between the training and the testing) [WIP] Cross-validation for time series (inserting gaps between the training and the testing) May 1, 2019
@cbrummitt (Contributor) commented:

In your examples in the docstrings, why do the training sets sometimes contain larger indices than the test sets? That would mean training a model on the future and predicting data from the past.

>>> import numpy as np
>>> kf = GapKFold(n_splits=5, gap_before=3, gap_after=4)
>>> for train_index, test_index in kf.split(np.arange(10)):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [6 7 8 9] TEST: [0 1]
TRAIN: [8 9] TEST: [2 3]
TRAIN: [0] TEST: [4 5]
TRAIN: [0 1 2] TEST: [6 7]
TRAIN: [0 1 2 3 4] TEST: [8 9]

Notice how for TimeSeriesSplit all the training indices precede the test indices:

>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=5)
>>> for train_index, test_index in tscv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]

@WenjieZ (Author) commented May 2, 2019

Yes, it is entirely possible to train models on future data and validate them on past data.
This is theoretically valid for, say, stationary time series.

@WenjieZ (Author) commented May 7, 2019

Help needed: the tests pass locally in my build but fail in some of the other (CI) builds. What could be the cause?

@jnothman (Member) commented May 8, 2019 via email

@WenjieZ (Author) commented May 10, 2019

I finally found the cause, namely the differing interpretations of

a[[False, True, True, False, True]]

where a is a numpy ndarray.

Linux pylatest_conda interprets it as

a[[1, 2, 4]]

Linux py35_conda_openblas and Linux py35_np_atlas interpret it as

a[[0, 1, 1, 0, 1]]

According to the numpy manual, the first one is the correct interpretation, even for numpy v1.11.
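
A small repro of the discrepancy (written for this summary; converting the mask with np.asarray is one standard way to make the behaviour unambiguous):

    import numpy as np

    a = np.arange(5) * 10                    # array([ 0, 10, 20, 30, 40])
    mask = [False, True, True, False, True]

    # Correct (boolean-mask) reading: keep the positions of the True entries,
    # i.e. a[[1, 2, 4]] -> array([10, 20, 40])
    print(a[np.asarray(mask, dtype=bool)])

    # The old, incorrect reading coerced the list to integers 0/1, i.e.
    # a[[0, 1, 1, 0, 1]] -> array([ 0, 10, 10,  0, 10])
    print(a[np.asarray(mask, dtype=int)])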

@WenjieZ WenjieZ changed the title [WIP] Cross-validation for time series (inserting gaps between the training and the testing) [MRG] Cross-validation for time series (inserting gaps between the training set and the test set) May 10, 2019
@jnothman (Member) commented:
  • What benefit does gap_after provide?
  • Why can't you implement GapKFold as a small change to TimeSeriesSplit?

@WenjieZ (Author) commented May 29, 2019

gap_before inserts a gap before the test set, and gap_after inserts a gap after the test set. The subset after this second gap is part of the training set.

ooooooooooooooo|||||||||||||xxxxxxxxxxxxx|||||||||||||||||||||||oooooooooooooooooooooo
----training set---------gap-------test set------------gap-----------------training set

See here for more explanation.
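
To make the diagram concrete, here is a hypothetical split (sizes chosen purely for illustration) showing that the training set survives on both sides of the two gaps:

    import numpy as np

    n = 20
    test = np.arange(8, 12)                    # test set: indices 8..11
    gap_before, gap_after = 3, 4

    keep = np.ones(n, dtype=bool)
    keep[test[0] - gap_before : test[-1] + gap_after + 1] = False
    print(np.flatnonzero(keep))                # [ 0  1  2  3  4 16 17 18 19]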

@amueller (Member) commented May 29, 2019

btw people have told me we should just implement https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series

though I don't think it has gaps?

@jnothman (Member) commented May 29, 2019 via email

@WenjieZ (Author) commented May 30, 2019

> though I don't think it has gaps?

No, it doesn't. Another R package, blockCV, provides this functionality. It is aimed at spatial data, but it also applies to time series (a time series can be viewed as a 1-D spatial series).

@amueller amueller added the Needs Decision Requires decision label Aug 6, 2019
Base automatically changed from master to main January 22, 2021 10:51
@adrinjalali (Member) commented:
Seems there's not enough support to add this to scikit-learn. Might be good for skrub?

cc @glemaitre @koaning @GaelVaroquaux @ogrisel

@adrinjalali adrinjalali closed this Mar 6, 2024
@koaning commented Mar 6, 2024

I guess I'll just leave this here 😅

https://koaning.github.io/scikit-lego/user-guide/cross-validation/#timegapsplit
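
For reference, the scikit-lego splitter is duration-based rather than index-based. A rough usage sketch (argument names recalled from the linked docs, so double-check there for the exact signature):

    from datetime import timedelta
    import pandas as pd
    from sklego.model_selection import TimeGapSplit

    df = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=30, freq="D")})
    cv = TimeGapSplit(
        date_serie=df["date"],               # timestamps used for ordering
        train_duration=timedelta(days=10),
        valid_duration=timedelta(days=5),
        gap_duration=timedelta(days=2),      # skipped between train and valid
    )
    for train_idx, valid_idx in cv.split(df):
        print(len(train_idx), len(valid_idx))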
