Add rolling window to sklearn.model_selection.TimeSeriesSplit · Issue #22523 · scikit-learn/scikit-learn · GitHub

Add rolling window to sklearn.model_selection.TimeSeriesSplit #22523


Open
cmp1 opened this issue Feb 17, 2022 · 10 comments · May be fixed by #23780

Comments

@cmp1
cmp1 commented Feb 17, 2022

Describe the workflow you want to enable

I wanted to ask whether any plans exist to implement a rolling/sliding window method in the TimeSeriesSplit class:

[image: diagram contrasting the expanding-window and rolling/sliding-window cross-validation schemes]

Currently, we are limited to using the expanding window type. For many financial time series models where a feature experiences a structural break, having a model whose weights are trained on the entire history can prove suboptimal.

I noted in #13204, specifically svenstehle's comments, that this might be on the horizon?

Describe your proposed solution

Current Implementation

>>> import numpy as np
>>> x = np.arange(15)
>>> cv = TimeSeriesSplit(n_splits=3, gap=2)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2 3] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6] TEST: [ 9 10 11]
TRAIN: [0 1 2 3 4 5 6 7 8 9] TEST: [12 13 14]

Desired outcome

>>> import numpy as np
>>> x = np.arange(10)
>>> cv = TimeSeriesSplit(n_splits='walk_fw', max_train_size=3, max_test_size=1)
>>> for train_index, test_index in cv.split(x):
...      print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [0 1 2] TEST: [3]
TRAIN: [1 2 3] TEST: [4]
TRAIN: [2 3 4] TEST: [5]
TRAIN: [3 4 5] TEST: [6]
TRAIN: [4 5 6] TEST: [7]
TRAIN: [5 6 7] TEST: [8]
TRAIN: [6 7 8] TEST: [9]

Here the 'stride' of the walk forward would be proportionate to the test-set size, or could alternatively walk by the max_train_size parameter.
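For reference, the desired split pattern above can be sketched with a small standalone generator; `rolling_splits` below is a hypothetical helper (not part of scikit-learn) that mimics a fixed-width walk-forward:

```python
import numpy as np

def rolling_splits(n_samples, n_splits, test_size, gap=0):
    # Hypothetical helper: fixed-width walk-forward splits where the
    # train window slides forward by test_size on every fold.
    indices = np.arange(n_samples)
    train_start = 0
    for test_start in range(n_samples - n_splits * test_size, n_samples, test_size):
        train_end = test_start - gap
        yield indices[train_start:train_end], indices[test_start:test_start + test_size]
        train_start += test_size

for train_index, test_index in rolling_splits(10, n_splits=7, test_size=1):
    print("TRAIN:", train_index, "TEST:", test_index)
# TRAIN: [0 1 2] TEST: [3]
# TRAIN: [1 2 3] TEST: [4]
# ...
# TRAIN: [6 7 8] TEST: [9]
```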

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@thomasjpfan
Member

Currently there is rolling window support, where the train set does not grow:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

x = np.arange(15)
cv = TimeSeriesSplit(max_train_size=3, test_size=1)
for train_index, test_index in cv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)

# TRAIN: [7 8 9] TEST: [10]
# TRAIN: [ 8  9 10] TEST: [11]
# TRAIN: [ 9 10 11] TEST: [12]
# TRAIN: [10 11 12] TEST: [13]
# TRAIN: [11 12 13] TEST: [14]

If we want all the windows, we would need to adjust n_splits explicitly:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

x = np.arange(15)
cv = TimeSeriesSplit(n_splits=12, max_train_size=3, test_size=1)
for train_index, test_index in cv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)

# TRAIN: [0 1 2] TEST: [3]
# TRAIN: [1 2 3] TEST: [4]
# TRAIN: [2 3 4] TEST: [5]
# TRAIN: [3 4 5] TEST: [6]
# TRAIN: [4 5 6] TEST: [7]
# TRAIN: [5 6 7] TEST: [8]
# TRAIN: [6 7 8] TEST: [9]
# TRAIN: [7 8 9] TEST: [10]
# TRAIN: [ 8  9 10] TEST: [11]
# TRAIN: [ 9 10 11] TEST: [12]
# TRAIN: [10 11 12] TEST: [13]
# TRAIN: [11 12 13] TEST: [14]

Is the proposal to have n_splits='walk_fw' provide all the windows automatically?

@cmp1
Author
cmp1 commented Apr 19, 2022 via email

@ShehanAT
Contributor
ShehanAT commented Jun 9, 2022

Yes. Ideally, the n_splits parameter could be done away with entirely. The implementation then reduces to choosing how many observations are used in the rolling training window: assuming daily observations for a given financial time series, you could train a model on a rolling 252-day training window, validate on a 63-day window, and walk forward by the validation size.

So would you accept a PR where the n_splits field is removed as a parameter from the TimeSeriesSplit(_BaseKFold) class in the sklearn/model_selection/_split.py file?

Something like this?

class TimeSeriesSplit(_BaseKFold):
    def __init__(self, *, max_train_size=None, test_size=None, gap=0):
        # n_splits is fixed internally since it would no longer be user-facing
        super().__init__(n_splits=5, shuffle=False, random_state=None)
        self.max_train_size = max_train_size
        self.test_size = test_size
        self.gap = gap

@thomasjpfan
Copy link
Member
thomasjpfan commented Jun 9, 2022

We cannot remove n_splits because it breaks backward compatibility, and it has a real use case. We can enhance with a rolling window feature by having n_splits='walk_forward'.

@ShehanAT
Contributor

So would setting n_splits='walk_forward' be equivalent to n_splits=12? Or is it more complex than that?

I ask because, as you mentioned, rolling window support is already included, just not for all windows.
The way I see it, n_splits='walk_forward' == n_splits=12, and setting n_splits=12 should include all windows, right?
For example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

x = np.arange(15)
cv = TimeSeriesSplit(n_splits='walk_forward', max_train_size=3, test_size=1)
for train_index, test_index in cv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)

# TRAIN: [0 1 2] TEST: [3]
# TRAIN: [1 2 3] TEST: [4]
# TRAIN: [2 3 4] TEST: [5]
# TRAIN: [3 4 5] TEST: [6]
# TRAIN: [4 5 6] TEST: [7]
# TRAIN: [5 6 7] TEST: [8]
# TRAIN: [6 7 8] TEST: [9]
# TRAIN: [7 8 9] TEST: [10]
# TRAIN: [ 8  9 10] TEST: [11]
# TRAIN: [ 9 10 11] TEST: [12]
# TRAIN: [10 11 12] TEST: [13]
# TRAIN: [11 12 13] TEST: [14]

> We cannot remove n_splits because it breaks backward compatibility, and it has a real use case. We can enhance with a rolling window feature by having n_splits='walk_forward'.

@thomasjpfan
Member

> The way I see it, n_splits='walk_forward' == n_splits=12, and setting n_splits=12 should include all windows, right?

n_splits=12 only works when max_train_size=3, test_size=1, and X.shape[0] == 15.

For example, if max_train_size=10 and test_size=1, then n_splits needs to be set to 4 to properly walk forward:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

x = np.arange(15)
cv = TimeSeriesSplit(n_splits=4, max_train_size=10, test_size=1)
for train_index, test_index in cv.split(x):
    print("TRAIN:", train_index, "TEST:", test_index)

n_splits="walk_forward" would automatically compute the proper value of n_splits based on X.shape[0], test_size, and max_train_size.
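One plausible formula for that automatic computation (a sketch; `walk_forward_n_splits` is a hypothetical helper name, and it assumes every full-length train window should be used) is:

```python
def walk_forward_n_splits(n_samples, max_train_size, test_size, gap=0):
    # Hypothetical helper: the first test fold starts right after the
    # first complete train window (plus the gap), and each subsequent
    # fold advances by test_size until the final sample is consumed.
    return (n_samples - max_train_size - gap) // test_size

print(walk_forward_n_splits(15, max_train_size=3, test_size=1))  # -> 12
```

With 15 samples, max_train_size=3, and test_size=1 this recovers the n_splits=12 used in the earlier example.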

@MSchmidt99
MSchmidt99 commented Oct 6, 2022

I have created a class for this in the past for use in personal projects, where max_train_size and test_size are calculated automatically from the desired number of splits and the proportion of each window allocated to validation. The calculation works out so that the window size equals the number of samples times the reciprocal of 1 + (validation proportion * (n_splits - 1)).

For example: given a validation proportion of 0.2 (20% of each window is allocated to validation) and a number of folds at 4, the totality of the time steps given must be made up by 1 + (0.2 * 3) = 1.6 windows placed end-to-end. Given a number of time steps N=1000 we get that each window is (1 / 1.6) * 1000 = 625 steps, thus the validation and also shifting amount of each window is 625 * 0.2 = 125 steps.

Another example using the values of previous discussions: if we set max_train_size=3 and test_size=1 it is the same as setting the validation proportion to 0.25 and the number of folds to 12. We then get that for a given sample size we will need to place 1 + (0.25 * 11) = 3.75 windows end-to-end, and thus our window size (in the class it is given the variable batch_size) is 26.66% of the total sample size. At 15 time steps this leaves us at 4 samples per window, with the first 3 used for training and the last 1 used for validation.
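Both worked examples reduce to the same arithmetic; a minimal sketch (`window_size` is my own name for illustration, not from the PR):

```python
def window_size(n_samples, val_prop, n_splits):
    # Each window shifts forward by its validation slice, so
    # 1 + val_prop * (n_splits - 1) windows tile the full series.
    return n_samples / (1 + val_prop * (n_splits - 1))

w = window_size(1000, 0.2, 4)
print(w, w * 0.2)                 # -> 625.0 125.0
print(window_size(15, 0.25, 12))  # -> 4.0
```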

In my use case I also needed support for longitudinal data, thus the class allows for a time column to be used for window definition as well. An example of the class applied to multiple stocks is shown below. The code for the class and PR for scikit-learn inclusion are both given at #24589.

[image: example of the class applied to multiple stocks]

If inclusion is decided against, you may copy the class into a Python file, remove the _BaseKFold superclass, and add an n_splits getter. The only three requirements when used as a standalone module are as follows:

import numpy as np
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples

@msat59
msat59 commented May 25, 2023

To add the rolling window method, we don't need to set the train size as an input; the code can handle it if written properly. I was working on this for my own project; here is what I wrote to support both the expanding and rolling windows. I can create a PR if agreed.

from sklearn.model_selection._split import _BaseKFold
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples
import numpy as np

class TimeSeriesSplit(_BaseKFold):

    def __init__(self, n_splits=5, *, max_train_size=None, test_size=None, gap=0, window_method='expanding'):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size
        self.test_size = test_size
        self.gap = gap
        self.window_method = window_method

    def split(self, X, y=None, groups=None):
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        gap = self.gap
        test_size = (
            self.test_size if self.test_size is not None else n_samples // n_folds
        )

        # Make sure we have enough samples for the given split parameters
        if n_folds > n_samples:
            raise ValueError(
                f"Cannot have number of folds={n_folds} greater"
                f" than the number of samples={n_samples}."
            )
        if n_samples - gap - (test_size * n_splits) <= 0:
            raise ValueError(
                f"Too many splits={n_splits} for number of samples"
                f"={n_samples} with test_size={test_size} and gap={gap}."
            )
        
        if self.window_method not in ('expanding', 'rolling'):
            raise ValueError(f"Unsupported window method: {self.window_method}")

        indices = np.arange(n_samples)

        test_starts = range(n_samples - n_splits * test_size, n_samples, test_size)
        train_start = 0
        for test_start in test_starts:
            train_end = test_start - gap

            if self.max_train_size and self.max_train_size < train_end:
                yield (
                    indices[train_end - self.max_train_size : train_end],
                    indices[test_start : test_start + test_size],
                )
            else:
                yield (
                    indices[train_start:train_end],
                    indices[test_start : test_start + test_size],
                )

            # A rolling window slides the start of the train set forward
            # by one test fold per split; an expanding window keeps it at 0.
            if self.window_method == 'rolling':
                train_start += test_size

@cgarciga

Hi, what is the status of this? This feature would be very useful. @msat59's code seems to work for me.

@AhmedThahir

Any updates?
