TimeSeriesSplit add skip parameter #24243

matthiasschuurmans · 2022-08-24T08:00:07Z

Describe the workflow you want to enable

Dear sklearn community, I want to make an hour-ahead forecast on a time series of at least a year with an interval of 5 minutes. To do CV, I can use TimeSeriesSplit with test_size = 12. I want to do this for many different times of the day, so I could use n_splits = 24 and have a split for each hour of the day. Then I want this for multiple days, so I could do n_splits = 24 * nr_days. My problem is that I want to randomly test a few hours a day over many days of the year, to see how my forecast does in different periods of the year. Setting n_splits = 24 * 365 for that produces too many splits.

Describe your proposed solution

I propose a skip parameter that skips a number of samples before making the next split. This can be used to reduce the number of splits while still covering a large time period.

A few examples:
Data:

index	foo
2022-01-01	1
2022-01-02	2
2022-01-03	3
2022-01-04	4
2022-01-05	5
2022-01-06	6
2022-01-07	7
2022-01-08	8
2022-01-09	9
2022-01-10	10

test_size=1, max_train_size=1, n_splits=3, gap=1, skip=1

train	test
8	10
6	8
4	6

test_size=1, max_train_size=1, n_splits=3, gap=1, skip=2

train	test
8	10
5	7
2	4

test_size=2, max_train_size=1, n_splits=3, gap=1, skip=2
not enough data

test_size=2, max_train_size=1, n_splits=2, gap=1, skip=2

train	test
7	9, 10
3	5, 6

test_size=2, n_splits=2, gap=2, skip=2

train	test
1, 2	5, 6
1, 2, 3, 4, 5, 6	9, 10

Describe alternatives you've considered, if relevant

I could generate all those splits, then implement a wrapper in my own codebase that does the skipping.
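
A minimal sketch of such a wrapper, in plain Python. `thin_splits` and `keep_every` are hypothetical names used only for illustration, and the sketch assumes the splits arrive in chronological order with back-to-back test windows, so it can only emulate a skip that is a whole multiple of test_size.

```python
def thin_splits(splits, keep_every):
    """Keep every `keep_every`-th split, counting back from the last one."""
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    splits = list(splits)
    # Reverse, take every k-th split, then restore chronological order,
    # so the most recent split is always kept.
    return splits[::-1][::keep_every][::-1]


# Stand-in for the (train_indices, test_indices) pairs of a splitter with
# test_size=1 and 6 splits on 10 samples: test windows 4, 5, ..., 9.
all_splits = [(list(range(0, t)), [t]) for t in range(4, 10)]

kept = thin_splits(all_splits, keep_every=3)
print([test for _, test in kept])
```

The drawback is that every intermediate split still has to be generated first, and n_splits has to be set large enough to span the whole period before thinning.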

Additional context

I already have this solution implemented locally. The important changes to the function TimeSeriesSplit.split() are:

        if n_samples - gap - ((test_size + skip) * n_splits) < 0:
            raise ValueError(
                f"Too many splits={n_splits} for number of samples"
                f"={n_samples} with test_size={test_size}, gap={gap} and skip={skip}."
            )

        test_starts = range(
            n_samples - n_splits * (test_size + skip) + skip,
            n_samples,
            test_size + skip,
        )
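
To make the two changes above easier to check against the examples, here is a self-contained sketch in plain Python. It is not the scikit-learn implementation: `split_with_skip` is a hypothetical standalone function, the size check and the `test_starts` range are copied from the snippet above, and the train-window logic is an assumption modelled on how gap and max_train_size are described in this issue.

```python
def split_with_skip(n_samples, n_splits, test_size, gap=0, skip=0,
                    max_train_size=None):
    """Return (train_indices, test_indices) pairs, oldest split first."""
    stride = test_size + skip
    if n_samples - gap - stride * n_splits < 0:
        raise ValueError(
            f"Too many splits={n_splits} for number of samples"
            f"={n_samples} with test_size={test_size}, gap={gap} and skip={skip}."
        )
    # Test windows are anchored at the end of the series and spaced
    # test_size + skip samples apart.
    test_starts = range(n_samples - n_splits * stride + skip, n_samples, stride)
    splits = []
    for test_start in test_starts:
        train_end = test_start - gap  # gap separates train from test
        train_start = 0
        if max_train_size is not None and max_train_size < train_end:
            train_start = train_end - max_train_size
        train = list(range(train_start, train_end))
        test = list(range(test_start, test_start + test_size))
        splits.append((train, test))
    return splits


# First example from the issue, printed as foo values (index + 1).
for train, test in split_with_skip(10, n_splits=3, test_size=1, gap=1,
                                   skip=1, max_train_size=1):
    print([i + 1 for i in train], [i + 1 for i in test])
```

The splits come out oldest first, so the example tables above list the same splits in reverse order. Note that with skip=0 and an exact fit, the check as written lets the first training set be empty.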

I'd be happy to make a PR if you think this can be included. Thank you for your time and consideration!

glemaitre · 2022-10-17T12:40:32Z

Would it not be possible to get this behaviour by only setting gap? If I understand properly, skip skips some time between splits, which looks exactly like what gap is intended to do.

matthiasschuurmans · 2022-10-18T10:03:59Z

Hey @glemaitre, that's not possible as far as I can tell. gap sits between the train and test parts of a single split, whereas my suggested skip sits between whole splits. The code where gap is used is:

for test_start in test_starts:
    train_end = test_start - gap

glemaitre · 2022-10-18T10:09:30Z

@matthiasschuurmans Could you come up with a quick schema to illustrate the skipping?

Does "whole splits" mean that you want rolling windows as in #22523, to avoid reusing the beginning of the time series?

matthiasschuurmans · 2022-10-18T10:54:20Z

@glemaitre

Does that make sense? gap is between train and test, skip is between the different tests. Right now there is no skip parameter in the class, so the test end of the previous split and the test start of the next split are always next to each other.

I want to cover a larger time period with fewer splits: in my specific case, folds of 3 days of training data and 1 hour of test data, spread out over a whole year of 5-minute data. Say I want a fold every week: with 5-minute granularity I could set max_train_size to 12 (per hour) * 24 (hours per day) * 3 (days) = 864, test_size to 12 and skip to 12 * 24 * 7 = 2016.
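
The arithmetic in this example can be spelled out as a short calculation. This is only a sketch under stated assumptions: gap=0 and a 365-day year are assumed here, and the split count uses the size check proposed earlier in this issue, not anything that exists in scikit-learn.

```python
MINUTES_PER_SAMPLE = 5
samples_per_hour = 60 // MINUTES_PER_SAMPLE   # 12
samples_per_day = 24 * samples_per_hour       # 288

max_train_size = 3 * samples_per_day          # 3 days of training data
test_size = samples_per_hour                  # 1 hour of test data
skip = 7 * samples_per_day                    # samples skipped between tests

# Distance between the starts of two consecutive test windows.
stride = test_size + skip

# Assumed: one year of data and no gap between train and test.
n_samples = 365 * samples_per_day
gap = 0

# Largest n_splits that passes the proposed size check.
max_n_splits = (n_samples - gap) // stride

print(max_train_size, test_size, skip, stride, n_samples, max_n_splits)
```

With these numbers the test windows start 2028 samples apart, which is one week plus one hour. If the folds should fall at exactly the same hour each week, skip would need to be 2016 - 12 = 2004 under the same formula.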

I don't really understand #22523 and the accompanying PR to be honest, max_train_size already allows rolling windows as thomasjpfan also says, and figuring out max n_splits doesn't seem like it needs a whole new class. I want something different.

glemaitre · 2022-10-18T13:04:37Z

OK, the figure makes it clear now. Thanks @matthiasschuurmans.

> I don't really understand #22523 and the accompanying PR to be honest, max_train_size already allows rolling windows as thomasjpfan also says, and figuring out max n_splits doesn't seem like it needs a whole new class. I want something different.

That's why I wanted to be sure about the requested feature to know if it would be covered by the rolling windows.

matthiasschuurmans added Needs Triage Issue requires triage New Feature labels Aug 24, 2022

cmarmo added the module:model_selection label Sep 17, 2022

MSchmidt99 mentioned this issue Oct 7, 2022 in ENH added RollingWindowCV to sklearn.model_selection #24589 (closed)

glemaitre added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Triage Issue requires triage labels Oct 18, 2022
