TimeSeriesSplit add skip parameter · Issue #24243 · scikit-learn/scikit-learn · GitHub

TimeSeriesSplit add skip parameter #24243


Open
matthiasschuurmans opened this issue Aug 24, 2022 · 5 comments

@matthiasschuurmans
matthiasschuurmans commented Aug 24, 2022

Describe the workflow you want to enable

Dear sklearn community, I want to make an hour-ahead forecast on a time series of at least a year with a 5-minute interval. To do CV, I can use TimeSeriesSplit with test_size = 12. I want to do this for many different times of the day, so I could use n_splits = 24 and have a split for each hour of the day. I also want this for multiple days, so I could use n_splits = 24 * nr_days. My problem is that I want to test a few hours a day, chosen at random, over many days of the year, to see how my forecast does in different periods of the year. Setting n_splits = 24 * 365 for that gives too many splits.

Describe your proposed solution

I propose a skip parameter that skips a number of samples before making the next split. This reduces the number of splits while still covering a large time period.

A few examples:
Data:

index foo
2022-01-01 1
2022-01-02 2
2022-01-03 3
2022-01-04 4
2022-01-05 5
2022-01-06 6
2022-01-07 7
2022-01-08 8
2022-01-09 9
2022-01-10 10

test_size=1, max_train_size=1, n_splits=3, gap=1, skip=1

train test
8 10
6 8
4 6

test_size=1, max_train_size=1, n_splits=3, gap=1, skip=2

train test
8 10
5 7
2 4

test_size=2, max_train_size=1, n_splits=3, gap=1, skip=2
not enough data

test_size=2, max_train_size=1, n_splits=2, gap=1, skip=2

train | test
7 | 9, 10
3 | 5, 6

test_size=2, n_splits=2, gap=2, skip=2

train | test
1, 2 | 5, 6
1, 2, 3, 4, 5, 6 | 9, 10

Describe alternatives you've considered, if relevant

I could generate all those splits, then implement a wrapper in my own codebase that does the skipping.
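A minimal sketch of what such a wrapper might look like, assuming the splits come from any iterable of (train, test) pairs such as the output of TimeSeriesSplit.split(). The name thin_splits and its keep_every argument are hypothetical and not part of scikit-learn. Keeping every keep_every-th split, counted back from the last one, corresponds to skip = (keep_every - 1) * test_size.

```python
def thin_splits(splits, keep_every):
    """Keep every `keep_every`-th split, counting back from the last one.

    Hypothetical helper: `splits` is any iterable of (train, test) pairs.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    splits = list(splits)  # all splits are generated first
    # Reverse, take every keep_every-th element, restore chronological order.
    return splits[::-1][::keep_every][::-1]


# Hand-built stand-in for contiguous splits on 10 samples with
# test_size=1, gap=1, max_train_size=1 (0-based indices).
contiguous = [([start - 2], [start]) for start in range(2, 10)]

# keep_every=3 leaves 2 unused samples between consecutive test windows.
thinned = thin_splits(contiguous, keep_every=3)
for train, test in thinned:
    print(train, test)
```

The drawback of this approach is that every intermediate split is generated before most are discarded, and the spacing can only be a multiple of test_size.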

Additional context

I already have this solution implemented locally. The important changes to the function TimeSeriesSplit.split() are:

        if n_samples - gap - ((test_size + skip) * n_splits) < 0:
            raise ValueError(
                f"Too many splits={n_splits} for number of samples"
                f"={n_samples} with test_size={test_size}, gap={gap} and skip={skip}."
            )
        test_starts = range(
            n_samples - n_splits * (test_size + skip) + skip,
            n_samples,
            test_size + skip,
        )
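For readers who want to try the idea without patching scikit-learn, here is a standalone sketch built around the two snippets above. The function name time_series_split_with_skip is hypothetical, and the handling of max_train_size is an assumption about how the rest of split() behaves. It returns 0-based positions, so adding 1 gives the foo values used in the examples.

```python
def time_series_split_with_skip(
    n_samples, n_splits, test_size, gap=0, skip=0, max_train_size=None
):
    """Yield (train, test) index lists with `skip` unused samples
    between consecutive test windows. Hypothetical sketch."""
    # Same check as in the snippet above.
    if n_samples - gap - ((test_size + skip) * n_splits) < 0:
        raise ValueError(
            f"Too many splits={n_splits} for number of samples"
            f"={n_samples} with test_size={test_size}, gap={gap} and skip={skip}."
        )
    first_start = n_samples - n_splits * (test_size + skip) + skip
    for test_start in range(first_start, n_samples, test_size + skip):
        train_end = test_start - gap  # gap separates train from test
        train_start = 0
        # Assumed: max_train_size keeps only the most recent training samples.
        if max_train_size is not None and max_train_size < train_end:
            train_start = train_end - max_train_size
        train = list(range(train_start, train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# Second example from the issue:
# test_size=1, max_train_size=1, n_splits=3, gap=1, skip=2 on 10 samples.
for train, test in time_series_split_with_skip(
    10, n_splits=3, test_size=1, gap=1, skip=2, max_train_size=1
):
    print([i + 1 for i in train], [i + 1 for i in test])
```

The splits come out in chronological order, whereas the example tables list the latest split first.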

I'd be happy to make a PR if you think this can be included. Thank you for your time and consideration!

@glemaitre
Member
glemaitre commented Oct 17, 2022

Would it not be possible to get this behaviour by only setting gap? If I understand properly, skip skips some samples between splits, which looks exactly like what gap is intended to do.

@matthiasschuurmans
Author

Hey @glemaitre, that's not possible as far as I can tell. gap applies between the train and test parts of a single split; my suggested skip applies between whole splits. The code where gap is used is:

for test_start in test_starts:
    train_end = test_start - gap

@glemaitre
Member

@matthiasschuurmans Could you come up with a quick schema to illustrate the skipping?

Does "whole splits" mean that you want rolling windows as in #22523, to avoid reusing the beginning of the time series?

@matthiasschuurmans
Author
matthiasschuurmans commented Oct 18, 2022

@glemaitre
[Image: schema illustrating gap and skip]

Does that make sense? gap is between train and test, skip is between the different tests. Right now there is no skip parameter in the class, so the test end of the previous split and the test start of the next split are always next to each other.

I want to cover a larger time period with fewer splits. In my specific case, I want folds of 3 days of training data and 1 hour of test data, spread out over a whole year of 5-minute data. Say I want a fold every week: with 5-minute granularity I could set max_train_size to 12 (per hour) * 24 (hours per day) * 3 = 864, test_size to 12 and skip to 12 * 24 * 7 = 2016.
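The arithmetic above can be checked with a short sketch. The 365-day series length and gap = 0 are assumptions made only for illustration, and the upper bound on n_splits follows the proposed check n_samples - gap - (test_size + skip) * n_splits >= 0.

```python
SAMPLES_PER_HOUR = 60 // 5                # 12 samples at 5-minute resolution
SAMPLES_PER_DAY = SAMPLES_PER_HOUR * 24   # 288

max_train_size = SAMPLES_PER_DAY * 3      # 3 days of training data
test_size = SAMPLES_PER_HOUR              # 1 hour of test data
skip = SAMPLES_PER_DAY * 7                # one week skipped between test windows

n_samples = SAMPLES_PER_DAY * 365         # assumed: exactly one year of data
gap = 0                                   # assumed: no gap between train and test

# Distance between the starts of consecutive test windows.
stride = test_size + skip

# Largest n_splits accepted by the proposed check.
max_n_splits = (n_samples - gap) // stride

# Splits needed to span the same year with adjacent test windows (no skip).
contiguous_n_splits = n_samples // test_size

print(max_train_size, test_size, skip)    # 864 12 2016
print(stride, max_n_splits)               # 2028 51
print(contiguous_n_splits)                # 8760
```

With these numbers, consecutive test windows start 2028 samples apart (one week plus one hour), and about 51 folds span the year, compared with the 24 * 365 = 8760 adjacent folds mentioned in the issue description.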

I don't really understand #22523 and the accompanying PR to be honest, max_train_size already allows rolling windows as thomasjpfan also says, and figuring out max n_splits doesn't seem like it needs a whole new class. I want something different.

@glemaitre
Member

OK, the figure makes it clear now. Thanks @matthiasschuurmans.

> I don't really understand #22523 and the accompanying PR to be honest, max_train_size already allows rolling windows as thomasjpfan also says, and figuring out max n_splits doesn't seem like it needs a whole new class. I want something different.

That's why I wanted to be sure about the requested feature to know if it would be covered by the rolling windows.

@glemaitre added the "Needs Decision - Include Feature" label and removed the "Needs Triage" label on Oct 18, 2022