TimeSeriesSplit add skip parameter #24243
Comments
Would it not be possible to get the behaviour only setting
Hey @glemaitre, that's not possible as far as I can tell.
@matthiasschuurmans Could you come up with a quick schema to illustrate the skipping? Does the "whole splits" mean that you want rolling windows as in #22523, to avoid reusing the beginning of the time series?
Does that make sense? I want to cover a larger time period with fewer splits: in my specific case, folds of 3 days train and 1 hour test, spread out over a whole year of 5-minute-level data. Say I'd want a fold every week, with 5-minute-granularity data I could put … I don't really understand #22523 and the accompanying PR, to be honest.
OK, the figure makes it clear now. Thanks @matthiasschuurmans.
That's why I wanted to be sure about the requested feature to know if it would be covered by the rolling windows.
Describe the workflow you want to enable
Dear sklearn community, I want to make an hour-ahead forecast on a time series of at least a year with a 5-minute interval. To do CV, I can use TimeSeriesSplit with test_size = 12 (one hour of 5-minute samples). I want to do this for many different times of the day, so I could use n_splits = 24 and have a split for each hour of the day. Then I want this for multiple days, so I could use n_splits = 24 * nr_days. My problem is that I want to randomly test a few hours a day over many days of the year, to see how my forecast does in different periods of the year. Setting n_splits = 24 * 365 would produce too many splits.
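To make the limitation concrete, here is a small sketch of the current behaviour. The three days of synthetic 5-minute data are an assumption chosen only for illustration; the point is that the test windows sit back to back at the end of the series, so 24 one-hour folds cover only the last day.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples_per_hour = 12  # 5-minute data, so 12 samples per hour
n_days = 3             # illustrative length, not from the issue
X = np.arange(n_days * 24 * samples_per_hour).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=24, test_size=samples_per_hour)
splits = list(tscv.split(X))

first_test = splits[0][1]
last_test = splits[-1][1]
# Number of samples between the start of the first test window
# and the end of the last one.
covered = int(last_test[-1] - first_test[0] + 1)
print(len(splits), covered)
```

Here covered equals n_splits * test_size, i.e. exactly one day: to reach further back in time without a skip-like option, n_splits has to grow.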
Describe your proposed solution
I propose a skip parameter that allows skipping a number of samples before making the next split. This can be used to reduce the number of splits while still covering a large time period.

A few examples:
Data:
test_size=1, max_train_size=1, n_splits=3, gap=1, skip=1
test_size=1, max_train_size=1, n_splits=3, gap=1, skip=2
test_size=2, max_train_size=1, n_splits=3, gap=1, skip=2
not enough data
test_size=2, max_train_size=1, n_splits=2, gap=1, skip=2
test_size=2, max_train_size=1, n_splits=2, gap=1, skip=2
test_size=2, n_splits=2, gap=2, skip=2
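Since the outputs of these examples are not shown here, the following is only one possible reading of skip, not the author's implementation. It assumes that skip is the number of unused samples left between two consecutive test windows, and that windows are anchored at the end of the data as in TimeSeriesSplit. The function name skip_splits and the 10-sample series in the usage lines are hypothetical.

```python
def skip_splits(n_samples, n_splits, test_size, gap=0, max_train_size=None, skip=0):
    """Yield (train, test) index lists.

    Assumed semantics (hypothetical): `skip` samples are left unused
    between consecutive test windows.
    """
    stride = test_size + skip  # distance between starts of consecutive test windows
    span = n_splits * test_size + (n_splits - 1) * skip
    first_test_start = n_samples - span
    # At least one training sample must remain before the first gap.
    if first_test_start - gap <= 0:
        raise ValueError("not enough data")
    for k in range(n_splits):
        test_start = first_test_start + k * stride
        train_end = test_start - gap
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, train_end - max_train_size)
        train = list(range(train_start, train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# Usage on an assumed series of 10 samples:
for train, test in skip_splits(10, n_splits=3, test_size=1,
                               gap=1, max_train_size=1, skip=1):
    print(train, test)
# [3] [5]
# [5] [7]
# [7] [9]
```

Under this reading, each extra split costs test_size + skip samples, which is why a larger n_splits can run out of data sooner than it would without skip.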
Describe alternatives you've considered, if relevant
I could generate all those splits, then implement a wrapper in my own codebase that does the skipping.
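A minimal sketch of that wrapper alternative, assuming it is enough to keep every k-th fold of an ordinary TimeSeriesSplit. The helper name every_kth_split and the two weeks of synthetic data are hypothetical.

```python
from itertools import islice

import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def every_kth_split(cv, X, k):
    """Yield every k-th (train, test) pair produced by cv.split(X)."""
    # All splits are still generated internally; only every k-th is passed on.
    return islice(cv.split(X), 0, None, k)


samples_per_hour = 12                                    # 5-minute data
X = np.arange(14 * 24 * samples_per_hour).reshape(-1, 1)  # two weeks (assumed)

cv = TimeSeriesSplit(
    n_splits=24 * 7,                          # one fold per hour for a week
    test_size=samples_per_hour,               # 1 hour test
    max_train_size=3 * 24 * samples_per_hour,  # 3 days train
)
folds = list(every_kth_split(cv, X, k=24))    # keep one fold per day
print(len(folds))
```

This works with the existing API, but the skipped folds are still computed and the spacing can only be a multiple of test_size, which is part of what a dedicated skip parameter would avoid.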
Additional context
I already have this solution implemented locally. The important changes to the function TimeSeriesSplit.split() are:

I'd be happy to make a PR if you think this can be included. Thank you for your time and consideration!