[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761
Conversation
In your examples in the docstrings, why do the training sets sometimes have larger values than the testing sets? That would mean training a model on the future and predicting data from the past.
Notice how for TimeSeriesSplit all the training indices precede the test indices:
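For reference, this ordering is easy to verify with scikit-learn's `TimeSeriesSplit` (the sample count and `n_splits` below are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # 8 chronologically ordered samples
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # every training index is strictly smaller than every test index
    print(train_idx, test_idx)
```

Each successive fold extends the training window forward in time, and the test block always comes after it.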
Yes, it is totally possible to train models on future data and validate them on past data.
Help needed. The tests passed locally in my build but failed in some other builds. What could be the cause?
Sorry, I've not had time to look at this yet. Have you checked the build logs?
https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=2928
I finally found the cause: the builds interpret the same expression differently. Linux pylatest_conda interprets it one way, while Linux py35_conda_openblas and Linux py35_np_atlas interpret it another. According to the numpy manual, the first one is the correct interpretation, even for the numpy versions in those builds.
ooooooooooooooo|||||||||||||xxxxxxxxxxxxx|||||||||||||||||||||||oooooooooooooooooooooo

See here for more explanation.
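The diagram above can be sketched in plain NumPy. The helper below is not the PR's actual code; the name `gap_train_test_indices` and its parameters (`gap_before`, `gap_after`) are hypothetical, but it shows the idea: samples inside the gaps on either side of the test block belong to neither set.

```python
import numpy as np

def gap_train_test_indices(n_samples, test_start, test_size, gap_before, gap_after):
    """Hypothetical sketch: chronological split with gaps around the test block."""
    test = np.arange(test_start, test_start + test_size)
    # training samples before the test block, minus the leading gap
    before = np.arange(0, max(test_start - gap_before, 0))
    # training samples after the test block, minus the trailing gap
    after = np.arange(min(test_start + test_size + gap_after, n_samples), n_samples)
    train = np.concatenate([before, after])
    return train, test

train, test = gap_train_test_indices(
    20, test_start=8, test_size=4, gap_before=2, gap_after=3
)
print(train)  # indices 0-5 and 15-19: the gap samples 6-7 and 12-14 are dropped
print(test)   # indices 8-11
```

Dropping the gap samples from the training set is what breaks the temporal dependence between the two sets.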
btw, people have told me we should just implement https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series, though I don't think it has gaps?
I'm not convinced we want to promote the gap approach here. The caret approach seems useful, but it is surprising that there is no way to limit the number of splits: it just generates every possible test set. There is also no gap.
No, it doesn't. Another R package, blockCV, provides this functionality. It is aimed at spatial data, but it also applies to time series (a time series can be treated as a 1-D spatial series).
Seems there's not enough support to add this to scikit-learn. Might be good for
I guess I'll just leave this here 😅 https://koaning.github.io/scikit-lego/user-guide/cross-validation/#timegapsplit |
Time series have temporal dependence, which may cause information leakage during cross-validation.
One way to mitigate this risk is to introduce gaps between the training set and the test set.
This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
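To make the K-fold variant concrete, here is a minimal sketch in NumPy. This is not the PR's code: the name `gap_kfold_splits` and the single symmetric `gap` parameter are assumptions made for illustration.

```python
import numpy as np

def gap_kfold_splits(n_samples, n_folds, gap):
    """Hypothetical sketch: contiguous K folds; drop `gap` samples on each
    side of the test fold from the training set."""
    fold_sizes = np.full(n_folds, n_samples // n_folds)
    fold_sizes[: n_samples % n_folds] += 1  # spread the remainder
    indices = np.arange(n_samples)
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = indices[start:stop]
        # training set = everything outside the test fold and its gaps
        train = np.concatenate([
            indices[: max(start - gap, 0)],
            indices[min(stop + gap, n_samples):],
        ])
        yield train, test
        start = stop

for train, test in gap_kfold_splits(10, n_folds=5, gap=1):
    print(train, test)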
As for the walk-forward strategy, @kykosic is implementing, among other things, a similar feature for the class `TimeSeriesSplit` in #13204. I reckon his implementation is promising, so I refrain from reinventing the wheel.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API. `GapCrossValidator` replaces `BaseCrossValidator` and becomes the base class from which `GapLeavePOut` and `GapKFold` derive. Although not tested, all other subclasses, I believe, can derive from the new `GapCrossValidator`. I put quotation marks around the word "refactor" because I didn't really touch the original code; instead, my code currently coexists with the original.

Classes and functions added:
Related issues and PRs
#6322, #13204
Related users
@kykosic, @amueller, @jnothman, @cbrummitt