[MRG] Feature: Additional TimeSeriesSplit Functionality #13204
Conversation
FWIW some of this was discussed in #6322 (which could perhaps be closed as
it predated TimeSeriesSplit)
Thanks, I'll add that to the pull request in case it's relevant.
I found this PR after I made issue #13666. I think the gap is an interesting idea.
Yes, this pull request adds the functionality you are requesting without changing any existing API. Perhaps, as you suggest, more documentation could be written around this, and I would be happy to write some code examples to accompany this change if it is accepted.
Sorry for the slow review. I like your work.
Let's consider caret's approach (https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series) for comparison (thanks for the reference @amueller) and consider the use cases of `cross_validate` and `learning_curve`.
- Training set min size: this PR assumes >= test_size; caret requires a minimum to be specified.
- Fixed/max size training set: specified as a number of samples here; specified by the boolean `fixed` in caret.
- Test set overlap: non-overlapping in this PR; consecutive test sets overlap by (test_size - 1) in caret.
- Gap: here removed from the train set; there no support.
- `n_splits`: here must be specified; there implied by test_size and min_train_size.
For `learning_curve` I can see the benefit of having overlapping test sets. I don't really see it as a great idea in cross-validation, though it may not hurt much, and it would help when the dataset is small.
Except for the required test-set overlap and the lack of a gap, I quite like caret's parametrisation. If I was designing this from scratch, I might have (min_train_size, max_train_size (int or 'fixed' or None), test_size, gap); a rough sketch of what such a parametrisation could generate follows after the questions below.
Questions:
- Should we allow n_splits=None if test_size is specified? This would generate as many test sets as possible... does this get too confusing?
- `train_size >= test_size` seems reasonable, but should we allow the user to specify a larger min train size?
- Should we support `max_train_size='fixed'`?
- If we support `n_splits=None`, should we support `test_overlap`, where a value of `-1` would correspond to caret's version? This all feels a bit too complex. Should we consider deprecating `n_splits` and moving to the parametrisation I suggest above?
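For illustration, here is a rough, non-authoritative sketch of what the (min_train_size, max_train_size, test_size, gap) parametrisation suggested above could generate; `walk_forward_splits` and its parameters are hypothetical, not part of any existing API:

```python
import numpy as np

def walk_forward_splits(n_samples, min_train_size, test_size, gap=0,
                        max_train_size=None):
    """Sketch of the proposed parametrisation; n_splits is implied by the
    other parameters rather than specified explicitly."""
    indices = np.arange(n_samples)
    # The first test set starts once min_train_size training samples plus the
    # gap are available; each later test set starts test_size samples further on.
    first_test_start = min_train_size + gap
    for test_start in range(first_test_start, n_samples - test_size + 1, test_size):
        train_end = test_start - gap
        if max_train_size == 'fixed':
            # 'fixed' keeps every training window at exactly min_train_size
            train_start = train_end - min_train_size
        elif max_train_size is not None:
            train_start = max(0, train_end - max_train_size)
        else:
            train_start = 0
        yield indices[train_start:train_end], indices[test_start:test_start + test_size]

# 12 samples, minimum of 4 training samples, 2-sample test sets, 1-sample gap:
for train, test in walk_forward_splits(12, min_train_size=4, test_size=2, gap=1):
    print(train, test)
# [0 1 2 3] [5 6]
# [0 1 2 3 4 5] [7 8]
# [0 1 2 3 4 5 6 7] [9 10]
```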
sklearn/model_selection/_split.py
@@ -799,21 +832,32 @@ def split(self, X, y=None, groups=None):
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        gap_size = self.gap_size
        test_size = self.test_size if self.test_size else n_samples // n_folds
Perhaps this should be `if self.test_size is not None` so that 0 and None are not treated the same.
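As a small aside illustrating the point (not part of the PR itself): with a plain truthiness check, `test_size=0` and `test_size=None` both fall back to the default, whereas an explicit `is not None` check only lets `None` do so:

```python
for test_size in (None, 0, 5):
    # truthiness check: 0 and None are indistinguishable
    truthy = test_size if test_size else 'default'
    # explicit check: only None falls back to the default
    explicit = test_size if test_size is not None else 'default'
    print(test_size, truthy, explicit)
# None default default
# 0 default 0
# 5 5 5
```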
Perhaps the else should be `(n_samples - gap) // n_folds` so that train_size remains > gap.
I've updated this to use `self.test_size is not None`. I'm not convinced that the train_size necessarily needs to be constrained to be greater than the gap; should that be a decision left to the user?
Edit: I might have misunderstood, but is `max_train_size` supposed to implement a sliding window? If that's the case please clarify and let's share opinions. I think we can keep it simpler with just `train_size` and `test_size` and more string options for `n_splits`.

My opinion is that I like the idea of having sensible defaults and as much control for the user as possible:

- Add an option `n_splits='walk_fw'`; it should walk forward with the specified `min_train_size` and `test_size` --> keep the functionality of `n_splits` as it is. Idea: also implement walk forward as a sliding window if possible. See more on that below.
- What I would love to see added from caret as well is the idea of a sliding window. Please consider including this as well. Caret calls it `fixedWindow` though. The idea is to include only examples inside a fixed time window and then walk forward and move the window along --> could be implemented along the lines of `if n_splits == 'window' and min_train_size > 1 and test_size > 1`. Here, `test_size` is the prediction horizon of the window, and `min_train_size` is the train size of the window in each fold, which gets moved up by 1 each time.
- Allow the user to specify a `min_train_size` smaller than `gap`, if they so desire (it's a parameter to be explored imo). I agree with you that a reasonable default is `min_train_size >= gap if gap > 0`.
- Allow the user to specify a `min_train_size` both smaller and larger than `test_size`, but keep the default >= `test_size`.
- `max_train_size='fixed'` --> I'm missing something: what is the use of a max_train_size when doing walk forward while growing the train set with the previous test set? Should this implement a sliding window?
- What is the idea behind test set overlap from caret for walk forward in our case? We would have overlap if we walk forward by 1 time unit and our test_size spans more than 1 time unit. Is that the idea? I think it's methodologically OK. I may be missing something though. On that note, also consider the sliding window: we have overlap here for certain if we have a horizon or `test_size > 1`; a rough sliding-window sketch follows below.
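To make the sliding-window idea concrete, here is a minimal sketch (the function name and parameters are hypothetical, loosely mirroring caret's fixedWindow behaviour, not an existing scikit-learn API):

```python
import numpy as np

def sliding_window_splits(n_samples, train_size, test_size, gap=0, step=None):
    """Fixed-size training window that slides forward through the series."""
    # step=1 reproduces caret-style overlapping test sets;
    # step=test_size gives non-overlapping test sets as in this PR.
    step = test_size if step is None else step
    indices = np.arange(n_samples)
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        test_start = start + train_size + gap
        yield (indices[start:start + train_size],
               indices[test_start:test_start + test_size])
        start += step

# 10 samples, window of 4, horizon of 2, gap of 1, sliding by 1 each fold:
for train, test in sliding_window_splits(10, train_size=4, test_size=2, gap=1, step=1):
    print(train, test)
# [0 1 2 3] [5 6]
# [1 2 3 4] [6 7]
# [2 3 4 5] [7 8]
# [3 4 5 6] [8 9]
```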
@jnothman @amueller As some time has elapsed since we last discussed this change, these changes implement the basic functionality necessary for walk-forward cross validation (#14376). However, I do agree with the idea that…
I suspect n_splits, in the sense of "number of train/test pairs to generate", remains a useful parameter for the user to control. But if it gets in the way of usability, we could consider deprecating it.
Thanks @kykosic, I do appreciate this as a minimal change, and am more or less ready to approve it. I just want to ensure that it's enough of a change to be useful.
Requested changes made. I'm not sure what to do about the codecov checks failing after merging master into this branch... EDIT: Latest master commit fixed this apparently.
Any news on this? I wanted to contribute to add exactly this feature, but it was a pleasant surprise to see it already written in a PR :)
Yeah, it's just kinda been hanging in limbo for a year now; hopefully we can get around to merging it, as I keep copying and pasting this class into a lot of projects I work on 😃
An elegant and much-needed addition! I was also looking to contribute like @pepicello, but the problem seems to have been solved. Any idea on when this will be merged?
Yes, I am happy with this as an immediate enhancement. Sorry for the slow reviews.
Please add an entry to the change log at `doc/whats_new/v0.24.rst`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.
    next(splits)


def test_time_series_gap():
These `test_time_series_test_size` and `test_time_series_gap` tests can be combined with `pytest.mark.parametrize`. The two error cases can be in their own test as well. This is not a blocker and can be done in a follow-up PR.
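Something along these lines could work (a sketch only; the expected index lists below were worked out by hand for a 10-sample array and are illustrative rather than copied from the existing tests):

```python
import numpy as np
import pytest
from sklearn.model_selection import TimeSeriesSplit

@pytest.mark.parametrize(
    "kwargs, expected",
    [
        # test_size only
        (dict(n_splits=3, test_size=2),
         [([0, 1, 2, 3], [4, 5]),
          ([0, 1, 2, 3, 4, 5], [6, 7]),
          ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])]),
        # test_size combined with gap
        (dict(n_splits=2, test_size=3, gap=2),
         [([0, 1], [4, 5, 6]),
          ([0, 1, 2, 3, 4], [7, 8, 9])]),
    ],
)
def test_time_series_split_parametrized(kwargs, expected):
    X = np.zeros((10, 1))
    splits = list(TimeSeriesSplit(**kwargs).split(X))
    assert len(splits) == len(expected)
    for (train, test), (exp_train, exp_test) in zip(splits, expected):
        np.testing.assert_array_equal(train, exp_train)
        np.testing.assert_array_equal(test, exp_test)
```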
I would be happy to implement this suggestion in a follow-up PR.
A few more comments. Otherwise LGTM.
Thank you @kykosic!
Almost exactly 10,000 issues after the first walk-forward issue in 2014! #3202 🍾 Thank you for contributing! 🙏
Reference Issues/PRs
Related to #6322
Resolves #14376
What does this implement/fix? Explain your changes.
There are two parameters being added to `TimeSeriesSplit` which make it more flexible, particularly when working with financial data.

1. `test_size` : int, optional

By default, `TimeSeriesSplit` divides the data into `n_splits + 1` folds, and the size of each split's test set is `n_samples / (n_splits + 1)`. However, it is sometimes useful to shift the balance between train and test data when doing cross validation relative to domain use cases.

Example use case: I have 4 years (4 * 252 trading days) of stock price data. I wish to run cross-validation with 4 splits. Each split I want to train on 2 years (504 samples) and test on 6 months (126 samples). Currently, using `TimeSeriesSplit(n_splits=4, max_train_size=504)` will only yield test sets of exactly 201 samples, and train sets starting at 204 samples and increasing up to 504.

The optional `test_size` parameter allows me to control the number of samples used for each test split and use all remaining samples for training. See the code docstring for a more visual example. `test_size=None` by default preserves the current functionality.
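A sketch of that use case with the parameter added by this PR (sample counts as in the example above; the printed sizes were worked out by hand, so treat them as illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((4 * 252, 5))  # 4 years of daily features

# Train on the most recent 504 samples and test on the following 126, 4 times.
tscv = TimeSeriesSplit(n_splits=4, test_size=126, max_train_size=504)
for train_idx, test_idx in tscv.split(X):
    print(len(train_idx), len(test_idx))  # 504 126 for every split
# Without test_size, each test set would instead hold
# n_samples // (n_splits + 1) = 201 samples.
```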
2. `gap` : int, default=0

Currently each train/test set is adjacent to the next in the sequence. However, for some use cases it may be important to remove "gap samples" between each train set and test set when look-ahead bias can result from labeling the data.

Example use case: I am building a classifier to buy/sell stocks. I have some input features for each day, and I label my data by looking ahead 2 days and seeing if the stock price increases or decreases. Since I am looking ahead 2 days to create my data labels, I need to insert a 2-day "gap" between my train and test sets for cross validation to prevent look-ahead bias in my validation score.

If I label my data by looking ahead 2 days, then this is not representative of real-life training: I would not have those 2 extra days to label my data if I were training a model today. I need to remove the last two elements of each training set to get an authentic cross validation score.
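A toy-sized sketch of that behaviour using the `gap` parameter added here (indices worked out by hand, so treat them as illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

# Labels look ahead 2 steps, so leave a 2-sample gap between train and test.
tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
for train_idx, test_idx in tscv.split(X):
    print(train_idx, test_idx)
# [0 1] [4 5]
# [0 1 2 3] [6 7]
# [0 1 2 3 4 5] [8 9]
```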
The default value of `gap=0` maintains the current functionality.

Any other comments?