-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Add TimeSeriesCV and HomogeneousTimeSeriesCV #6351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add TimeSeriesCV and HomogeneousTimeSeriesCV #6351
Conversation
75e7401
to
04b9e79
Compare
@@ -637,6 +637,121 @@ def split(self, X, y, labels=None): | |||
""" | |||
return super(StratifiedKFold, self).split(X, y, labels) | |||
|
|||
class HomogeneousTimeSeriesCV(_BaseKFold): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is convenient to make HomogeneousTimeSeriesCV
subclass _BaseKFold
since __init__
in _BaseKFold
can help HomogeneousTimeSeriesCV
to check whether input parameter n_folds
is valid.
However, in order to make HomogeneousTimeSeriesCV
work, HomogeneousTimeSeriesCV
needs to override function split
which is defined in its superclass _BaseKFold
and its super-superclass BaseCrossValidator
due to their implementation detail of split
.
I don't think override split
in HomogeneousTimeSeriesCV
is a good solution since other subclasses of _BaseKFold
didn't override split
but override _iter_test_indices
and _iter_test_masks
instead, can @rvraghav93 provide some suggestions on this?
73f5661
to
62bafb4
Compare
62bafb4
to
78c3dcd
Compare
Please split the PR into two for now. Will be easier to review. |
|
||
Notes | ||
----- | ||
The first ``n_samples % n_folds`` folds have size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This note is confusing, to say the least.
You should mention that for the first n_samples % n_folds
, the number of samples in each fold are "incremented" by n_samples // n_folds + 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry maybe I'm too dumb.
the number of samples in each fold are "incremented" by n_samples // n_folds + 1
Why is "incremented" used here?
I think the number of samples in the first n_samples % n_folds
folds is exactly n_samples // n_folds + 1
,
which is "incremented" by 1 compared to other folds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is related to the prev comment. Sorry I meant in "each split the number of samples are incremented by"
You can move this to the n_folds
documentation to avoid such confusion
Just gave a first pass. I also agree that this would be highly useful!! |
@MechCoder Thanks! |
Close this PR since I plan to separate this into two PRs:
|
This PR is an implementation of time series cross validation.
Based on #6322 , there are basically two cases:
For brevity,
I'll use HomoTSCV to represent homogeneous time series and HeteroTSCV to represent heterogeneous time series in the following checklist.
Check List:
Reference:
Using k-fold cross-validation for time-series model selection