[WIP] Add TimeSeriesCV and HomogeneousTimeSeriesCV #6351
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -637,6 +637,121 @@ def split(self, X, y, labels=None): | |
""" | ||
return super(StratifiedKFold, self).split(X, y, labels) | ||
|
||
class HomogeneousTimeSeriesCV(_BaseKFold): | ||
"""Homogeneous Time Series cross-validator | ||
|
||
Provides train/test indices to split time series data in train/test sets. | ||
|
||
This cross-validation object is a variation of KFold. | ||
In iteration k, it returns first k folds as train set and k+1 fold as | ||
test set. | ||
|
||
Read more in the :ref:`User Guide <cross_validation>`. | ||
|
||
Parameters | ||
---------- | ||
n_folds : int, default=3 | ||
Number of folds. Must be at least 2. | ||
|
||
Examples | ||
-------- | ||
>>> import numpy as np | ||
>>> from sklearn.model_selection import HomogeneousTimeSeriesCV | ||
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]]) | ||
>>> y = np.array([1, 2, 3, 4]) | ||
>>> htscv = HomogeneousTimeSeriesCV(n_folds=4) | ||
It seems a bit odd to me that Any specific reason to do this?

If we split data into

It is also mentioned in the following blog:
http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/

Maybe I should explain more in the doc?

No, I understand that. I always view Maybe @jnothman @rvraghav93 ? |
||
>>> htscv.get_n_splits(X) | ||
3 | ||
>>> print(htscv) # doctest: +NORMALIZE_WHITESPACE | ||
HomogeneousTimeSeriesCV(n_folds=4) | ||
>>> for train_index, test_index in htscv.split(X): | ||
... print("TRAIN:", train_index, "TEST:", test_index) | ||
... X_train, X_test = X[train_index], X[test_index] | ||
... y_train, y_test = y[train_index], y[test_index] | ||
TRAIN: [0] TEST: [1] | ||
TRAIN: [0 1] TEST: [2] | ||
TRAIN: [0 1 2] TEST: [3] | ||
|
||
Notes | ||
----- | ||
The first ``n_samples % n_folds`` folds have size | ||
This note is confusing, to say the least. You should mention that for the first

Sorry, maybe I'm too dumb. Why is "incremented" used here?

Oh, this is related to the prev comment. Sorry, I meant in "each split the number of samples are incremented by" You can move this to the |
||
``n_samples // n_folds + 1``, other folds have size | ||
``n_samples // n_folds``, where ``n_samples`` is the number of samples. | ||
|
||
Number of splitting iterations in this cross-validator, n_folds-1, | ||
is not equal to other KFold based cross-validators'. | ||
|
||
See also | ||
-------- | ||
""" | ||
def __init__(self, n_folds=3): | ||
super(HomogeneousTimeSeriesCV, self).__init__(n_folds, | ||
shuffle=False, | ||
random_state=None) | ||
|
||
def split(self, X, y=None, labels=None): | ||
"""Generate indices to split data into training and test set. | ||
|
||
Parameters | ||
---------- | ||
X : array-like, shape (n_samples, n_features) | ||
Training data, where n_samples is the number of samples | ||
and n_features is the number of features. | ||
|
||
y : array-like, shape (n_samples,) | ||
The target variable for supervised learning problems. | ||
|
||
labels : array-like, with shape (n_samples,), optional | ||
Group labels for the samples used while splitting the dataset into | ||
train/test set. | ||
|
||
Returns | ||
------- | ||
train : ndarray | ||
The training set indices for that split. | ||
|
||
test : ndarray | ||
The testing set indices for that split. | ||
""" | ||
X, y, labels = indexable(X, y, labels) | ||
n_samples = _num_samples(X) | ||
if self.n_folds > n_samples: | ||
raise ValueError( | ||
("Cannot have number of folds n_folds={0} greater" | ||
" than the number of samples: {1}.").format(self.n_folds, | ||
n_samples)) | ||
n_folds = self.n_folds | ||
indices = np.arange(n_samples) | ||
fold_sizes = (n_samples // n_folds) * np.ones(n_folds, dtype=int) | ||
fold_sizes[:n_samples % n_folds] += 1 | ||
current = 0 | ||
for fold_size in fold_sizes: | ||
start, stop = current, current + fold_size | ||
if current != 0: | ||
yield indices[:start], indices[start:stop] | ||
current = stop | ||
|
||
def get_n_splits(self, X=None, y=None, labels=None): | ||
"""Returns the number of splitting iterations in the cross-validator | ||
|
||
Parameters | ||
---------- | ||
X : object | ||
Always ignored, exists for compatibility. | ||
|
||
y : object | ||
Always ignored, exists for compatibility. | ||
|
||
labels : object | ||
Always ignored, exists for compatibility. | ||
|
||
Returns | ||
------- | ||
n_splits : int | ||
Returns the number of splitting iterations in the cross-validator. | ||
""" | ||
return self.n_folds - 1 | ||
|
||
|
||
class LeaveOneLabelOut(BaseCrossValidator): | ||
"""Leave One Label Out cross-validator | ||
|
||
|
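The split logic in the diff above can be tried outside of scikit-learn. The following is a minimal sketch, assuming only NumPy; `forward_chaining_splits` is a hypothetical helper name used for illustration and is not part of the PR or of scikit-learn. It mirrors the fold-size computation and the "skip the first fold" rule from `HomogeneousTimeSeriesCV.split`.

```python
import numpy as np


def forward_chaining_splits(n_samples, n_folds):
    """Yield (train, test) index arrays; hypothetical helper mirroring the diff."""
    if n_folds > n_samples:
        raise ValueError(
            "Cannot have number of folds n_folds={0} greater than the "
            "number of samples: {1}.".format(n_folds, n_samples))
    indices = np.arange(n_samples)
    # The first n_samples % n_folds folds get one extra sample.
    fold_sizes = np.full(n_folds, n_samples // n_folds, dtype=int)
    fold_sizes[:n_samples % n_folds] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        # The first fold has no earlier data to train on, so it is never a test set.
        if start != 0:
            yield indices[:start], indices[start:stop]
        current = stop


# Same data size and n_folds as the docstring example: 4 samples, 4 folds.
splits = [(train.tolist(), test.tolist())
          for train, test in forward_chaining_splits(4, 4)]
for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```

With 4 samples and 4 folds this yields 3 splits, which is why `get_n_splits` in the diff returns `n_folds - 1`.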
It is convenient to make `HomogeneousTimeSeriesCV` subclass `_BaseKFold`, since `__init__` in `_BaseKFold` can help `HomogeneousTimeSeriesCV` check whether the input parameter `n_folds` is valid.

However, to make `HomogeneousTimeSeriesCV` work, it needs to override the function `split`, which is defined in its superclass `_BaseKFold` and its super-superclass `BaseCrossValidator`, because of how they implement `split`.

I don't think overriding `split` in `HomogeneousTimeSeriesCV` is a good solution, since other subclasses of `_BaseKFold` don't override `split` but override `_iter_test_indices` and `_iter_test_masks` instead. Can @rvraghav93 provide some suggestions on this?