[WIP] Add TimeSeriesCV and HomogeneousTimeSeriesCV by yenchenlin · Pull Request #6351 · scikit-learn/scikit-learn · GitHub

Closed
2 changes: 2 additions & 0 deletions sklearn/model_selection/__init__.py
@@ -2,6 +2,7 @@
from ._split import KFold
from ._split import LabelKFold
from ._split import StratifiedKFold
from ._split import HomogeneousTimeSeriesCV
from ._split import LeaveOneLabelOut
from ._split import LeaveOneOut
from ._split import LeavePLabelOut
@@ -27,6 +28,7 @@

__all__ = ('BaseCrossValidator',
'GridSearchCV',
'HomogeneousTimeSeriesCV',
'KFold',
'LabelKFold',
'LabelShuffleSplit',
115 changes: 115 additions & 0 deletions sklearn/model_selection/_split.py
@@ -637,6 +637,121 @@ def split(self, X, y, labels=None):
"""
return super(StratifiedKFold, self).split(X, y, labels)

class HomogeneousTimeSeriesCV(_BaseKFold):
Contributor Author:

It is convenient to make HomogeneousTimeSeriesCV a subclass of _BaseKFold, since _BaseKFold.__init__ already validates the n_folds parameter.

However, to make HomogeneousTimeSeriesCV work, it has to override the split method defined in its superclass _BaseKFold and its super-superclass BaseCrossValidator, because of how those classes implement split.

I don't think overriding split in HomogeneousTimeSeriesCV is a good solution, since the other subclasses of _BaseKFold don't override split but instead override _iter_test_indices or _iter_test_masks. Can @rvraghav93 provide some suggestions on this?
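A minimal sketch of the problem being described (hypothetical class name, not the actual sklearn source): a `_BaseKFold`-style base class derives the train set as the complement of each test fold, whereas a time-series splitter needs the train set to contain only the indices that precede the test fold.

```python
import numpy as np


class SketchBaseKFold:
    """Simplified stand-in for _BaseKFold (for illustration only)."""

    def __init__(self, n_folds=3):
        self.n_folds = n_folds

    def _iter_test_indices(self, X):
        # Contiguous folds; the first n_samples % n_folds folds get one
        # extra sample, mirroring the KFold sizing scheme.
        n_samples = len(X)
        fold_sizes = np.full(self.n_folds, n_samples // self.n_folds,
                             dtype=int)
        fold_sizes[:n_samples % self.n_folds] += 1
        current = 0
        for size in fold_sizes:
            yield np.arange(current, current + size)
            current += size

    def split(self, X):
        # Base-class contract: train = everything NOT in the test fold.
        # A time-series CV instead needs train = everything BEFORE the
        # test fold, which is why overriding _iter_test_indices alone
        # is not enough here.
        indices = np.arange(len(X))
        for test_index in self._iter_test_indices(X):
            yield np.setdiff1d(indices, test_index), test_index
```

With 6 samples and 3 folds, the first split yields train=[2 3 4 5], test=[0 1]: samples from the future leak into the train set, which is exactly what a time-series splitter must avoid.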

"""Homogeneous Time Series cross-validator

Provides train/test indices to split time series data in train/test sets.

This cross-validation object is a variation of KFold.
In the kth iteration, it returns the first k folds as the train set and
the (k+1)th fold as the test set.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
n_folds : int, default=3
Number of folds. Must be at least 2.

Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import HomogeneousTimeSeriesCV
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> htscv = HomogeneousTimeSeriesCV(n_folds=4)
Member:

It seems a bit odd to me that n_folds actually yields n_folds - 1 splits.

Any specific reason to do this?

Contributor Author:

If we split the data into n_folds folds, there are actually only n_folds - 1 splits for TimeSeriesCV, since the first fold can never be a test fold.

Contributor Author:

It is also mentioned in the following blog:

Split data in train and test set given a Date (i.e. the test set is what happens after 2 April 2014, included).
Split the train set (i.e. what happens before 2 April 2014, not included) into, for example, 10 consecutive time folds.
Then, in order not to lose the time information, perform the following steps:
Train on fold 1 –> Test on fold 2
Train on fold 1+2 –> Test on fold 3
Train on fold 1+2+3 –> Test on fold 4
Train on fold 1+2+3+4 –> Test on fold 5
Train on fold 1+2+3+4+5 –> Test on fold 6
Train on fold 1+2+3+4+5+6 –> Test on fold 7
Train on fold 1+2+3+4+5+6+7 –> Test on fold 8
Train on fold 1+2+3+4+5+6+7+8 –> Test on fold 9
Train on fold 1+2+3+4+5+6+7+8+9 –> Test on fold 10
Compute the average of the accuracies of the 9 test folds (number of folds – 1)

http://francescopochetti.com/pythonic-cross-validation-time-series-pandas-scikit-learn/
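The expanding-window procedure described in that blog post can be sketched as follows (hypothetical helper name; fold boundaries are taken from np.linspace here rather than the remainder-based sizing this PR uses):

```python
import numpy as np


def expanding_window_splits(n_samples, n_folds):
    """Yield (train, test) index arrays: train on folds 1..k, test on fold k+1.

    Produces n_folds - 1 splits, since the first fold is never a test fold.
    """
    # n_folds + 1 evenly spaced boundaries over [0, n_samples]
    boundaries = np.linspace(0, n_samples, n_folds + 1, dtype=int)
    for k in range(1, n_folds):
        train = np.arange(0, boundaries[k])            # folds 1..k
        test = np.arange(boundaries[k], boundaries[k + 1])  # fold k+1
        yield train, test
```

For n_samples=10 and n_folds=5 this yields 4 splits, the last of which trains on folds 1+2+3+4 and tests on fold 5, matching the "number of folds – 1" count in the quote.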

Contributor Author:

Maybe I should explain more in the doc?

Member:

No, I understand that.

I always view n_folds as n_test_folds, which is 9 in your example and the same as n_splits. Not sure what the "right" thing is.

Maybe @jnothman @rvraghav93?

>>> htscv.get_n_splits(X)
3
>>> print(htscv) # doctest: +NORMALIZE_WHITESPACE
HomogeneousTimeSeriesCV(n_folds=4)
>>> for train_index, test_index in htscv.split(X):
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]

Notes
-----
The first ``n_samples % n_folds`` folds have size
Member:

This note is confusing, to say the least.

You should mention that for the first n_samples % n_folds, the number of samples in each fold are "incremented" by n_samples // n_folds + 1.

Contributor Author:

Sorry, maybe I'm too dumb.

the number of samples in each fold are "incremented" by n_samples // n_folds + 1

Why is "incremented" used here? I think the number of samples in the first n_samples % n_folds folds is exactly n_samples // n_folds + 1, which is "incremented" by 1 compared to the other folds?
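The fold sizing under discussion can be checked numerically (a standalone sketch of the arithmetic, not the PR's code path):

```python
import numpy as np

n_samples, n_folds = 7, 3
fold_sizes = np.full(n_folds, n_samples // n_folds, dtype=int)  # [2, 2, 2]
fold_sizes[:n_samples % n_folds] += 1                           # [3, 2, 2]
# The first n_samples % n_folds == 1 fold holds n_samples // n_folds + 1 == 3
# samples; every remaining fold holds n_samples // n_folds == 2, so the first
# folds are larger by exactly 1, as the author says.
print(fold_sizes)  # [3 2 2]
```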

Member:

Oh, this is related to the previous comment. Sorry, I meant "in each split the number of samples is incremented by".

You can move this to the n_folds documentation to avoid such confusion.

``n_samples // n_folds + 1``, other folds have size
``n_samples // n_folds``, where ``n_samples`` is the number of samples.

The number of splitting iterations in this cross-validator, ``n_folds - 1``,
is not equal to that of other KFold-based cross-validators.

See also
--------
"""
def __init__(self, n_folds=3):
super(HomogeneousTimeSeriesCV, self).__init__(n_folds,
shuffle=False,
random_state=None)

def split(self, X, y=None, labels=None):
"""Generate indices to split data into training and test set.

Parameters
----------
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples
and n_features is the number of features.

y : array-like, shape (n_samples,)
The target variable for supervised learning problems.

labels : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.

Returns
-------
train : ndarray
The training set indices for that split.

test : ndarray
The testing set indices for that split.
"""
X, y, labels = indexable(X, y, labels)
n_samples = _num_samples(X)
if self.n_folds > n_samples:
raise ValueError(
("Cannot have number of folds n_folds={0} greater"
" than the number of samples: {1}.").format(self.n_folds,
n_samples))
n_folds = self.n_folds
indices = np.arange(n_samples)
fold_sizes = (n_samples // n_folds) * np.ones(n_folds, dtype=int)
fold_sizes[:n_samples % n_folds] += 1
current = 0
for fold_size in fold_sizes:
start, stop = current, current + fold_size
if current != 0:
yield indices[:start], indices[start:stop]
current = stop

def get_n_splits(self, X=None, y=None, labels=None):
"""Returns the number of splitting iterations in the cross-validator

Parameters
----------
X : object
Always ignored, exists for compatibility.

y : object
Always ignored, exists for compatibility.

labels : object
Always ignored, exists for compatibility.

Returns
-------
n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
return self.n_folds - 1


class LeaveOneLabelOut(BaseCrossValidator):
"""Leave One Label Out cross-validator

34 changes: 34 additions & 0 deletions sklearn/model_selection/tests/test_split.py
@@ -30,6 +30,7 @@
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LabelKFold
from sklearn.model_selection import HomogeneousTimeSeriesCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeaveOneLabelOut
from sklearn.model_selection import LeavePOut
@@ -970,6 +971,39 @@ def test_label_kfold():
next, LabelKFold(n_folds=3).split(X, y, labels))


def test_homogeneous_time_series_cv():
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14]]

# Should fail if there are more folds than samples
assert_raises_regexp(ValueError, "Cannot have number of folds.*greater",
next,
HomogeneousTimeSeriesCV(n_folds=9).split(X))

htscv = HomogeneousTimeSeriesCV(3)

# Manually check that Homogeneous Time Series CV preserves the data
# ordering on toy datasets
splits = htscv.split(X[:-1])
train, test = next(splits)
assert_array_equal(train, [0, 1])
assert_array_equal(test, [2, 3])

train, test = next(splits)
assert_array_equal(train, [0, 1, 2, 3])
assert_array_equal(test, [4, 5])

splits = HomogeneousTimeSeriesCV(3).split(X)
train, test = next(splits)
assert_array_equal(train, [0, 1, 2])
assert_array_equal(test, [3, 4])

train, test = next(splits)
assert_array_equal(train, [0, 1, 2, 3, 4])
assert_array_equal(test, [5, 6])

# Check get_n_splits returns the number of folds - 1
assert_equal(2, htscv.get_n_splits())

def test_nested_cv():
# Test if nested cross validation works with different combinations of cv
rng = np.random.RandomState(0)