Group aware Time-based cross validation - v2 by soso-song · Pull Request #19927 · scikit-learn/scikit-learn · GitHub

Group aware Time-based cross validation - v2 #19927


Closed
wants to merge 11 commits into from
140 changes: 80 additions & 60 deletions doc/modules/cross_validation.rst
@@ -219,7 +219,7 @@ following keys -
``['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']``

``return_train_score`` is set to ``False`` by default to save computation time.
To evaluate the scores on the training set as well you need to set it to
``True``.
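
For instance, a minimal sketch using :func:`cross_validate` (the dataset and
estimator here are just for illustration)::

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> scores = cross_validate(SVC(), X, y, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_score', 'train_score']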

You may also retain the estimator fitted on each training set by setting
@@ -353,7 +353,7 @@ Example of 2-fold cross-validation on a dataset with 4 samples::
Here is a visualization of the cross-validation behavior. Note that
:class:`KFold` is not affected by classes or groups.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_006.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_004.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -509,7 +509,7 @@ Here is a usage example::
Here is a visualization of the cross-validation behavior. Note that
:class:`ShuffleSplit` is not affected by classes or groups.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_008.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_006.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -566,7 +566,7 @@ We can see that :class:`StratifiedKFold` preserves the class ratios

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_009.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_007.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -585,7 +585,7 @@ percentage for each target class as in the complete set.

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_012.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_009.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -645,58 +645,6 @@ size due to the imbalance in the data.

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_007.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%

.. _stratified_group_k_fold:

StratifiedGroupKFold
^^^^^^^^^^^^^^^^^^^^

:class:`StratifiedGroupKFold` is a cross-validation scheme that combines both
:class:`StratifiedKFold` and :class:`GroupKFold`. The idea is to try to
preserve the distribution of classes in each split while keeping each group
within a single split. That might be useful when you have an unbalanced
dataset, where using just :class:`GroupKFold` might produce skewed splits.

Example::

>>> from sklearn.model_selection import StratifiedGroupKFold
>>> X = list(range(18))
>>> y = [1] * 6 + [0] * 12
>>> groups = [1, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6]
>>> sgkf = StratifiedGroupKFold(n_splits=3)
>>> for train, test in sgkf.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[ 0 2 3 4 5 6 7 10 11 15 16 17] [ 1 8 9 12 13 14]
[ 0 1 4 5 6 7 8 9 11 12 13 14] [ 2 3 10 15 16 17]
[ 1 2 3 8 9 10 12 13 14 15 16 17] [ 0 4 5 6 7 11]

Implementation notes:

- With the current implementation, a full shuffle is not possible in most
  scenarios. When ``shuffle=True``, the following happens:

1. All groups are shuffled.
2. Groups are sorted by standard deviation of classes using stable sort.
3. Sorted groups are iterated over and assigned to folds.

That means that only groups with the same standard deviation of class
distribution will be shuffled, which might be useful when each group has only
a single class.
- The algorithm greedily assigns each group to one of ``n_splits`` test sets,
  choosing the test set that minimises the variance in class distribution
  across test sets. Group assignment proceeds from groups with highest to
  lowest variance in class frequency, i.e. large groups peaked on one or few
  classes are assigned first; a simplified sketch follows this list.
- This split is suboptimal in the sense that it might produce imbalanced splits
  even if perfect stratification is possible. If the class distributions are
  relatively close across groups, using :class:`GroupKFold` is better.
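
As an illustration only (this is not scikit-learn's actual implementation), a
simplified sketch of the greedy assignment idea might look like the following;
the helper name ``greedy_group_assignment`` is hypothetical::

    import numpy as np

    def greedy_group_assignment(y, groups, n_splits):
        # Count per-class frequencies within each group.
        classes, y_idx = np.unique(y, return_inverse=True)
        group_ids, g_idx = np.unique(groups, return_inverse=True)
        counts = np.zeros((len(group_ids), len(classes)))
        np.add.at(counts, (g_idx, y_idx), 1)
        # Visit groups from highest to lowest spread in class frequency,
        # using a stable sort as described in the notes above.
        order = np.argsort(-counts.std(axis=1), kind="stable")
        fold_counts = np.zeros((n_splits, len(classes)))
        assignment = {}
        for g in order:
            def imbalance(fold):
                trial = fold_counts.copy()
                trial[fold] += counts[g]
                # Per-class spread across test sets, summed over classes.
                return trial.std(axis=0).sum()
            best = min(range(n_splits), key=imbalance)
            fold_counts[best] += counts[g]
            assignment[group_ids[g]] = best
        return assignment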

Here is a visualization of cross-validation behavior for uneven groups:

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_005.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
@@ -785,7 +733,7 @@ Here is a usage example::

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_011.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_008.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -813,7 +761,7 @@ samples that are part of the validation set, and to -1 for all other samples.
Using cross-validation iterators to split train and test
--------------------------------------------------------

The above group cross-validation functions may also be useful for splitting a
dataset into training and testing subsets. Note that the convenience
function :func:`train_test_split` is a wrapper around :func:`ShuffleSplit`
and thus only allows for stratified splitting (using the class labels)
@@ -887,11 +835,83 @@ Example of 3-split time series cross-validation on a dataset with 6 samples::

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_013.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_010.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%

Group Time Series Split
^^^^^^^^^^^^^^^^^^^^^^^

:class:`GroupTimeSeriesSplit` combines :class:`TimeSeriesSplit` with the group awareness
of :class:`GroupKFold`. Like :class:`TimeSeriesSplit`, it returns the first :math:`k` folds
as the train set and the :math:`(k+1)` th fold as the test set.
Successive training sets are supersets of those that come before them.
Also, it adds all surplus data to the first training partition, which
is always used to train the model.
This class can be used to cross-validate time series data samples
that are observed at fixed time intervals.

The same group will not appear in two different folds (the number of
distinct groups has to be at least equal to the number of folds).

The groups should be contiguous, as below::

['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd']

Non-contiguous groups, as below, will raise an error::

['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'a', 'c', 'c', 'c', 'b', 'd', 'd']
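
A small helper (not part of this PR; the name ``groups_are_contiguous`` is
just for illustration) can check this property::

>>> import numpy as np
>>> def groups_are_contiguous(groups):
...     g = np.asarray(groups)
...     n_runs = 1 + np.count_nonzero(g[1:] != g[:-1])
...     return n_runs == len(np.unique(g))
>>> groups_are_contiguous(['a', 'a', 'b', 'b', 'c'])
True
>>> groups_are_contiguous(['a', 'b', 'a'])
False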

`GroupTimeSeriesSplit` is useful in cases where we have time series data for,
say, multiple days with multiple data points within each day.
During cross-validation we may not want the training days to be used in testing.
Here the days can act as groups to keep the training and test splits separate.

Example of 3-split time series cross-validation on a dataset with
18 samples and 4 groups::

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
... 'b', 'b', 'b', 'b', 'b',
... 'c', 'c', 'c', 'c',
... 'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
... print("TRAIN:", train_idx, "TEST:", test_idx)
... print("TRAIN GROUP:", groups[train_idx],
... "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']
TEST GROUP: ['b' 'b' 'b' 'b' 'b']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']
TEST GROUP: ['d' 'd' 'd']

Example of 2-split time series cross-validation on a dataset with
18 samples and 4 groups, using ``test_size=1``, ``max_train_size=3`` and ``gap=1``::

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
...                    'b', 'b', 'b', 'b', 'b',
...                    'c', 'c', 'c', 'c',
...                    'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=2, test_size=1, gap=1,
...                             max_train_size=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
...     print("TRAIN:", train_idx, "TEST:", test_idx)
...     print("TRAIN GROUP:", groups[train_idx],
...           "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a'] TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['d' 'd' 'd']
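
As with the other splitters, an instance can be passed as ``cv`` to the
model-evaluation helpers. A minimal sketch, assuming features ``X`` and a
target ``y`` aligned with the 18-sample ``groups`` array above::

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_val_score
>>> X = np.arange(36).reshape(18, 2)
>>> y = np.arange(18)
>>> scores = cross_val_score(LinearRegression(), X, y, groups=groups,
...                          cv=GroupTimeSeriesSplit(n_splits=3))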

A note on shuffling
===================

121 changes: 121 additions & 0 deletions doc/modules/group_time_series_split.rst
@@ -0,0 +1,121 @@

.. _GroupTimeSeriesSplit:

=================================================
sklearn.model_selection.GroupTimeSeriesSplit
=================================================
.. code-block:: python

class sklearn.model_selection.GroupTimeSeriesSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)

| *GroupTimeSeriesSplit* combines *TimeSeriesSplit* with the group awareness of *GroupKFold*.
|
| Like *TimeSeriesSplit*, it returns the first *k* folds as the train set and the *(k+1)* th fold as the test set.
|
| Since groups apply to this class, the same group will not appear in two different
  folds (the number of distinct groups has to be at least equal to the number of folds), which prevents dependent samples from the same group from leaking between train and test sets.

| All operations of this CV strategy are done at the group level.
| So all parameters, including ``n_splits``, ``test_size``, ``gap``, and ``max_train_size``, represent constraints on the number of groups.


Parameters:
-----------
| **n_splits: int, default=5**
|
| Number of splits. Must be at least 2.
|
| **max_train_size: int, default=None**
|
| Maximum number of groups for a single training set.
|
| **test_size: int, default=None**
|
| Used to limit the number of groups in the test set. Defaults to ``n_groups // (n_splits + 1)``, which is the maximum allowed value with ``gap=0``.
|
| **gap: int, default=0**
|
| Number of groups to exclude from the end of each train set before the test set.

Example 1:
----------
.. code-block:: python

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
... 'b', 'b', 'b', 'b', 'b',
... 'c', 'c', 'c', 'c',
... 'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
... print("TRAIN:", train_idx, "TEST:", test_idx)
... print("TRAIN GROUP:", groups[train_idx],
... "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']
TEST GROUP: ['b' 'b' 'b' 'b' 'b']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']
TEST GROUP: ['d' 'd' 'd']

Example 2:
----------
.. code-block:: python

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
    ...                    'b', 'b', 'b', 'b', 'b',
    ...                    'c', 'c', 'c', 'c',
    ...                    'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=2, test_size=1, gap=1,
    ...                             max_train_size=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],
    ...           "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a'] TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['d' 'd' 'd']

Methods:
--------
| **get_n_splits([X, y, groups])**
|
| Returns the number of splitting iterations in the cross-validator.
| *Parameters:*
| *X: object*
| Always ignored, exists for compatibility.
| *y: object*
| Always ignored, exists for compatibility.
| *groups: object*
| Always ignored, exists for compatibility.
| *Returns:*
| *n_splits: int*
| Returns the number of splitting iterations in the cross-validator.
|
| **split(X[, y, groups])**
|
| Generate indices to split data into training and test set by group.
| *Parameters:*
| *X : array-like of shape (n_samples, n_features)*
| Training data, where n_samples is the number of samples
| and n_features is the number of features.
| *y : array-like of shape (n_samples,)*
| Always ignored, exists for compatibility.
| *groups : array-like of shape (n_samples,)*
| Group labels for the samples used while splitting the dataset into
| train/test set.
| *Yields:*
| *train : ndarray*
| The training set indices for that split.
| *test : ndarray*
| The testing set indices for that split.
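
Usage sketch (assuming the standard splitter interface, where ``get_n_splits``
ignores its optional arguments):

.. code-block:: python

    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> gtss.get_n_splits()
    3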

6 changes: 3 additions & 3 deletions sklearn/model_selection/__init__.py
@@ -14,10 +14,10 @@
from ._split import ShuffleSplit
from ._split import GroupShuffleSplit
from ._split import StratifiedShuffleSplit
from ._split import StratifiedGroupKFold
from ._split import PredefinedSplit
from ._split import train_test_split
from ._split import check_cv
from ._split import GroupTimeSeriesSplit

from ._validation import cross_val_score
from ._validation import cross_val_predict
@@ -58,7 +58,6 @@
'RandomizedSearchCV',
'ShuffleSplit',
'StratifiedKFold',
'StratifiedGroupKFold',
'StratifiedShuffleSplit',
'check_cv',
'cross_val_predict',
@@ -68,4 +67,5 @@
'learning_curve',
'permutation_test_score',
'train_test_split',
'validation_curve']
'validation_curve',
'GroupTimeSeriesSplit']