Group aware Time-based cross validation - v2 by soso-song · Pull Request #19927 · scikit-learn/scikit-learn · GitHub

Group aware Time-based cross validation - v2 #19927


Closed
wants to merge 11 commits into from
140 changes: 80 additions & 60 deletions doc/modules/cross_validation.rst
@@ -219,7 +219,7 @@ following keys -
``['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']``

``return_train_score`` is set to ``False`` by default to save computation time.
To evaluate the scores on the training set as well you need to set it to
``True``.
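
For instance, a minimal sketch using :func:`cross_validate` (the dataset and
estimator here are just for illustration)::

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.svm import SVC
>>> X, y = load_iris(return_X_y=True)
>>> scores = cross_validate(SVC(), X, y, return_train_score=True)
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_score', 'train_score']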

You may also retain the estimator fitted on each training set by setting
@@ -353,7 +353,7 @@ Example of 2-fold cross-validation on a dataset with 4 samples::
Here is a visualization of the cross-validation behavior. Note that
:class:`KFold` is not affected by classes or groups.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_006.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_004.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -509,7 +509,7 @@ Here is a usage example::
Here is a visualization of the cross-validation behavior. Note that
:class:`ShuffleSplit` is not affected by classes or groups.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_008.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_006.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -566,7 +566,7 @@ We can see that :class:`StratifiedKFold` preserves the class ratios

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_009.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_007.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -585,7 +585,7 @@ percentage for each target class as in the complete set.

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_012.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_009.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -645,58 +645,6 @@ size due to the imbalance in the data.

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_007.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%

.. _stratified_group_k_fold:

StratifiedGroupKFold
^^^^^^^^^^^^^^^^^^^^

:class:`StratifiedGroupKFold` is a cross-validation scheme that combines both
:class:`StratifiedKFold` and :class:`GroupKFold`. The idea is to try to
preserve the distribution of classes in each split while keeping each group
within a single split. That might be useful when you have an unbalanced
dataset, where using just :class:`GroupKFold` might produce skewed splits.

Example::

>>> from sklearn.model_selection import StratifiedGroupKFold
>>> X = list(range(18))
>>> y = [1] * 6 + [0] * 12
>>> groups = [1, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6]
>>> sgkf = StratifiedGroupKFold(n_splits=3)
>>> for train, test in sgkf.split(X, y, groups=groups):
... print("%s %s" % (train, test))
[ 0 2 3 4 5 6 7 10 11 15 16 17] [ 1 8 9 12 13 14]
[ 0 1 4 5 6 7 8 9 11 12 13 14] [ 2 3 10 15 16 17]
[ 1 2 3 8 9 10 12 13 14 15 16 17] [ 0 4 5 6 7 11]

Implementation notes:

- With the current implementation, a full shuffle is not possible in most
  scenarios. When ``shuffle=True``, the following happens:

1. All groups are shuffled.
2. Groups are sorted by standard deviation of classes using stable sort.
3. Sorted groups are iterated over and assigned to folds.

That means that only groups with the same standard deviation of class
distribution will be shuffled, which might be useful when each group has only
a single class.
- The algorithm greedily assigns each group to one of ``n_splits`` test sets,
  choosing the test set that minimises the variance in class distribution
  across test sets. Group assignment proceeds from groups with highest to
  lowest variance in class frequency, i.e. large groups peaked on one or few
  classes are assigned first; a simplified sketch follows this list.
- This split is suboptimal in the sense that it might produce imbalanced splits
  even if perfect stratification is possible. If the class distributions are
  relatively close across groups, using :class:`GroupKFold` is better.
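
As an illustration only (this is not scikit-learn's actual implementation), a
simplified sketch of the greedy assignment idea might look like the following;
the helper name ``greedy_group_assignment`` is hypothetical::

    import numpy as np

    def greedy_group_assignment(y, groups, n_splits):
        # Count per-class frequencies within each group.
        classes, y_idx = np.unique(y, return_inverse=True)
        group_ids, g_idx = np.unique(groups, return_inverse=True)
        counts = np.zeros((len(group_ids), len(classes)))
        np.add.at(counts, (g_idx, y_idx), 1)
        # Visit groups from highest to lowest spread in class frequency,
        # using a stable sort as described in the notes above.
        order = np.argsort(-counts.std(axis=1), kind="stable")
        fold_counts = np.zeros((n_splits, len(classes)))
        assignment = {}
        for g in order:
            def imbalance(fold):
                trial = fold_counts.copy()
                trial[fold] += counts[g]
                # Per-class spread across test sets, summed over classes.
                return trial.std(axis=0).sum()
            best = min(range(n_splits), key=imbalance)
            fold_counts[best] += counts[g]
            assignment[group_ids[g]] = best
        return assignment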

Here is a visualization of cross-validation behavior for uneven groups:

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_005.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
@@ -785,7 +733,7 @@ Here is a usage example::

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_011.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_008.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%
@@ -813,7 +761,7 @@ samples that are part of the validation set, and to -1 for all other samples.
Using cross-validation iterators to split train and test
--------------------------------------------------------

The above group cross-validation functions may also be useful for splitting a
dataset into training and testing subsets. Note that the convenience
function :func:`train_test_split` is a wrapper around :func:`ShuffleSplit`
and thus only allows for stratified splitting (using the class labels)
@@ -887,11 +835,83 @@ Example of 3-split time series cross-validation on a dataset with 6 samples::

Here is a visualization of the cross-validation behavior.

.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_013.png
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_010.png
:target: ../auto_examples/model_selection/plot_cv_indices.html
:align: center
:scale: 75%

Group Time Series Split
^^^^^^^^^^^^^^^^^^^^^^^

:class:`GroupTimeSeriesSplit` combines :class:`TimeSeriesSplit` with the group awareness
of :class:`GroupKFold`. Like :class:`TimeSeriesSplit`, it returns the first :math:`k` folds
as the train set and the :math:`(k+1)` th fold as the test set.
Successive training sets are supersets of those that come before them.
Also, it adds all surplus data to the first training partition, which
is always used to train the model.
This class can be used to cross-validate time series data samples
that are observed at fixed time intervals.

The same group will not appear in two different folds (the number of
distinct groups has to be at least equal to the number of folds).

The groups should be contiguous, as below::

['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd']

Non-contiguous groups, as below, will raise an error::

['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'a', 'c', 'c', 'c', 'b', 'd', 'd']
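
A small helper (not part of this PR; the name ``groups_are_contiguous`` is
just for illustration) can check this property::

>>> import numpy as np
>>> def groups_are_contiguous(groups):
...     g = np.asarray(groups)
...     n_runs = 1 + np.count_nonzero(g[1:] != g[:-1])
...     return n_runs == len(np.unique(g))
>>> groups_are_contiguous(['a', 'a', 'b', 'b', 'c'])
True
>>> groups_are_contiguous(['a', 'b', 'a'])
False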

`GroupTimeSeriesSplit` is useful in cases where we have time series data for,
say, multiple days with multiple data points within each day.
During cross-validation we may not want the training days to be used in testing.
Here the days can act as groups to keep the training and test splits separate.

Example of 3-split time series cross-validation on a dataset with
18 samples and 4 groups::

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
... 'b', 'b', 'b', 'b', 'b',
... 'c', 'c', 'c', 'c',
... 'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
... print("TRAIN:", train_idx, "TEST:", test_idx)
... print("TRAIN GROUP:", groups[train_idx],
... "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']
TEST GROUP: ['b' 'b' 'b' 'b' 'b']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']
TEST GROUP: ['d' 'd' 'd']

Example of 2-split time series cross-validation on a dataset with
18 samples and 4 groups, using ``test_size=1``, ``max_train_size=3`` and ``gap=1``::

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
...                    'b', 'b', 'b', 'b', 'b',
...                    'c', 'c', 'c', 'c',
...                    'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=2, test_size=1, gap=1,
...                             max_train_size=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
...     print("TRAIN:", train_idx, "TEST:", test_idx)
...     print("TRAIN GROUP:", groups[train_idx],
...           "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a'] TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['d' 'd' 'd']
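
As with the other splitters, an instance can be passed as ``cv`` to the
model-evaluation helpers. A minimal sketch, assuming features ``X`` and a
target ``y`` aligned with the 18-sample ``groups`` array above::

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_val_score
>>> X = np.arange(36).reshape(18, 2)
>>> y = np.arange(18)
>>> scores = cross_val_score(LinearRegression(), X, y, groups=groups,
...                          cv=GroupTimeSeriesSplit(n_splits=3))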

A note on shuffling
===================

121 changes: 121 additions & 0 deletions doc/modules/group_time_series_split.rst
@@ -0,0 +1,121 @@

.. _GroupTimeSeriesSplit:

=================================================
sklearn.model_selection.GroupTimeSeriesSplit
=================================================
.. code-block:: python

class sklearn.model_selection.GroupTimeSeriesSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)

| *GroupTimeSeriesSplit* combines *TimeSeriesSplit* with the group awareness of *GroupKFold*.
|
| Like *TimeSeriesSplit*, it returns the first *k* folds as the train set and the *(k+1)* th fold as the test set.
|
| Since groups apply to this class, the same group will not appear in two different
  folds (the number of distinct groups has to be at least equal to the number of folds), which prevents dependent samples from the same group from leaking between train and test sets.

| All operations of this CV strategy are done at the group level.
| So all parameters, including ``n_splits``, ``test_size``, ``gap``, and ``max_train_size``, represent constraints on the number of groups.


Parameters:
-----------
| **n_splits: int, default=5**
|
| Number of splits. Must be at least 2.
|
| **max_train_size: int, default=None**
|
| Maximum number of groups for a single training set.
|
| **test_size: int, default=None**
|
| Used to limit the number of groups in the test set. Defaults to ``n_groups // (n_splits + 1)``, which is the maximum allowed value with ``gap=0``.
|
| **gap: int, default=0**
|
| Number of groups to exclude from the end of each train set before the test set.

Example 1:
----------
.. code-block:: python

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
>>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
... 'b', 'b', 'b', 'b', 'b',
... 'c', 'c', 'c', 'c',
... 'd', 'd', 'd'])
>>> gtss = GroupTimeSeriesSplit(n_splits=3)
>>> for train_idx, test_idx in gtss.split(groups, groups=groups):
... print("TRAIN:", train_idx, "TEST:", test_idx)
... print("TRAIN GROUP:", groups[train_idx],
... "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [6, 7, 8, 9, 10]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a']
TEST GROUP: ['b' 'b' 'b' 'b' 'b']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c']
TEST GROUP: ['d' 'd' 'd']

Example 2:
----------
.. code-block:: python

>>> import numpy as np
>>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> groups = np.array(['a', 'a', 'a', 'a', 'a', 'a',
    ...                    'b', 'b', 'b', 'b', 'b',
    ...                    'c', 'c', 'c', 'c',
    ...                    'd', 'd', 'd'])
    >>> gtss = GroupTimeSeriesSplit(n_splits=2, test_size=1, gap=1,
    ...                             max_train_size=3)
    >>> for train_idx, test_idx in gtss.split(groups, groups=groups):
    ...     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...     print("TRAIN GROUP:", groups[train_idx],
    ...           "TEST GROUP:", groups[test_idx])
TRAIN: [0, 1, 2, 3, 4, 5] TEST: [11, 12, 13, 14]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a'] TEST GROUP: ['c' 'c' 'c' 'c']
TRAIN: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] TEST: [15, 16, 17]
TRAIN GROUP: ['a' 'a' 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'b']
TEST GROUP: ['d' 'd' 'd']

Methods:
--------
| **get_n_splits([X, y, groups])**
|
| Returns the number of splitting iterations in the cross-validator.
| *Parameters:*
| *X: object*
| Always ignored, exists for compatibility.
| *y: object*
| Always ignored, exists for compatibility.
| *groups: object*
| Always ignored, exists for compatibility.
| *Returns:*
| *n_splits: int*
| Returns the number of splitting iterations in the cross-validator.
|
| **split(X[, y, groups])**
|
| Generate indices to split data into training and test set by group.
| *Parameters:*
| *X : array-like of shape (n_samples, n_features)*
| Training data, where n_samples is the number of samples
| and n_features is the number of features.
| *y : array-like of shape (n_samples,)*
| Always ignored, exists for compatibility.
| *groups : array-like of shape (n_samples,)*
| Group labels for the samples used while splitting the dataset into
| train/test set.
| *Yields:*
| *train : ndarray*
| The training set indices for that split.
| *test : ndarray*
| The testing set indices for that split.
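
Usage sketch (assuming the standard splitter interface, where ``get_n_splits``
ignores its optional arguments):

.. code-block:: python

    >>> from sklearn.model_selection import GroupTimeSeriesSplit
    >>> gtss = GroupTimeSeriesSplit(n_splits=3)
    >>> gtss.get_n_splits()
    3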

6 changes: 3 additions & 3 deletions sklearn/model_selection/__init__.py
@@ -14,10 +14,10 @@
from ._split import ShuffleSplit
from ._split import GroupShuffleSplit
from ._split import StratifiedShuffleSplit
from ._split import StratifiedGroupKFold
from ._split import PredefinedSplit
from ._split import train_test_split
from ._split import check_cv
from ._split import GroupTimeSeriesSplit

from ._validation import cross_val_score
from ._validation import cross_val_predict
@@ -58,7 +58,6 @@
'RandomizedSearchCV',
'ShuffleSplit',
'StratifiedKFold',
'StratifiedGroupKFold',
'StratifiedShuffleSplit',
'check_cv',
'cross_val_predict',
@@ -68,4 +67,5 @@
'learning_curve',
'permutation_test_score',
'train_test_split',
'validation_curve']
'validation_curve',
'GroupTimeSeriesSplit']