Commit 36ad7b3

StefanieSenger, lucyleeow, adrinjalali, and glemaitre authored
DOC readability and clarity on permutation_test_score in userguide and example (scikit-learn#30351)
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
1 parent 5b0ca39 commit 36ad7b3

File tree

4 files changed: +62 −51 lines changed

doc/modules/cross_validation.rst

Lines changed: 31 additions & 28 deletions
@@ -947,49 +947,52 @@ Permutation test score
 ======================
 
 :func:`~sklearn.model_selection.permutation_test_score` offers another way
-to evaluate the performance of classifiers. It provides a permutation-based
-p-value, which represents how likely an observed performance of the
-classifier would be obtained by chance. The null hypothesis in this test is
-that the classifier fails to leverage any statistical dependency between the
-features and the labels to make correct predictions on left out data.
+to evaluate the performance of a :term:`predictor`. It provides a
+permutation-based p-value, which represents how likely an observed performance of the
+estimator would be obtained by chance. The null hypothesis in this test is
+that the estimator fails to leverage any statistical dependency between the
+features and the targets to make correct predictions on left-out data.
 :func:`~sklearn.model_selection.permutation_test_score` generates a null
 distribution by calculating `n_permutations` different permutations of the
-data. In each permutation the labels are randomly shuffled, thereby removing
-any dependency between the features and the labels. The p-value output
-is the fraction of permutations for which the average cross-validation score
-obtained by the model is better than the cross-validation score obtained by
-the model using the original data. For reliable results ``n_permutations``
-should typically be larger than 100 and ``cv`` between 3-10 folds.
-
-A low p-value provides evidence that the dataset contains real dependency
-between features and labels and the classifier was able to utilize this
-to obtain good results. A high p-value could be due to a lack of dependency
-between features and labels (there is no difference in feature values between
-the classes) or because the classifier was not able to use the dependency in
-the data. In the latter case, using a more appropriate classifier that
-is able to utilize the structure in the data, would result in a lower
-p-value.
-
-Cross-validation provides information about how well a classifier generalizes,
-specifically the range of expected errors of the classifier. However, a
-classifier trained on a high dimensional dataset with no structure may still
+data. In each permutation the target values are randomly shuffled, thereby removing
+any dependency between the features and the targets. The p-value output is the fraction
+of permutations whose cross-validation score is better than or equal to the true score
+without permuting targets. For reliable results ``n_permutations`` should typically be
+larger than 100 and ``cv`` between 3 and 10 folds.
+
+A low p-value provides evidence that the dataset contains some real dependency between
+features and targets **and** that the estimator was able to utilize this dependency to
+obtain good results. A high p-value, conversely, could be due to either of the following:
+
+- a lack of dependency between features and targets (i.e., there is no systematic
+  relationship and any observed patterns are likely due to random chance),
+- **or** the estimator not being able to use the dependency in the data (for
+  instance because it underfit).
+
+In the latter case, using a more appropriate estimator that is able to use the
+structure in the data would result in a lower p-value.
+
+Cross-validation provides information about how well an estimator generalizes
+by estimating the range of its expected scores. However, an
+estimator trained on a high dimensional dataset with no structure may still
 perform better than expected on cross-validation, just by chance.
 This can typically happen with small datasets with less than a few hundred
 samples.
 :func:`~sklearn.model_selection.permutation_test_score` provides information
-on whether the classifier has found a real class structure and can help in
-evaluating the performance of the classifier.
+on whether the estimator has found a real dependency between features and targets and
+can help in evaluating the performance of the estimator.
 
 It is important to note that this test has been shown to produce low
 p-values even if there is only weak structure in the data because in the
 corresponding permuted datasets there is absolutely no structure. This
-test is therefore only able to show when the model reliably outperforms
+test is therefore only able to show whether the model reliably outperforms
 random guessing.
 
 Finally, :func:`~sklearn.model_selection.permutation_test_score` is computed
 using brute force and internally fits ``(n_permutations + 1) * n_cv`` models.
 It is therefore only tractable with small datasets for which fitting an
-individual model is very fast.
+individual model is very fast. Using the `n_jobs` parameter parallelizes the
+computation and thus speeds it up.
 
 .. rubric:: Examples
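
As a quick illustration of the behavior described above (a sketch, not part of the commit; the estimator and parameter values are arbitrary choices for the example), the permutation test can be run and parallelized like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)

# permutation_test_score fits (n_permutations + 1) * n_cv models in total;
# n_jobs=-1 spreads those fits over all available CPU cores.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    n_permutations=200,
    n_jobs=-1,
    random_state=0,
)
print(f"score={score:.3f}, p-value={pvalue:.4f}")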

examples/model_selection/plot_permutation_tests_for_classification.py

Lines changed: 29 additions & 21 deletions

@@ -17,7 +17,8 @@
 # -------
 #
 # We will use the :ref:`iris_dataset`, which consists of measurements taken
-# from 3 types of irises.
+# from 3 Iris species. Our model will use the measurements to predict
+# the iris species.
 
 from sklearn.datasets import load_iris
 
@@ -26,7 +27,7 @@
 y = iris.target
 
 # %%
-# We will also generate some random feature data (i.e., 20 features),
+# For comparison, we also generate some random feature data (i.e., 20 features),
 # uncorrelated with the class labels in the iris dataset.
 
 import numpy as np
@@ -41,27 +42,28 @@
 # ----------------------
 #
 # Next, we calculate the
-# :func:`~sklearn.model_selection.permutation_test_score` using the original
-# iris dataset, which strongly predict the labels and
-# the randomly generated features and iris labels, which should have
-# no dependency between features and labels. We use the
+# :func:`~sklearn.model_selection.permutation_test_score` for both the original
+# iris dataset (where there's a strong relationship between features and labels) and
+# the randomly generated features with iris labels (where no dependency between
+# features and labels is expected). We use the
 # :class:`~sklearn.svm.SVC` classifier and :ref:`accuracy_score` to evaluate
 # the model at each round.
 #
 # :func:`~sklearn.model_selection.permutation_test_score` generates a null
 # distribution by calculating the accuracy of the classifier
 # on 1000 different permutations of the dataset, where features
-# remain the same but labels undergo different permutations. This is the
+# remain the same but labels undergo different random permutations. This is the
 # distribution for the null hypothesis which states there is no dependency
 # between the features and labels. An empirical p-value is then calculated as
-# the percentage of permutations for which the score obtained is greater
-# that the score obtained using the original data.
+# the proportion of permutations for which the score obtained by the model trained
+# on the permuted data is greater than or equal to the score obtained using the
+# original data.
 
 from sklearn.model_selection import StratifiedKFold, permutation_test_score
 from sklearn.svm import SVC
 
 clf = SVC(kernel="linear", random_state=7)
-cv = StratifiedKFold(2, shuffle=True, random_state=0)
+cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
 
 score_iris, perm_scores_iris, pvalue_iris = permutation_test_score(
     clf, X, y, scoring="accuracy", cv=cv, n_permutations=1000
@@ -77,20 +79,22 @@
 #
 # Below we plot a histogram of the permutation scores (the null
 # distribution). The red line indicates the score obtained by the classifier
-# on the original data. The score is much better than those obtained by
-# using permuted data and the p-value is thus very low. This indicates that
+# on the original data (without permuted labels). The score is much better than those
+# obtained by using permuted data and the p-value is thus very low. This indicates that
 # there is a low likelihood that this good score would be obtained by chance
 # alone. It provides evidence that the iris dataset contains real dependency
 # between features and labels and the classifier was able to utilize this
-# to obtain good results.
+# to obtain good results. The low p-value leads us to reject the null hypothesis.
 
 import matplotlib.pyplot as plt
 
 fig, ax = plt.subplots()
 
 ax.hist(perm_scores_iris, bins=20, density=True)
 ax.axvline(score_iris, ls="--", color="r")
-score_label = f"Score on original\ndata: {score_iris:.2f}\n(p-value: {pvalue_iris:.3f})"
+score_label = (
+    f"Score on original\niris data: {score_iris:.2f}\n(p-value: {pvalue_iris:.3f})"
+)
 ax.text(0.7, 10, score_label, fontsize=12)
 ax.set_xlabel("Accuracy score")
 _ = ax.set_ylabel("Probability density")
@@ -101,28 +105,32 @@
 #
 # Below we plot the null distribution for the randomized data. The permutation
 # scores are similar to those obtained using the original iris dataset
-# because the permutation always destroys any feature label dependency present.
-# The score obtained on the original randomized data in this case though, is
-# very poor. This results in a large p-value, confirming that there was no
-# feature label dependency in the original data.
+# because the permutation always destroys any feature-label dependency present.
+# The score obtained on the randomized data in this case, though, is
+# very poor. This results in a large p-value, confirming that there was no
+# feature-label dependency in the randomized data.
 
 fig, ax = plt.subplots()
 
 ax.hist(perm_scores_rand, bins=20, density=True)
 ax.set_xlim(0.13)
 ax.axvline(score_rand, ls="--", color="r")
-score_label = f"Score on original\ndata: {score_rand:.2f}\n(p-value: {pvalue_rand:.3f})"
+score_label = (
+    f"Score on original\nrandom data: {score_rand:.2f}\n(p-value: {pvalue_rand:.3f})"
+)
 ax.text(0.14, 7.5, score_label, fontsize=12)
 ax.set_xlabel("Accuracy score")
 ax.set_ylabel("Probability density")
 plt.show()
 
 # %%
-# Another possible reason for obtaining a high p-value is that the classifier
+# Another possible reason for obtaining a high p-value could be that the classifier
 # was not able to use the structure in the data. In this case, the p-value
 # would only be low for classifiers that are able to utilize the dependency
 # present. In our case above, where the data is random, all classifiers would
-# have a high p-value as there is no structure present in the data.
+# have a high p-value as there is no structure present in the data. Whether we
+# fail to reject the null hypothesis depends on whether the p-value would also be
+# high for a more appropriate estimator.
 #
 # Finally, note that this test has been shown to produce low p-values even
 # if there is only weak structure in the data [1]_.
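
For reference, a condensed, self-contained version of the example's comparison could look like the sketch below (not part of the commit; it uses 100 permutations instead of the example's 1000 to keep the runtime short, so the exact numbers will differ):

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# 20 random features, uncorrelated with the iris labels.
rng = np.random.RandomState(seed=0)
X_rand = rng.normal(size=(X.shape[0], 20))

clf = SVC(kernel="linear", random_state=7)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for name, features in [("iris", X), ("random", X_rand)]:
    score, _, pvalue = permutation_test_score(
        clf, features, y, scoring="accuracy", cv=cv, n_permutations=100
    )
    # Expect a high score and low p-value for iris, the reverse for random.
    print(f"{name}: score={score:.2f}, p-value={pvalue:.3f}")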

sklearn/model_selection/_split.py

Lines changed: 1 addition & 1 deletion
@@ -1662,7 +1662,7 @@ def __repr__(self):
 class RepeatedKFold(_UnsupportedGroupCVMixin, _RepeatedSplits):
     """Repeated K-Fold cross validator.
 
-    Repeats K-Fold n times with different randomization in each repetition.
+    Repeats K-Fold `n_repeats` times with different randomization in each repetition.
 
     Read more in the :ref:`User Guide <repeated_k_fold>`.
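
A tiny sketch (not part of the commit) of what the clarified docstring describes: RepeatedKFold yields n_splits * n_repeats train/test splits in total.

import numpy as np

from sklearn.model_selection import RepeatedKFold

X = np.arange(8).reshape(4, 2)

# 2 folds repeated 3 times => 2 * 3 = 6 splits, each repetition reshuffled.
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
print(rkf.get_n_splits(X))  # 6
for train_index, test_index in rkf.split(X):
    print("train:", train_index, "test:", test_index)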

sklearn/model_selection/_validation.py

Lines changed: 1 addition & 1 deletion
@@ -1487,7 +1487,7 @@ def permutation_test_score(
     independent.
 
     The p-value represents the fraction of randomized data sets where the
-    estimator performed as well or better than in the original data. A small
+    estimator performed as well or better than on the original data. A small
     p-value suggests that there is a real dependency between features and
     targets which has been used by the estimator to give good predictions.
     A large p-value may be due to lack of real dependency between features
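
That fraction maps directly onto the returned p-value: per the permutation_test_score documentation it is computed as (C + 1) / (n_permutations + 1), where C counts the permutations scoring at least as well as the original data. A small sketch (not part of the commit, with iris as a stand-in dataset) recomputes it:

import numpy as np

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, n_permutations=100, random_state=0
)

# C = number of permutations that scored as well as or better than the
# unpermuted data; the "+ 1" terms count the unpermuted dataset itself.
c = np.sum(perm_scores >= score)
print(f"reported p-value:  {pvalue:.4f}")
print(f"recomputed value:  {(c + 1) / (len(perm_scores) + 1):.4f}")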

0 commit comments