Commit 36ad7b3

StefanieSenger, lucyleeow, adrinjalali, and glemaitre authored
DOC readability and clarity on permutation_test_score in userguide and example (scikit-learn#30351)
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
1 parent 5b0ca39 commit 36ad7b3

File tree

4 files changed: +62 −51 lines changed

doc/modules/cross_validation.rst

Lines changed: 31 additions & 28 deletions
@@ -947,49 +947,52 @@ Permutation test score
 ======================
 
 :func:`~sklearn.model_selection.permutation_test_score` offers another way
-to evaluate the performance of classifiers. It provides a permutation-based
-p-value, which represents how likely an observed performance of the
-classifier would be obtained by chance. The null hypothesis in this test is
-that the classifier fails to leverage any statistical dependency between the
-features and the labels to make correct predictions on left out data.
+to evaluate the performance of a :term:`predictor`. It provides a
+permutation-based p-value, which represents how likely an observed performance of the
+estimator would be obtained by chance. The null hypothesis in this test is
+that the estimator fails to leverage any statistical dependency between the
+features and the targets to make correct predictions on left-out data.
 :func:`~sklearn.model_selection.permutation_test_score` generates a null
 distribution by calculating `n_permutations` different permutations of the
-data. In each permutation the labels are randomly shuffled, thereby removing
-any dependency between the features and the labels. The p-value output
-is the fraction of permutations for which the average cross-validation score
-obtained by the model is better than the cross-validation score obtained by
-the model using the original data. For reliable results ``n_permutations``
-should typically be larger than 100 and ``cv`` between 3-10 folds.
-
-A low p-value provides evidence that the dataset contains real dependency
-between features and labels and the classifier was able to utilize this
-to obtain good results. A high p-value could be due to a lack of dependency
-between features and labels (there is no difference in feature values between
-the classes) or because the classifier was not able to use the dependency in
-the data. In the latter case, using a more appropriate classifier that
-is able to utilize the structure in the data, would result in a lower
-p-value.
-
-Cross-validation provides information about how well a classifier generalizes,
-specifically the range of expected errors of the classifier. However, a
-classifier trained on a high dimensional dataset with no structure may still
+data. In each permutation the target values are randomly shuffled, thereby removing
+any dependency between the features and the targets. The p-value output is the fraction
+of permutations whose cross-validation score is better than or equal to the true score
+without permuting targets. For reliable results ``n_permutations`` should typically be
+larger than 100 and ``cv`` between 3 and 10 folds.
+
+A low p-value provides evidence that the dataset contains some real dependency between
+features and targets **and** that the estimator was able to utilize this dependency to
+obtain good results. A high p-value, conversely, could be due to either of the following:
+
+- a lack of dependency between features and targets (i.e., there is no systematic
+  relationship and any observed patterns are likely due to random chance),
+- **or** the estimator not being able to use the dependency in the data (for
+  instance because it underfit).
+
+In the latter case, using a more appropriate estimator that is able to use the
+structure in the data would result in a lower p-value.
+
+Cross-validation provides information about how well an estimator generalizes
+by estimating the range of its expected scores. However, an
+estimator trained on a high dimensional dataset with no structure may still
 perform better than expected on cross-validation, just by chance.
 This can typically happen with small datasets with less than a few hundred
 samples.
 :func:`~sklearn.model_selection.permutation_test_score` provides information
-on whether the classifier has found a real class structure and can help in
-evaluating the performance of the classifier.
+on whether the estimator has found a real dependency between features and targets and
+can help in evaluating the performance of the estimator.
 
 It is important to note that this test has been shown to produce low
 p-values even if there is only weak structure in the data because in the
 corresponding permuted datasets there is absolutely no structure. This
-test is therefore only able to show when the model reliably outperforms
+test is therefore only able to show whether the model reliably outperforms
 random guessing.
 
 Finally, :func:`~sklearn.model_selection.permutation_test_score` is computed
 using brute force and internally fits ``(n_permutations + 1) * n_cv`` models.
 It is therefore only tractable with small datasets for which fitting an
-individual model is very fast.
+individual model is very fast. Using the `n_jobs` parameter parallelizes the
+computation and thus speeds it up.
 
 .. rubric:: Examples
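
As a quick illustration of the behavior described above (a sketch, not part of the commit; the estimator and parameter values are arbitrary choices for the example), the permutation test can be run and parallelized like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)

# permutation_test_score fits (n_permutations + 1) * n_cv models in total;
# n_jobs=-1 spreads those fits over all available CPU cores.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    n_permutations=200,
    n_jobs=-1,
    random_state=0,
)
print(f"score={score:.3f}, p-value={pvalue:.4f}")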

examples/model_selection/plot_permutation_tests_for_classification.py

Lines changed: 29 additions & 21 deletions

@@ -17,7 +17,8 @@
 # -------
 #
 # We will use the :ref:`iris_dataset`, which consists of measurements taken
-# from 3 types of irises.
+# from 3 Iris species. Our model will use the measurements to predict
+# the iris species.
 
 from sklearn.datasets import load_iris
 
@@ -26,7 +27,7 @@
 y = iris.target
 
 # %%
-# We will also generate some random feature data (i.e., 20 features),
+# For comparison, we also generate some random feature data (i.e., 20 features),
 # uncorrelated with the class labels in the iris dataset.
 
 import numpy as np
@@ -41,27 +42,28 @@
 # ----------------------
 #
 # Next, we calculate the
-# :func:`~sklearn.model_selection.permutation_test_score` using the original
-# iris dataset, which strongly predict the labels and
-# the randomly generated features and iris labels, which should have
-# no dependency between features and labels. We use the
+# :func:`~sklearn.model_selection.permutation_test_score` for both the original
+# iris dataset (where there's a strong relationship between features and labels) and
+# the randomly generated features with iris labels (where no dependency between
+# features and labels is expected). We use the
 # :class:`~sklearn.svm.SVC` classifier and :ref:`accuracy_score` to evaluate
 # the model at each round.
 #
 # :func:`~sklearn.model_selection.permutation_test_score` generates a null
 # distribution by calculating the accuracy of the classifier
 # on 1000 different permutations of the dataset, where features
-# remain the same but labels undergo different permutations. This is the
+# remain the same but labels undergo different random permutations. This is the
 # distribution for the null hypothesis which states there is no dependency
 # between the features and labels. An empirical p-value is then calculated as
-# the percentage of permutations for which the score obtained is greater
-# that the score obtained using the original data.
+# the proportion of permutations for which the score obtained by the model trained
+# on the permuted data is greater than or equal to the score obtained using the
+# original data.
 
 from sklearn.model_selection import StratifiedKFold, permutation_test_score
 from sklearn.svm import SVC
 
 clf = SVC(kernel="linear", random_state=7)
-cv = StratifiedKFold(2, shuffle=True, random_state=0)
+cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
 
 score_iris, perm_scores_iris, pvalue_iris = permutation_test_score(
     clf, X, y, scoring="accuracy", cv=cv, n_permutations=1000
@@ -77,20 +79,22 @@
 #
 # Below we plot a histogram of the permutation scores (the null
 # distribution). The red line indicates the score obtained by the classifier
-# on the original data. The score is much better than those obtained by
-# using permuted data and the p-value is thus very low. This indicates that
+# on the original data (without permuted labels). The score is much better than those
+# obtained by using permuted data and the p-value is thus very low. This indicates that
 # there is a low likelihood that this good score would be obtained by chance
 # alone. It provides evidence that the iris dataset contains real dependency
 # between features and labels and the classifier was able to utilize this
-# to obtain good results.
+# to obtain good results. The low p-value leads us to reject the null hypothesis.
 
 import matplotlib.pyplot as plt
 
 fig, ax = plt.subplots()
 
 ax.hist(perm_scores_iris, bins=20, density=True)
 ax.axvline(score_iris, ls="--", color="r")
-score_label = f"Score on original\ndata: {score_iris:.2f}\n(p-value: {pvalue_iris:.3f})"
+score_label = (
+    f"Score on original\niris data: {score_iris:.2f}\n(p-value: {pvalue_iris:.3f})"
+)
 ax.text(0.7, 10, score_label, fontsize=12)
 ax.set_xlabel("Accuracy score")
 _ = ax.set_ylabel("Probability density")
@@ -101,28 +105,32 @@
 #
 # Below we plot the null distribution for the randomized data. The permutation
 # scores are similar to those obtained using the original iris dataset
-# because the permutation always destroys any feature label dependency present.
-# The score obtained on the original randomized data in this case though, is
-# very poor. This results in a large p-value, confirming that there was no
-# feature label dependency in the original data.
+# because the permutation always destroys any feature-label dependency present.
+# The score obtained on the randomized data in this case, though, is
+# very poor. This results in a large p-value, confirming that there was no
+# feature-label dependency in the randomized data.
 
 fig, ax = plt.subplots()
 
 ax.hist(perm_scores_rand, bins=20, density=True)
 ax.set_xlim(0.13)
 ax.axvline(score_rand, ls="--", color="r")
-score_label = f"Score on original\ndata: {score_rand:.2f}\n(p-value: {pvalue_rand:.3f})"
+score_label = (
+    f"Score on original\nrandom data: {score_rand:.2f}\n(p-value: {pvalue_rand:.3f})"
+)
 ax.text(0.14, 7.5, score_label, fontsize=12)
 ax.set_xlabel("Accuracy score")
 ax.set_ylabel("Probability density")
 plt.show()
 
 # %%
-# Another possible reason for obtaining a high p-value is that the classifier
+# Another possible reason for obtaining a high p-value could be that the classifier
 # was not able to use the structure in the data. In this case, the p-value
 # would only be low for classifiers that are able to utilize the dependency
 # present. In our case above, where the data is random, all classifiers would
-# have a high p-value as there is no structure present in the data.
+# have a high p-value as there is no structure present in the data. Whether we
+# fail to reject the null hypothesis depends on whether the p-value would also be
+# high for a more appropriate estimator.
 #
 # Finally, note that this test has been shown to produce low p-values even
 # if there is only weak structure in the data [1]_.
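
For reference, a condensed, self-contained version of the example's comparison could look like the sketch below (not part of the commit; it uses 100 permutations instead of the example's 1000 to keep the runtime short, so the exact numbers will differ):

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# 20 random features, uncorrelated with the iris labels.
rng = np.random.RandomState(seed=0)
X_rand = rng.normal(size=(X.shape[0], 20))

clf = SVC(kernel="linear", random_state=7)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for name, features in [("iris", X), ("random", X_rand)]:
    score, _, pvalue = permutation_test_score(
        clf, features, y, scoring="accuracy", cv=cv, n_permutations=100
    )
    # Expect a high score and low p-value for iris, the reverse for random.
    print(f"{name}: score={score:.2f}, p-value={pvalue:.3f}")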

sklearn/model_selection/_split.py

Lines changed: 1 addition & 1 deletion
@@ -1662,7 +1662,7 @@ def __repr__(self):
 class RepeatedKFold(_UnsupportedGroupCVMixin, _RepeatedSplits):
     """Repeated K-Fold cross validator.
 
-    Repeats K-Fold n times with different randomization in each repetition.
+    Repeats K-Fold `n_repeats` times with different randomization in each repetition.
 
     Read more in the :ref:`User Guide <repeated_k_fold>`.
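
A tiny sketch (not part of the commit) of what the clarified docstring describes: RepeatedKFold yields n_splits * n_repeats train/test splits in total.

import numpy as np

from sklearn.model_selection import RepeatedKFold

X = np.arange(8).reshape(4, 2)

# 2 folds repeated 3 times => 2 * 3 = 6 splits, each repetition reshuffled.
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
print(rkf.get_n_splits(X))  # 6
for train_index, test_index in rkf.split(X):
    print("train:", train_index, "test:", test_index)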

sklearn/model_selection/_validation.py

Lines changed: 1 addition & 1 deletion
@@ -1487,7 +1487,7 @@ def permutation_test_score(
     independent.
 
     The p-value represents the fraction of randomized data sets where the
-    estimator performed as well or better than in the original data. A small
+    estimator performed as well or better than on the original data. A small
     p-value suggests that there is a real dependency between features and
     targets which has been used by the estimator to give good predictions.
     A large p-value may be due to lack of real dependency between features
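
That fraction maps directly onto the returned p-value: per the permutation_test_score documentation it is computed as (C + 1) / (n_permutations + 1), where C counts the permutations scoring at least as well as the original data. A small sketch (not part of the commit, with iris as a stand-in dataset) recomputes it:

import numpy as np

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, n_permutations=100, random_state=0
)

# C = number of permutations that scored as well as or better than the
# unpermuted data; the "+ 1" terms count the unpermuted dataset itself.
c = np.sum(perm_scores >= score)
print(f"reported p-value:  {pvalue:.4f}")
print(f"recomputed value:  {(c + 1) / (len(perm_scores) + 1):.4f}")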

0 commit comments