Merge branch 'master' into news0.20 · scikit-learn/scikit-learn@5a98857 · GitHub

Commit 5a98857

Merge branch 'master' into news0.20
2 parents 4404b6b + e500447 commit 5a98857

21 files changed, +538 -40 lines changed

doc/developers/contributing.rst

Lines changed: 8 additions & 0 deletions
@@ -140,6 +140,14 @@ feedback:
140140
your **Python, scikit-learn, numpy, and scipy versions**. This information
141141
can be found by running the following code snippet::
142142

143+
>>> import sklearn
144+
>>> sklearn.show_versions() # doctest: +SKIP
145+
146+
.. note::
147+
148+
This utility function is only available in scikit-learn v0.20+.
149+
For previous versions, one has to explicitly run::
150+
143151
import platform; print(platform.platform())
144152
import sys; print("Python", sys.version)
145153
import numpy; print("NumPy", numpy.__version__)
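The note in this hunk implies a simple fallback for bug reports: use `sklearn.show_versions()` when it exists, and print the versions by hand otherwise. A minimal sketch of that pattern follows; `collect_versions` and `print_debug_info` are hypothetical helper names chosen for illustration, not part of scikit-learn.

```python
import importlib
import platform
import sys


def collect_versions(packages=("numpy", "scipy", "sklearn")):
    """Return a dict of version strings; missing packages map to None.

    Hypothetical helper for illustration, not a scikit-learn API.
    """
    report = {
        "platform": platform.platform(),
        "python": sys.version.replace("\n", " "),
    }
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # package is not installed
        else:
            report[name] = getattr(module, "__version__", None)
    return report


def print_debug_info():
    """Prefer sklearn.show_versions() (v0.20+), else print manually."""
    try:
        import sklearn
        sklearn.show_versions()
    except (ImportError, AttributeError):
        # sklearn is missing, or too old to have show_versions
        for key, value in collect_versions().items():
            print(key, value)
```

Catching `AttributeError` is what covers releases before 0.20, where the attribute does not exist.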

doc/developers/tips.rst

Lines changed: 3 additions & 7 deletions
@@ -121,15 +121,11 @@ Issue: Self-contained example for bug
121121
Issue: Software versions
122122
::
123123

124-
To help diagnose your issue, could you please paste the output of:
124+
To help diagnose your issue, please paste the output of:
125125
```py
126-
import platform; print(platform.platform())
127-
import sys; print("Python", sys.version)
128-
import numpy; print("NumPy", numpy.__version__)
129-
import scipy; print("SciPy", scipy.__version__)
130-
import sklearn; print("Scikit-Learn", sklearn.__version__)
126+
import sklearn; sklearn.show_versions()
131127
```
132-
? Thanks.
128+
Thanks.
133129

134130
Issue: Code blocks
135131
::

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ Functions
4848
config_context
4949
get_config
5050
set_config
51+
show_versions
5152

5253
.. _calibration_ref:
5354

doc/modules/clustering.rst

Lines changed: 1 addition & 1 deletion
@@ -1282,7 +1282,7 @@ following equation [VEB2009]_. In this equation,
12821282
:math:`b_j = |V_j|` (the number of elements in :math:`V_j`).
12831283

12841284

1285-
.. math:: E[\text{MI}(U,V)]=\sum_{i=1}^|U| \sum_{j=1}^|V| \sum_{n_{ij}=(a_i+b_j-N)^+
1285+
.. math:: E[\text{MI}(U,V)]=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \sum_{n_{ij}=(a_i+b_j-N)^+
12861286
}^{\min(a_i, b_j)} \frac{n_{ij}}{N}\log \left( \frac{ N.n_{ij}}{a_i b_j}\right)
12871287
\frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!
12881288
(N-a_i-b_j+n_{ij})!}
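The change in this hunk only adds braces so that the whole of `|U|` and `|V|` is typeset as the upper summation limit; the formula itself is unchanged. To make the formula concrete, here is a direct transcription into plain Python. It is a sketch for small inputs only, and it assumes that the lower limit `(a_i+b_j-N)^+` can be taken as `max(1, a_i + b_j - N)`, since a term with `n_ij = 0` has a zero factor and contributes nothing. The function name is illustrative.

```python
from math import factorial, log


def expected_mutual_information(a, b):
    """Evaluate E[MI(U, V)] from cluster sizes a_i = |U_i| and b_j = |V_j|.

    Uses exact integer factorials, so it is only practical for small N.
    """
    n_total = sum(a)
    if n_total != sum(b):
        raise ValueError("both partitions must cover the same N samples")
    emi = 0.0
    for a_i in a:
        for b_j in b:
            # Assumption: start at max(1, a_i + b_j - N); the n_ij = 0 term
            # is skipped because it contributes nothing to the sum.
            lower = max(1, a_i + b_j - n_total)
            for n_ij in range(lower, min(a_i, b_j) + 1):
                weight = (n_ij / n_total) * log(n_total * n_ij / (a_i * b_j))
                numerator = (factorial(a_i) * factorial(b_j)
                             * factorial(n_total - a_i)
                             * factorial(n_total - b_j))
                denominator = (factorial(n_total) * factorial(n_ij)
                               * factorial(a_i - n_ij) * factorial(b_j - n_ij)
                               * factorial(n_total - a_i - b_j + n_ij))
                emi += weight * (numerator / denominator)
    return emi
```

For two partitions of four samples into clusters of sizes 2 and 2, only the `n_ij = 2` terms are non-zero and the result is `log(2) / 3`, roughly 0.231.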

doc/modules/cross_validation.rst

Lines changed: 49 additions & 0 deletions
@@ -323,6 +323,14 @@ Example of 2-fold cross-validation on a dataset with 4 samples::
323323
[2 3] [0 1]
324324
[0 1] [2 3]
325325

326+
Here is a visualization of the cross-validation behavior. Note that
327+
:class:`KFold` is not affected by classes or groups.
328+
329+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_004.png
330+
:target: ../auto_examples/model_selection/plot_cv_indices.html
331+
:align: center
332+
:scale: 75%
333+
326334
Each fold is constituted by two arrays: the first one is related to the
327335
*training set*, and the second one to the *test set*.
328336
Thus, one can create the training/test sets using numpy indexing::
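The statement that `KFold` is not affected by classes or groups follows from how unshuffled k-fold splitting works: the folds are consecutive blocks of sample positions, so only the number of samples matters. A stdlib-only sketch of that idea (not scikit-learn's implementation; `kfold_indices` is a hypothetical name) reproduces the 2-fold example on 4 samples:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for unshuffled k-fold splitting."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


for train, test in kfold_indices(n_samples=4, n_splits=2):
    print(train, test)
# prints [2, 3] [0, 1] and then [0, 1] [2, 3]
```

Neither labels nor group identifiers appear anywhere in the function signature, which is the property the figure illustrates.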
@@ -471,6 +479,14 @@ Here is a usage example::
471479
[2 7 5 8 0 3 4] [6 1 9]
472480
[4 1 0 6 8 9 3] [5 2 7]
473481

482+
Here is a visualization of the cross-validation behavior. Note that
483+
:class:`ShuffleSplit` is not affected by classes or groups.
484+
485+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_006.png
486+
:target: ../auto_examples/model_selection/plot_cv_indices.html
487+
:align: center
488+
:scale: 75%
489+
474490
:class:`ShuffleSplit` is thus a good alternative to :class:`KFold` cross
475491
validation that allows a finer control on the number of iterations and
476492
the proportion of samples on each side of the train / test split.
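The finer control mentioned here comes from drawing every split independently: the number of iterations and the test size are free parameters, and test sets from different iterations may overlap. A minimal stdlib sketch under those assumptions (`shuffle_split_indices` is a hypothetical name, and the test size is given as a sample count to keep the example simple):

```python
import random


def shuffle_split_indices(n_samples, n_splits, n_test, seed=0):
    """Yield (train, test) index lists; each split reshuffles independently."""
    if not 0 < n_test < n_samples:
        raise ValueError("need 0 < n_test < n_samples")
    rng = random.Random(seed)
    for _ in range(n_splits):
        permutation = list(range(n_samples))
        rng.shuffle(permutation)
        # The first n_test shuffled positions form the test set.
        yield permutation[n_test:], permutation[:n_test]


for train, test in shuffle_split_indices(n_samples=10, n_splits=3, n_test=3):
    print(train, test)
```

As with the k-fold case, no class or group information is consulted.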
@@ -506,6 +522,13 @@ two slightly unbalanced classes::
506522
[0 1 3 4 5 8 9] [2 6 7]
507523
[0 1 2 4 5 6 7] [3 8 9]
508524

525+
Here is a visualization of the cross-validation behavior.
526+
527+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_007.png
528+
:target: ../auto_examples/model_selection/plot_cv_indices.html
529+
:align: center
530+
:scale: 75%
531+
509532
:class:`RepeatedStratifiedKFold` can be used to repeat Stratified K-Fold n times
510533
with different randomization in each repetition.
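The stratification shown in the new figure means that each test fold keeps roughly the class proportions of the full dataset. One simple way to obtain that property is to deal the samples of each class round-robin across the folds. The sketch below does exactly that; it is a simplified illustration and does not reproduce the exact fold assignment used by scikit-learn. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def stratified_kfold_indices(y, n_splits):
    """Yield (train, test) index lists with per-class round-robin folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    fold_of = {}
    for members in by_class.values():
        # Deal the samples of one class across the folds in turn.
        for position, index in enumerate(members):
            fold_of[index] = position % n_splits
    for fold in range(n_splits):
        test = [i for i in range(len(y)) if fold_of[i] == fold]
        train = [i for i in range(len(y)) if fold_of[i] != fold]
        yield train, test


# Same class layout as the plotting example: 10, 30 and 60 samples.
y = [0] * 10 + [1] * 30 + [2] * 60
for train, test in stratified_kfold_indices(y, n_splits=5):
    print(Counter(y[i] for i in test))
```

Every test fold in this run contains 2, 6 and 12 samples of the three classes, matching the 10/30/60 ratio of the whole set.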
511534

@@ -517,6 +540,13 @@ Stratified Shuffle Split
517540
stratified splits, *i.e* which creates splits by preserving the same
518541
percentage for each target class as in the complete set.
519542

543+
Here is a visualization of the cross-validation behavior.
544+
545+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_009.png
546+
:target: ../auto_examples/model_selection/plot_cv_indices.html
547+
:align: center
548+
:scale: 75%
549+
520550
.. _group_cv:
521551

522552
Cross-validation iterators for grouped data.
@@ -569,6 +599,12 @@ Each subject is in a different testing fold, and the same subject is never in
569599
both testing and training. Notice that the folds do not have exactly the same
570600
size due to the imbalance in the data.
571601

602+
Here is a visualization of the cross-validation behavior.
603+
604+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_005.png
605+
:target: ../auto_examples/model_selection/plot_cv_indices.html
606+
:align: center
607+
:scale: 75%
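The two properties described in this hunk (a group never appears on both sides of a split, and folds may differ in size) can be sketched with a greedy assignment: take the groups from largest to smallest and put each one into the fold that currently holds the fewest samples. This is an illustrative stdlib sketch with a hypothetical function name, not a copy of the library code.

```python
from collections import Counter


def group_kfold_indices(groups, n_splits):
    """Yield (train, test) index lists that never split a group."""
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least as many groups as folds")
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # Largest groups first, each into the currently lightest fold.
    for group, size in sorted(sizes.items(), key=lambda item: -item[1]):
        lightest = fold_sizes.index(min(fold_sizes))
        fold_sizes[lightest] += size
        fold_of_group[group] = lightest
    for fold in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, test


groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
for train, test in group_kfold_indices(groups, n_splits=3):
    print(train, test)
```

With group sizes 3, 3 and 4, the three test folds necessarily have unequal sizes, which is the imbalance the surrounding text points out.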
572608

573609
Leave One Group Out
574610
^^^^^^^^^^^^^^^^^^^
@@ -645,6 +681,13 @@ Here is a usage example::
645681
[2 3 4 5] [0 1 6 7]
646682
[4 5 6 7] [0 1 2 3]
647683

684+
Here is a visualization of the cross-validation behavior.
685+
686+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_008.png
687+
:target: ../auto_examples/model_selection/plot_cv_indices.html
688+
:align: center
689+
:scale: 75%
690+
648691
This class is useful when the behavior of :class:`LeavePGroupsOut` is
649692
desired, but the number of groups is large enough that generating all
650693
possible partitions with :math:`P` groups withheld would be prohibitively
@@ -709,6 +752,12 @@ Example of 3-split time series cross-validation on a dataset with 6 samples::
709752
[0 1 2 3] [4]
710753
[0 1 2 3 4] [5]
711754

755+
Here is a visualization of the cross-validation behavior.
756+
757+
.. figure:: ../auto_examples/model_selection/images/sphx_glr_plot_cv_indices_010.png
758+
:target: ../auto_examples/model_selection/plot_cv_indices.html
759+
:align: center
760+
:scale: 75%
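The expanding-window pattern in the 3-split example (each training set is a prefix of the data, and the test set is the block that immediately follows it) can be sketched in a few lines. The block size used here, `n_samples // (n_splits + 1)`, is an assumption chosen so that the sketch reproduces the example output; `time_series_split_indices` is a hypothetical name.

```python
def time_series_split_indices(n_samples, n_splits):
    """Yield (train, test) index lists with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too many splits for this number of samples")
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        # Everything before `start` is training data; no future samples leak in.
        yield list(range(start)), list(range(start, start + test_size))


for train, test in time_series_split_indices(n_samples=6, n_splits=3):
    print(train, test)
# prints [0, 1, 2] [3], then [0, 1, 2, 3] [4], then [0, 1, 2, 3, 4] [5]
```

Unlike the shuffled splitters, the order of the samples matters here: a test index is always larger than every training index.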
712761

713762
A note on shuffling
714763
===================

doc/whats_new/v0.20.rst

Lines changed: 17 additions & 5 deletions
@@ -216,7 +216,13 @@ Misc
216216
- An environment variable to use the site joblib instead of the vendored
217217
one was added (:ref:`environment_variable`). The main API of joblib is now
218218
exposed in :mod:`sklearn.utils`.
219-
:issue:`11166`by `Gael Varoquaux`_.
219+
:issue:`11166` by `Gael Varoquaux`_.
220+
221+
- A utility method :func:`sklearn.show_versions()` was added to print out
222+
information relevant for debugging. It includes the user system, the
223+
Python executable, the version of the main libraries and BLAS binding
224+
information.
225+
:issue:`11596` by :user:`Alexandre Boucaud <aboucaud>`
220226

221227
Enhancements
222228
............
@@ -657,10 +663,11 @@ Bug fixes
657663
and set by default to 5. Previous behavior is equivalent to setting the
658664
parameter to 1. :issue:`9043` by `Tom Dupre la Tour`_.
659665

660-
- Fixed a bug where liblinear and libsvm-based estimators would segfault if
661-
passed a scipy.sparse matrix with 64-bit indices. They now raise a
662-
ValueError.
663-
:issue:`11327` by :user:`Karan Dhingra <kdhingra307>` and `Joel Nothman`_.
666+
- Fixed a bug in :func:`logistic.logistic_regression_path` to ensure that the
667+
returned coefficients are correct when ``multiclass='multinomial'``.
668+
Previously, some of the coefficients would override each other, leading to
669+
incorrect results in :class:`logistic.LogisticRegressionCV`. :issue:`11724`
670+
by :user:`Nicolas Hug <NicolasHug>`.
664671

665672
:mod:`metrics`
666673

@@ -799,6 +806,11 @@ Miscellaneous
799806
- Fixed a bug when setting parameters on meta-estimator, involving both a
800807
wrapped estimator and its parameter. :issue:`9999` by :user:`Marcus Voss
801808
<marcus-voss>` and `Joel Nothman`_.
809+
810+
- Fixed a bug where liblinear and libsvm-based estimators would segfault if
811+
passed a scipy.sparse matrix with 64-bit indices. They now raise a
812+
ValueError.
813+
:issue:`11327` by :user:`Karan Dhingra <kdhingra307>` and `Joel Nothman`_.
802814

803815
API changes summary
804816
-------------------
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
1+
"""
2+
Visualizing cross-validation behavior in scikit-learn
3+
=====================================================
4+
5+
Choosing the right cross-validation object is a crucial part of fitting a
6+
model properly. There are many ways to split data into training and test
7+
sets in order to avoid model overfitting, to standardize the number of
8+
groups in test sets, etc.
9+
10+
This example visualizes the behavior of several common scikit-learn objects
11+
for comparison.
12+
"""
13+
14+
from sklearn.model_selection import (TimeSeriesSplit, KFold, ShuffleSplit,
15+
StratifiedKFold, GroupShuffleSplit,
16+
GroupKFold, StratifiedShuffleSplit)
17+
import numpy as np
18+
import matplotlib.pyplot as plt
19+
from matplotlib.patches import Patch
20+
np.random.seed(1338)
21+
cmap_data = plt.cm.Paired
22+
cmap_cv = plt.cm.coolwarm
23+
n_splits = 4
24+
25+
###############################################################################
26+
# Visualize our data
27+
# ------------------
28+
#
29+
# First, we must understand the structure of our data. It has 100 randomly
30+
# generated input datapoints, 3 classes split unevenly across datapoints,
31+
# and 10 "groups" split evenly across datapoints.
32+
#
33+
# As we'll see, some cross-validation objects do specific things with
34+
# labeled data, others behave differently with grouped data, and others
35+
# do not use this information.
36+
#
37+
# To begin, we'll visualize our data.
38+
39+
# Generate the class/group data
40+
n_points = 100
41+
X = np.random.randn(100, 10)
42+
43+
percentiles_classes = [.1, .3, .6]
44+
y = np.hstack([[ii] * int(100 * perc)
45+
for ii, perc in enumerate(percentiles_classes)])
46+
47+
# Evenly spaced groups repeated once
48+
groups = np.hstack([[ii] * 10 for ii in range(10)])
49+
50+
51+
def visualize_groups(classes, groups, name):
52+
# Visualize dataset groups
53+
fig, ax = plt.subplots()
54+
ax.scatter(range(len(groups)), [.5] * len(groups), c=groups, marker='_',
55+
lw=50, cmap=cmap_data)
56+
ax.scatter(range(len(groups)), [3.5] * len(groups), c=classes, marker='_',
57+
lw=50, cmap=cmap_data)
58+
ax.set(ylim=[-1, 5], yticks=[.5, 3.5],
59+
yticklabels=['Data\ngroup', 'Data\nclass'], xlabel="Sample index")
60+
61+
62+
visualize_groups(y, groups, 'no groups')
63+
64+
###############################################################################
65+
# Define a function to visualize cross-validation behavior
66+
# --------------------------------------------------------
67+
#
68+
# We'll define a function that lets us visualize the behavior of each
69+
# cross-validation object. We'll perform 4 splits of the data. On each
70+
# split, we'll visualize the indices chosen for the training set
71+
# (in blue) and the test set (in red).
72+
73+
74+
def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
75+
"""Create a sample plot for indices of a cross-validation object."""
76+
77+
# Generate the training/testing visualizations for each CV split
78+
for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
79+
# Fill in indices with the training/test groups
80+
indices = np.array([np.nan] * len(X))
81+
indices[tt] = 1
82+
indices[tr] = 0
83+
84+
# Visualize the results
85+
ax.scatter(range(len(indices)), [ii + .5] * len(indices),
86+
c=indices, marker='_', lw=lw, cmap=cmap_cv,
87+
vmin=-.2, vmax=1.2)
88+
89+
# Plot the data classes and groups at the end
90+
ax.scatter(range(len(X)), [ii + 1.5] * len(X),
91+
c=y, marker='_', lw=lw, cmap=cmap_data)
92+
93+
ax.scatter(range(len(X)), [ii + 2.5] * len(X),
94+
c=group, marker='_', lw=lw, cmap=cmap_data)
95+
96+
# Formatting
97+
yticklabels = list(range(n_splits)) + ['class', 'group']
98+
ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
99+
xlabel='Sample index', ylabel="CV iteration",
100+
ylim=[n_splits+2.2, -.2], xlim=[0, 100])
101+
ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
102+
return ax
103+
104+
105+
###############################################################################
106+
# Let's see how it looks for the `KFold` cross-validation object:
107+
108+
fig, ax = plt.subplots()
109+
cv = KFold(n_splits)
110+
plot_cv_indices(cv, X, y, groups, ax, n_splits)
111+
112+
###############################################################################
113+
# As you can see, by default the KFold cross-validation iterator does not
114+
# take either datapoint class or group into consideration. We can change this
115+
# by using the ``StratifiedKFold`` like so.
116+
117+
fig, ax = plt.subplots()
118+
cv = StratifiedKFold(n_splits)
119+
plot_cv_indices(cv, X, y, groups, ax, n_splits)
120+
121+
###############################################################################
122+
# In this case, the cross-validation retained the same ratio of classes across
123+
# each CV split. Next we'll visualize this behavior for a number of CV
124+
# iterators.
125+
#
126+
# Visualize cross-validation indices for many CV objects
127+
# ------------------------------------------------------
128+
#
129+
# Let's visually compare the cross validation behavior for many
130+
# scikit-learn cross-validation objects. Below we will loop through several
131+
# common cross-validation objects, visualizing the behavior of each.
132+
#
133+
# Note how some use the group/class information while others do not.
134+
135+
cvs = [KFold, GroupKFold, ShuffleSplit, StratifiedKFold,
136+
GroupShuffleSplit, StratifiedShuffleSplit, TimeSeriesSplit]
137+
138+
139+
for cv in cvs:
140+
this_cv = cv(n_splits=n_splits)
141+
fig, ax = plt.subplots(figsize=(6, 3))
142+
plot_cv_indices(this_cv, X, y, groups, ax, n_splits)
143+
144+
ax.legend([Patch(color=cmap_cv(.8)), Patch(color=cmap_cv(.02))],
145+
['Testing set', 'Training set'], loc=(1.02, .8))
146+
# Make the legend fit
147+
plt.tight_layout()
148+
fig.subplots_adjust(right=.7)
149+
plt.show()
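The core of `plot_cv_indices` is the per-split `indices` array: test positions are set to 1, training positions to 0, and any sample used by neither stays NaN and is therefore not drawn. A dependency-free sketch of the same bookkeeping renders each split as a row of text instead of a scatter plot. `render_split` and the marker characters are illustrative choices, not part of the example file.

```python
def render_split(train, test, n_samples):
    """Return one text row: 'T' = test, '.' = train, ' ' = unused sample."""
    row = [" "] * n_samples  # plays the role of the NaN-filled array
    for index in test:
        row[index] = "T"
    for index in train:
        row[index] = "."
    return "".join(row)


# Splits of a 3-split time series cross-validation on 6 samples;
# samples after the test block are unused and stay blank.
splits = [([0, 1, 2], [3]), ([0, 1, 2, 3], [4]), ([0, 1, 2, 3, 4], [5])]
for iteration, (train, test) in enumerate(splits):
    print(iteration, render_split(train, test, n_samples=6))
```

Rows are printed in iteration order from top to bottom, which mirrors the inverted y-axis (`ylim=[n_splits+2.2, -.2]`) used in the plotting function.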

sklearn/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -62,6 +62,8 @@
6262
else:
6363
from . import __check_build
6464
from .base import clone
65+
from .utils._show_versions import show_versions
66+
6567
__check_build # avoid flakes unused variable error
6668

6769
__all__ = ['calibration', 'cluster', 'covariance', 'cross_decomposition',
@@ -74,7 +76,8 @@
7476
'preprocessing', 'random_projection', 'semi_supervised',
7577
'svm', 'tree', 'discriminant_analysis', 'impute', 'compose',
7678
# Non-modules:
77-
'clone', 'get_config', 'set_config', 'config_context']
79+
'clone', 'get_config', 'set_config', 'config_context',
80+
'show_versions']
7881

7982

8083
def setup_module(module):

sklearn/cluster/birch.py

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ def _split_node(node, threshold, branching_factor):
7474

7575
farthest_idx = np.unravel_index(
7676
dist.argmax(), (n_clusters, n_clusters))
77-
node1_dist, node2_dist = dist[[farthest_idx]]
77+
node1_dist, node2_dist = dist[(farthest_idx,)]
7878

7979
node1_closer = node1_dist < node2_dist
8080
for idx, subcluster in enumerate(node.subclusters_):
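The commit does not state why `dist[[farthest_idx]]` became `dist[(farthest_idx,)]`. A plausible reading is that NumPy deprecated indexing with a list that contains sequences, where the list was silently treated as a tuple of indices, so the new form spells out that interpretation. The sketch below uses a small made-up distance matrix (an assumption for illustration) to show what the tuple form selects: the two rows belonging to the farthest pair.

```python
import numpy as np

# Pairwise distances between three 1-D points (illustrative data).
points = np.array([0.0, 1.0, 5.0])
dist = np.abs(points[:, None] - points[None, :])

# Row and column of the largest distance, as a tuple of two integers.
farthest_idx = np.unravel_index(dist.argmax(), dist.shape)

# A one-element tuple whose only entry is the index pair: the pair acts as
# an integer-array index on the first axis, so two rows come back.
node1_dist, node2_dist = dist[(farthest_idx,)]

# Equivalent and arguably more explicit: index the rows with a list.
rows = dist[list(farthest_idx)]

print(farthest_idx, node1_dist, node2_dist)
```

Here the farthest pair is points 0 and 2, so `node1_dist` is row 0 and `node2_dist` is row 2 of the matrix; comparing them elementwise tells which of the two each point is closer to.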

sklearn/feature_extraction/tests/test_image.py

Lines changed: 3 additions & 3 deletions
@@ -304,9 +304,9 @@ def test_extract_patches_strided():
304304
ndim = len(image_shape)
305305

306306
assert_true(patches.shape[:ndim] == expected_view)
307-
last_patch_slices = [slice(i, i + j, None) for i, j in
308-
zip(last_patch, patch_size)]
309-
assert_true((patches[[slice(-1, None, None)] * ndim] ==
307+
last_patch_slices = tuple(slice(i, i + j, None) for i, j in
308+
zip(last_patch, patch_size))
309+
assert_true((patches[(-1, None, None) * ndim] ==
310310
image[last_patch_slices].squeeze()).all())
311311

312312
