[WIP] 0.22.2 release branch (#16587) · sstalley/scikit-learn@4b7331e

Commit 4b7331e

Authored by jeremiedbb, alexshacked, oleksandr-pavlyk, ogrisel, and glemaitre
[WIP] 0.22.2 release branch (scikit-learn#16587)
* FIX ensure object array are properly casted when dtype=object (scikit-learn#16076)
* DOC Docstring example of classifier should import classifier (scikit-learn#16430)
* MNT Update nightly build URL and release staging config (scikit-learn#16435)
* BUG ensure that estimator_name is properly stored in the ROC display (scikit-learn#16500)
* BUG ensure that name is properly stored in the precision/recall display (scikit-learn#16505)
* ENH Perform KNN imputation without O(n^2) memory cost (scikit-learn#16397)
* bump scikit-learn version for binder
* bump version to 0.22.2
* MNT Skips failing SpectralCoclustering doctest (scikit-learn#16232)
* TST Updates test for deprecation in pandas.SparseArray (scikit-learn#16040)
* move 0.22.2 what's new entries (scikit-learn#16586)
* add 0.22.2 in the news of the web site frontpage
* skip test_ard_accuracy_on_easy_problem

Co-authored-by: alexshacked <al.shacked@gmail.com>
Co-authored-by: Oleksandr Pavlyk <oleksandr-pavlyk@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
1 parent b194674 commit 4b7331e

23 files changed: +296 -91 lines changed

.binder/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -5,5 +5,5 @@ scikit-image==0.16.2
 pandas==0.25.3
 sphinx-gallery==0.5.0
 # Need to update the scikit-learn version on each 0.22 minor release
-scikit-learn==0.22
+scikit-learn==0.22.2

doc/developers/advanced_installation.rst

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Installing a nightly build is the quickest way to:

 ::

-  pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn
+  pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn


 .. _install_bleeding_edge:

doc/templates/index.html

Lines changed: 1 addition & 0 deletions
@@ -156,6 +156,7 @@ <h4 class="sk-landing-call-header">News</h4>
 <li><strong>On-going development:</strong>
 <a href="https://scikit-learn.org/dev/whats_new.html"><strong>What's new</strong> (Changelog)</a>
 </li>
+<li><strong>February 2020.</strong> scikit-learn 0.22.2 is available for download (<a href="whats_new/v0.22.html#version-0-22-2">Changelog</a>).
 <li><strong>January 2020.</strong> scikit-learn 0.22.1 is available for download (<a href="whats_new/v0.22.html#version-0-22-1">Changelog</a>).
 <li><strong>December 2019.</strong> scikit-learn 0.22 is available for download (<a href="whats_new/v0.22.html#version-0-22-0">Changelog</a>).
 </li>

doc/whats_new/v0.22.rst

Lines changed: 42 additions & 0 deletions
@@ -2,6 +2,48 @@

 .. currentmodule:: sklearn

+.. _changes_0_22_2:
+
+Version 0.22.2
+==============
+
+**February 28 2020**
+
+Changelog
+---------
+
+:mod:`sklearn.impute`
+.....................
+
+- |Efficiency| Reduce :func:`impute.KNNImputer` asymptotic memory usage by
+  chunking pairwise distance computation.
+  :pr:`16397` by `Joel Nothman`_.
+
+:mod:`sklearn.metrics`
+......................
+
+- |Fix| Fixed a bug in :func:`metrics.plot_roc_curve` where
+  the name of the estimator was passed in the :class:`metrics.RocCurveDisplay`
+  instead of the parameter `name`. It results in a different plot when calling
+  :meth:`metrics.RocCurveDisplay.plot` for the subsequent times.
+  :pr:`16500` by :user:`Guillaume Lemaitre <glemaitre>`.
+
+- |Fix| Fixed a bug in :func:`metrics.plot_precision_recall_curve` where the
+  name of the estimator was passed in the
+  :class:`metrics.PrecisionRecallDisplay` instead of the parameter `name`. It
+  results in a different plot when calling
+  :meth:`metrics.PrecisionRecallDisplay.plot` for the subsequent times.
+  :pr:`#16505` by :user:`Guillaume Lemaitre <glemaitre>`.
+
+:mod:`sklearn.neighbors`
+..............................
+
+- |Fix| Fix a bug which converted a list of arrays into a 2-D object
+  array instead of a 1-D array containing NumPy arrays. This bug
+  was affecting :meth:`neighbors.NearestNeighbors.radius_neighbors`.
+  :pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
+  :user:`Alex Shacked <alexshacked>`.
+
 .. _changes_0_22_1:

 Version 0.22.1
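
The `sklearn.neighbors` entry in the changelog above describes a NumPy pitfall that is easy to reproduce outside scikit-learn. The sketch below is illustrative only: `to_object_array` is a hypothetical helper written for this example, not necessarily the function the fix introduced. It shows that passing equal-length arrays to `np.array(..., dtype=object)` yields a 2-D object array, while filling a preallocated 1-D object array keeps each element as a separate NumPy array.

```python
import numpy as np


def to_object_array(sequence):
    """Hypothetical helper: pack a sequence of arrays into a 1-D object array."""
    out = np.empty(len(sequence), dtype=object)
    for i, item in enumerate(sequence):
        # Element-wise assignment never triggers shape discovery,
        # so each slot holds the original ndarray.
        out[i] = item
    return out


neighbors = [np.array([1, 2]), np.array([3, 4])]

# Equal-length inner arrays are stacked into a (2, 2) object array.
stacked = np.array(neighbors, dtype=object)

# The helper keeps one ndarray per query point instead.
packed = to_object_array(neighbors)

print(stacked.shape, packed.shape)
```

The distinction matters for results such as radius queries, where each query point can have a different number of neighbors and callers expect one array per point.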

setup.cfg

Lines changed: 2 additions & 2 deletions
@@ -20,9 +20,9 @@ filterwarnings =

 [wheelhouse_uploader]
 artifact_indexes=
-    # Wheels built by travis (only for specific tags):
+    # Wheels built by Azure Pipelines (only for specific tags):
     # https://github.com/MacPython/scikit-learn-wheels
-    http://wheels.scipy.org
+    https://pypi.anaconda.org/scikit-learn-wheels-staging/simple/scikit-learn/

 [flake8]
 # Default flake8 3.5 ignored flags

sklearn/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@
 # Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer.
 # 'X.Y.dev0' is the canonical version of 'X.Y.dev'
 #
-__version__ = '0.22.1'
+__version__ = '0.22.2'


 # On OSX, we can get a runtime error due to multiple OpenMP libraries loaded

sklearn/cluster/_bicluster.py

Lines changed: 2 additions & 2 deletions
@@ -260,9 +260,9 @@ class SpectralCoclustering(BaseSpectral):
     >>> X = np.array([[1, 1], [2, 1], [1, 0],
     ...               [4, 7], [3, 5], [3, 6]])
     >>> clustering = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
-    >>> clustering.row_labels_
+    >>> clustering.row_labels_ #doctest: +SKIP
     array([0, 1, 1, 0, 0, 0], dtype=int32)
-    >>> clustering.column_labels_
+    >>> clustering.column_labels_ #doctest: +SKIP
     array([0, 0], dtype=int32)
     >>> clustering
     SpectralCoclustering(n_clusters=2, random_state=0)
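
The `+SKIP` directive added in the hunk above tells the standard-library `doctest` module not to run an example at all, so its printed output is never compared. A minimal sketch of that behaviour follows; the `row_labels` function and its values are invented for this illustration and only stand in for an output that can vary between platforms.

```python
import doctest


def row_labels():
    """Return labels that we pretend can vary between platforms.

    >>> row_labels()  # doctest: +SKIP
    [0, 1, 1, 0, 0, 0]
    >>> len(row_labels())
    6
    """
    # Deliberately different from the documented output: the first example
    # would fail if it were executed.
    return [1, 0, 0, 1, 1, 1]


finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(row_labels, name="row_labels",
                        globs={"row_labels": row_labels}):
    runner.run(test)

results = runner.summarize(verbose=False)
print("failed examples:", results.failed)
```

Skipping keeps the example visible in the rendered documentation while removing it from the test run, which is a reasonable trade-off when the exact labels are not guaranteed.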

sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py

Lines changed: 1 addition & 1 deletion
@@ -952,7 +952,7 @@ class HistGradientBoostingClassifier(BaseHistGradientBoosting,
     --------
     >>> # To use this experimental feature, we need to explicitly ask for it:
     >>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
-    >>> from sklearn.ensemble import HistGradientBoostingRegressor
+    >>> from sklearn.ensemble import HistGradientBoostingClassifier
     >>> from sklearn.datasets import load_iris
     >>> X, y = load_iris(return_X_y=True)
     >>> clf = HistGradientBoostingClassifier().fit(X, y)

sklearn/impute/_knn.py

Lines changed: 61 additions & 51 deletions
@@ -6,7 +6,7 @@

 from ._base import _BaseImputer
 from ..utils.validation import FLOAT_DTYPES
-from ..metrics import pairwise_distances
+from ..metrics import pairwise_distances_chunked
 from ..metrics.pairwise import _NAN_METRICS
 from ..neighbors._base import _get_weights
 from ..neighbors._base import _check_weights
@@ -217,71 +217,81 @@ def transform(self, X):

         mask = _get_mask(X, self.missing_values)
         mask_fit_X = self._mask_fit_X
+        valid_mask = ~np.all(mask_fit_X, axis=0)

-        # Removes columns where the training data is all nan
         if not np.any(mask):
-            valid_mask = ~np.all(mask_fit_X, axis=0)
+            # No missing values in X
+            # Remove columns where the training data is all nan
             return X[:, valid_mask]

         row_missing_idx = np.flatnonzero(mask.any(axis=1))

-        # Pairwise distances between receivers and fitted samples
-        dist = pairwise_distances(X[row_missing_idx, :], self._fit_X,
-                                  metric=self.metric,
-                                  missing_values=self.missing_values,
-                                  force_all_finite=force_all_finite)
+        non_missing_fix_X = np.logical_not(mask_fit_X)

         # Maps from indices from X to indices in dist matrix
         dist_idx_map = np.zeros(X.shape[0], dtype=np.int)
         dist_idx_map[row_missing_idx] = np.arange(row_missing_idx.shape[0])

-        non_missing_fix_X = np.logical_not(mask_fit_X)
-
-        # Find and impute missing
-        valid_idx = []
-        for col in range(X.shape[1]):
-
-            potential_donors_idx = np.flatnonzero(non_missing_fix_X[:, col])
-
-            # column was all missing during training
-            if len(potential_donors_idx) == 0:
-                continue
-
-            # column has no missing values
-            if not np.any(mask[:, col]):
-                valid_idx.append(col)
-                continue
+        def process_chunk(dist_chunk, start):
+            row_missing_chunk = row_missing_idx[start:start + len(dist_chunk)]

-            valid_idx.append(col)
-
-            receivers_idx = np.flatnonzero(mask[:, col])
-
-            # distances for samples that needed imputation for column
-            dist_subset = (dist[dist_idx_map[receivers_idx]]
-                           [:, potential_donors_idx])
+            # Find and impute missing by column
+            for col in range(X.shape[1]):
+                if not valid_mask[col]:
+                    # column was all missing during training
+                    continue

-            # receivers with all nan distances impute with mean
-            all_nan_dist_mask = np.isnan(dist_subset).all(axis=1)
-            all_nan_receivers_idx = receivers_idx[all_nan_dist_mask]
+                col_mask = mask[row_missing_chunk, col]
+                if not np.any(col_mask):
+                    # column has no missing values
+                    continue

-            if all_nan_receivers_idx.size:
-                col_mean = np.ma.array(self._fit_X[:, col],
-                                       mask=mask_fit_X[:, col]).mean()
-                X[all_nan_receivers_idx, col] = col_mean
+                potential_donors_idx, = np.nonzero(non_missing_fix_X[:, col])

-                if len(all_nan_receivers_idx) == len(receivers_idx):
-                    # all receivers imputed with mean
-                    continue
+                # receivers_idx are indices in X
+                receivers_idx = row_missing_chunk[np.flatnonzero(col_mask)]

-            # receivers with at least one defined distance
-            receivers_idx = receivers_idx[~all_nan_dist_mask]
-            dist_subset = (dist[dist_idx_map[receivers_idx]]
+                # distances for samples that needed imputation for column
+                dist_subset = (dist_chunk[dist_idx_map[receivers_idx] - start]
                                [:, potential_donors_idx])

-            n_neighbors = min(self.n_neighbors, len(potential_donors_idx))
-            value = self._calc_impute(dist_subset, n_neighbors,
-                                      self._fit_X[potential_donors_idx, col],
-                                      mask_fit_X[potential_donors_idx, col])
-            X[receivers_idx, col] = value
-
-        return super()._concatenate_indicator(X[:, valid_idx], X_indicator)
+                # receivers with all nan distances impute with mean
+                all_nan_dist_mask = np.isnan(dist_subset).all(axis=1)
+                all_nan_receivers_idx = receivers_idx[all_nan_dist_mask]
+
+                if all_nan_receivers_idx.size:
+                    col_mean = np.ma.array(self._fit_X[:, col],
+                                           mask=mask_fit_X[:, col]).mean()
+                    X[all_nan_receivers_idx, col] = col_mean
+
+                    if len(all_nan_receivers_idx) == len(receivers_idx):
+                        # all receivers imputed with mean
+                        continue
+
+                # receivers with at least one defined distance
+                receivers_idx = receivers_idx[~all_nan_dist_mask]
+                dist_subset = (dist_chunk[dist_idx_map[receivers_idx]
+                                          - start]
+                               [:, potential_donors_idx])
+
+                n_neighbors = min(self.n_neighbors, len(potential_donors_idx))
+                value = self._calc_impute(
+                    dist_subset,
+                    n_neighbors,
+                    self._fit_X[potential_donors_idx, col],
+                    mask_fit_X[potential_donors_idx, col])
+                X[receivers_idx, col] = value
+
+        # process in fixed-memory chunks
+        gen = pairwise_distances_chunked(
+            X[row_missing_idx, :],
+            self._fit_X,
+            metric=self.metric,
+            missing_values=self.missing_values,
+            force_all_finite=force_all_finite,
+            reduce_func=process_chunk)
+        for chunk in gen:
+            # process_chunk modifies X in place. No return value.
+            pass
+
+        return super()._concatenate_indicator(X[:, valid_mask], X_indicator)
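
The hunk above replaces one full distance matrix with a generator that hands each block of rows to a callback. The same pattern can be tried on its own with the public `pairwise_distances_chunked` function. The sketch below is not the imputer's code: the data, the `keep_nearest` callback, and the `working_memory=0.01` setting are assumptions chosen so that a small problem is split into several chunks. As in the diff, the callback writes its result into an outer array and returns nothing.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(200, 5)   # query points
Y = rng.rand(50, 5)    # reference points

# Filled in place, one slice per chunk.
nearest = np.empty(X.shape[0], dtype=np.intp)


def keep_nearest(dist_chunk, start):
    # dist_chunk has shape (rows_in_chunk, n_reference_points);
    # start is the offset of its first row within X.
    stop = start + len(dist_chunk)
    nearest[start:stop] = dist_chunk.argmin(axis=1)


n_chunks = 0
# working_memory is expressed in MiB; a tiny budget forces several chunks.
for _ in pairwise_distances_chunked(X, Y, reduce_func=keep_nearest,
                                    working_memory=0.01):
    n_chunks += 1

print(n_chunks, nearest[:5])
```

Only one block of distances is held in memory at a time, so peak memory depends on the chunk size rather than on the number of query rows.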

sklearn/impute/tests/test_knn.py

Lines changed: 19 additions & 8 deletions
@@ -1,6 +1,7 @@
 import numpy as np
 import pytest

+from sklearn import config_context
 from sklearn.impute import KNNImputer
 from sklearn.metrics.pairwise import nan_euclidean_distances
 from sklearn.metrics.pairwise import pairwise_distances
@@ -522,8 +523,12 @@ def custom_callable(x, y, missing_values=np.nan, squared=False):
     assert_allclose(imputer.fit_transform(X), X_imputed)


+@pytest.mark.parametrize("working_memory", [None, 0])
 @pytest.mark.parametrize("na", [-1, np.nan])
-def test_knn_imputer_with_simple_example(na):
+# Note that we use working_memory=0 to ensure that chunking is tested, even
+# for a small dataset. However, it should raise a UserWarning that we ignore.
+@pytest.mark.filterwarnings("ignore:adhere to working_memory")
+def test_knn_imputer_with_simple_example(na, working_memory):

     X = np.array([
         [0, na, 0, na],
@@ -553,8 +558,9 @@ def test_knn_imputer_with_simple_example(na):
         [r7c0, 7, 7, 7]
     ])

-    imputer_comp = KNNImputer(missing_values=na)
-    assert_allclose(imputer_comp.fit_transform(X), X_imputed)
+    with config_context(working_memory=working_memory):
+        imputer_comp = KNNImputer(missing_values=na)
+        assert_allclose(imputer_comp.fit_transform(X), X_imputed)


 @pytest.mark.parametrize("na", [-1, np.nan])
@@ -598,8 +604,10 @@ def test_knn_imputer_drops_all_nan_features(na):
     assert_allclose(knn.transform(X2), X2_expected)


+@pytest.mark.parametrize("working_memory", [None, 0])
 @pytest.mark.parametrize("na", [-1, np.nan])
-def test_knn_imputer_distance_weighted_not_enough_neighbors(na):
+def test_knn_imputer_distance_weighted_not_enough_neighbors(na,
+                                                            working_memory):
     X = np.array([
         [3, na],
         [2, na],
@@ -626,11 +634,14 @@ def test_knn_imputer_distance_weighted_not_enough_neighbors(na):
         [X_50, 5]
     ])

-    knn_3 = KNNImputer(missing_values=na, n_neighbors=3, weights='distance')
-    assert_allclose(knn_3.fit_transform(X), X_expected)
+    with config_context(working_memory=working_memory):
+        knn_3 = KNNImputer(missing_values=na, n_neighbors=3,
+                           weights='distance')
+        assert_allclose(knn_3.fit_transform(X), X_expected)

-    knn_4 = KNNImputer(missing_values=na, n_neighbors=4, weights='distance')
-    assert_allclose(knn_4.fit_transform(X), X_expected)
+        knn_4 = KNNImputer(missing_values=na, n_neighbors=4,
+                           weights='distance')
+        assert_allclose(knn_4.fit_transform(X), X_expected)


 @pytest.mark.parametrize("na, allow_nan", [(-1, False), (np.nan, True)])
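
The updated tests above run the imputer once with the default memory budget and once with `working_memory=0`, which forces the smallest possible chunks. The same check can be made outside pytest. In the sketch below the small matrix is made up for this illustration, and the warning raised when the budget is too small for even one row is silenced explicitly, mirroring the `filterwarnings` marker in the diff.

```python
import warnings

import numpy as np
from sklearn import config_context
from sklearn.impute import KNNImputer

# Invented data: every row has exactly one missing entry.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 5.0, 9.0],
              [7.0, 8.0, 12.0]])

# Default configuration: chunk size is derived from the default budget.
default_result = KNNImputer(n_neighbors=2).fit_transform(X)

with warnings.catch_warnings():
    # A zero budget cannot hold a single row of distances, so scikit-learn
    # warns and falls back to one row per chunk.
    warnings.simplefilter("ignore", UserWarning)
    with config_context(working_memory=0):
        chunked_result = KNNImputer(n_neighbors=2).fit_transform(X)

print(np.allclose(default_result, chunked_result))
```

Agreement between the two runs is what the parametrized tests assert: chunking should change memory use, not the imputed values.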
