Merge remote-tracking branch 'upstream/master' into test_btn · thoo/scikit-learn@0e4e517

Commit 0e4e517

Merge remote-tracking branch 'upstream/master' into test_btn
* upstream/master:
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
2 parents e9ea893 + c81e255 commit 0e4e517

39 files changed, +679 -362 lines changed

.travis.yml

Lines changed: 6 additions & 0 deletions
@@ -38,6 +38,12 @@ matrix:
  NUMPY_VERSION="1.10.4" SCIPY_VERSION="0.16.1" CYTHON_VERSION="0.25.2"
  PILLOW_VERSION="4.0.0" COVERAGE=true
  if: type != cron
+ # Python 3.5 build
+ - env: DISTRIB="conda" PYTHON_VERSION="3.5" INSTALL_MKL="false"
+ NUMPY_VERSION="1.10.4" SCIPY_VERSION="0.16.1" CYTHON_VERSION="0.25.2"
+ PILLOW_VERSION="4.0.0" COVERAGE=true
+ SKLEARN_SITE_JOBLIB=1 JOBLIB_VERSION="0.11"
+ if: type != cron
  # This environment tests the latest available dependencies.
  # It runs tests requiring pandas and PyAMG.
  # It also runs with the site joblib instead of the vendored copy of joblib.

doc/glossary.rst

Lines changed: 1 addition & 1 deletion
@@ -226,7 +226,7 @@ General Concepts

  We try to adhere to `PEP257
  <https://www.python.org/dev/peps/pep-0257/>`_, and follow `NumpyDoc
- conventions <numpydoc.readthedocs.io/en/latest/format.html>`_.
+ conventions <https://numpydoc.readthedocs.io/en/latest/format.html>`_.

  double underscore
  double underscore notation

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -1400,6 +1400,7 @@ Low-level methods
  :template: function.rst

  tree.export_graphviz
+ tree.plot_tree


  .. _utils_ref:

doc/modules/computing.rst

Lines changed: 3 additions & 3 deletions
@@ -567,9 +567,9 @@ These environment variables should be set before importing scikit-learn.
  scikit-learn uses the site joblib rather than its vendored version.
  Consequently, joblib must be installed for scikit-learn to run.
  Note that using the site joblib is at your own risks: the versions of
- scikt-learn and joblib need to be compatible. In addition, dumps from
- joblib.Memory might be incompatible, and you might loose some caches
- and have to redownload some datasets.
+ scikit-learn and joblib need to be compatible. Currently, joblib 0.11+
+ is supported. In addition, dumps from joblib.Memory might be incompatible,
+ and you might loose some caches and have to redownload some datasets.

  :SKLEARN_ASSUME_FINITE:

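Note on the doc/modules/computing.rst change: the passage says the variable has to be set before scikit-learn is imported. A minimal sketch of doing that from Python follows. It assumes the value "1" (the value used in the .travis.yml entry of this commit) enables the site joblib and that a compatible joblib (0.11+) is installed. The helper name `use_site_joblib` is hypothetical and not part of scikit-learn.

```python
import os
import sys


def use_site_joblib():
    """Ask scikit-learn to use the site joblib (hypothetical helper).

    The variable is read at import time, so this refuses to run once
    scikit-learn has already been imported in the current process.
    """
    if "sklearn" in sys.modules:
        raise RuntimeError("set SKLEARN_SITE_JOBLIB before importing sklearn")
    os.environ["SKLEARN_SITE_JOBLIB"] = "1"


use_site_joblib()
# import sklearn  # safe only after the variable has been set
```

Setting the variable in the shell or CI configuration, as the Travis entry does, avoids the ordering problem entirely.
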
doc/whats_new/v0.20.rst

Lines changed: 19 additions & 3 deletions
@@ -134,14 +134,20 @@ Changelog
  :mod:`sklearn.preprocessing`
  ........................

+ - |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
+ manually specified categories. :issue:`12365` by `Joris Van den Bossche`_.
+
+ - |Fix| Fixed bug in :class:`preprocessing.KBinsDiscretizer` where the
+ ``transform`` method mutates the ``_encoder`` attribute. The ``transform``
+ method is now thread safe. :issue:`12514` by
+ :user:`Hanmin Qin <qinhanmin2014>`.
+
  - |API| The default value of the :code:`method` argument in
  :func:`preprocessing.power_transform` will be changed from :code:`box-cox`
  to :code:`yeo-johnson` to match :class:`preprocessing.PowerTransformer`
  in version 0.23. A FutureWarning is raised when the default value is used.
  :issue:`12317` by :user:`Eric Chang <chang>`.
-
- - |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
- manually specified categories. :issue:`12365` by `Joris Van den Bossche`_.
+

  :mod:`sklearn.utils`
  ........................
@@ -150,6 +156,13 @@ Changelog
  precision issues in :class:`preprocessing.StandardScaler` and
  :class:`decomposition.IncrementalPCA` when using float32 datasets.
  :issue:`12338` by :user:`bauks <bauks>`.
+
+ Miscellaneous
+ .............
+
+ - |Fix| When using site joblib by setting the environment variable
+ `SKLEARN_SITE_JOBLIB`, added compatibility with joblib 0.11 in addition
+ to 0.12+. :issue:`12350` by `Joel Nothman`_ and `Roman Yurchak`_.

  Miscellaneous
  .............
@@ -1309,6 +1322,9 @@ Miscellaneous
  happens immediately (i.e., without a deprecation cycle).
  :issue:`11741` by `Olivier Grisel`_.

+ - |Fix| Fixed a bug in validation helpers where passing a Dask DataFrame results
+ in an error. :issue:`12462` by :user:`Zachariah Miller <zwmiller>`
+
  Changes to estimator checks
  ---------------------------

doc/whats_new/v0.21.rst

Lines changed: 2 additions & 2 deletions
@@ -106,7 +106,7 @@ Support for Python 3.4 and below has been officially dropped.
  :mod:`sklearn.tree`
  ...................
  - Decision Trees can now be plotted with matplotlib using
- :func:`tree.export.plot_tree` without relying on the ``dot`` library,
+ :func:`tree.plot_tree` without relying on the ``dot`` library,
  removing a hard-to-install dependency. :issue:`8508` by `Andreas Müller`_.

  - |Feature| ``get_n_leaves()`` and ``get_depth()`` have been added to
@@ -127,5 +127,5 @@ These changes mostly affect library developers.
  - Add ``check_fit_idempotent`` to
  :func:`~utils.estimator_checks.check_estimator`, which checks that
  when `fit` is called twice with the same data, the ouput of
- `predit`, `predict_proba`, `transform`, and `decision_function` does not
+ `predict`, `predict_proba`, `transform`, and `decision_function` does not
  change. :issue:`12328` by :user:`Nicolas Hug <NicolasHug>`

sklearn/cluster/hierarchical.py

Lines changed: 1 addition & 0 deletions
@@ -230,6 +230,7 @@ def ward_tree(X, connectivity=None, n_clusters=None, return_distance=False):
  'retain the lower branches required '
  'for the specified number of clusters',
  stacklevel=2)
+ X = np.require(X, requirements="W")
  out = hierarchy.ward(X)
  children_ = out[:, :2].astype(np.intp)

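Note on the sklearn/cluster/hierarchical.py change: the added line asks NumPy for a writable array before calling `hierarchy.ward`, which the commit message ties to the segfault with read-only memory maps. The sketch below illustrates what `np.require(..., requirements="W")` does. It uses an in-memory array with its write flag cleared as a stand-in for a read-only mmap, which is an assumption made to keep the example self-contained.

```python
import numpy as np

# Stand-in for a read-only memory-mapped input (assumption: an ordinary
# array with the WRITEABLE flag cleared behaves the same for this purpose).
X = np.arange(6, dtype=np.float64).reshape(3, 2)
X.setflags(write=False)

# np.require returns the input unchanged when it already satisfies the
# requirements, and a copy that satisfies them otherwise.
X_w = np.require(X, requirements="W")

print(X.flags.writeable)    # the original stays read-only
print(X_w.flags.writeable)  # the returned array can be written to
```

Because a copy is made only when the flag is missing, already-writable inputs should not pay for an extra allocation.
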
sklearn/cluster/k_means_.py

Lines changed: 9 additions & 6 deletions
@@ -850,7 +850,9 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
  Attributes
  ----------
  cluster_centers_ : array, [n_clusters, n_features]
- Coordinates of cluster centers
+ Coordinates of cluster centers. If the algorithm stops before fully
+ converging (see ``tol`` and ``max_iter``), these will not be
+ consistent with ``labels_``.

  labels_ :
  Labels of each point
@@ -901,11 +903,12 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
  clustering algorithms available), but it falls in local minima. That's why
  it can be useful to restart it several times.

- If the algorithm stops before fully converging (because of ``tol`` of
- ``max_iter``), ``labels_`` and ``means_`` will not be consistent, i.e. the
- ``means_`` will not be the means of the points in each cluster.
- Also, the estimator will reassign ``labels_`` after the last iteration to
- make ``labels_`` consistent with ``predict`` on the training set.
+ If the algorithm stops before fully converging (because of ``tol`` or
+ ``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent,
+ i.e. the ``cluster_centers_`` will not be the means of the points in each
+ cluster. Also, the estimator will reassign ``labels_`` after the last
+ iteration to make ``labels_`` consistent with ``predict`` on the training
+ set.

  """

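Note on the sklearn/cluster/k_means_.py docstring change: the inconsistency it describes can be reproduced with a toy one-dimensional Lloyd iteration. The sketch below is not scikit-learn's implementation. It is a simplified stand-in with no `tol` check, it assumes no cluster ever becomes empty, and the function names are hypothetical. It stops after `max_iter` iterations and then reassigns labels once, as the docstring describes.

```python
def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            for x in points]


def means(points, labels, n_clusters):
    """Mean of the points in each cluster (assumes no empty cluster)."""
    out = []
    for k in range(n_clusters):
        members = [x for x, lab in zip(points, labels) if lab == k]
        out.append(sum(members) / len(members))
    return out


def lloyd_1d(points, centers, max_iter):
    """Toy Lloyd iterations followed by one final label reassignment."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        centers = means(points, labels, len(centers))
    # Final reassignment so labels match what predict() would return.
    labels = assign(points, centers)
    return centers, labels


points = [0.0, 1.0, 2.0, 10.0]
centers, labels = lloyd_1d(points, [0.0, 1.0], max_iter=1)
print(centers)                    # centers from the last update step
print(means(points, labels, 2))   # means of the final labels differ
```

With `max_iter=1` the returned centers come from the labels of the previous step, so they are not the means of the final labels. Running the same data with a larger `max_iter` lets the two agree.
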
sklearn/cluster/optics_.py

Lines changed: 6 additions & 6 deletions
@@ -26,7 +26,7 @@ def optics(X, min_samples=5, max_eps=np.inf, metric='minkowski',
  p=2, metric_params=None, maxima_ratio=.75,
  rejection_ratio=.7, similarity_threshold=0.4,
  significant_min=.003, min_cluster_size=.005,
- min_maxima_ratio=0.001, algorithm='ball_tree',
+ min_maxima_ratio=0.001, algorithm='auto',
  leaf_size=30, n_jobs=None):
  """Perform OPTICS clustering from vector array

@@ -133,11 +133,11 @@ def optics(X, min_samples=5, max_eps=np.inf, metric='minkowski',
  algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
  Algorithm used to compute the nearest neighbors:

- - 'ball_tree' will use :class:`BallTree` (default)
+ - 'ball_tree' will use :class:`BallTree`
  - 'kd_tree' will use :class:`KDTree`
  - 'brute' will use a brute-force search.
  - 'auto' will attempt to decide the most appropriate algorithm
- based on the values passed to :meth:`fit` method.
+ based on the values passed to :meth:`fit` method. (default)

  Note: fitting on sparse input will override the setting of
  this parameter, using brute force.
@@ -289,11 +289,11 @@ class OPTICS(BaseEstimator, ClusterMixin):
  algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
  Algorithm used to compute the nearest neighbors:

- - 'ball_tree' will use :class:`BallTree` (default)
+ - 'ball_tree' will use :class:`BallTree`
  - 'kd_tree' will use :class:`KDTree`
  - 'brute' will use a brute-force search.
  - 'auto' will attempt to decide the most appropriate algorithm
- based on the values passed to :meth:`fit` method.
+ based on the values passed to :meth:`fit` method. (default)

  Note: fitting on sparse input will override the setting of
  this parameter, using brute force.
@@ -357,7 +357,7 @@ def __init__(self, min_samples=5, max_eps=np.inf, metric='minkowski',
  p=2, metric_params=None, maxima_ratio=.75,
  rejection_ratio=.7, similarity_threshold=0.4,
  significant_min=.003, min_cluster_size=.005,
- min_maxima_ratio=0.001, algorithm='ball_tree',
+ min_maxima_ratio=0.001, algorithm='auto',
  leaf_size=30, n_jobs=None):

  self.max_eps = max_eps

sklearn/ensemble/forest.py

Lines changed: 12 additions & 11 deletions
@@ -49,7 +49,6 @@ class calls the ``fit`` method of each sub-estimator on random samples
  from scipy.sparse import issparse
  from scipy.sparse import hstack as sparse_hstack

-
  from ..base import ClassifierMixin, RegressorMixin
  from ..utils import Parallel, delayed
  from ..externals import six
@@ -61,7 +60,7 @@ class calls the ``fit`` method of each sub-estimator on random samples
  from ..utils import check_random_state, check_array, compute_sample_weight
  from ..exceptions import DataConversionWarning, NotFittedError
  from .base import BaseEnsemble, _partition_estimators
- from ..utils.fixes import parallel_helper
+ from ..utils.fixes import parallel_helper, _joblib_parallel_args
  from ..utils.multiclass import check_classification_targets
  from ..utils.validation import check_is_fitted

@@ -174,7 +173,7 @@ def apply(self, X):
  """
  X = self._validate_X_predict(X)
  results = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
- prefer="threads")(
+ **_joblib_parallel_args(prefer="threads"))(
  delayed(parallel_helper)(tree, 'apply', X, check_input=False)
  for tree in self.estimators_)

@@ -205,7 +204,7 @@ def decision_path(self, X):
  """
  X = self._validate_X_predict(X)
  indicators = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
- prefer="threads")(
+ **_joblib_parallel_args(prefer='threads'))(
  delayed(parallel_helper)(tree, 'decision_path', X,
  check_input=False)
  for tree in self.estimators_)
@@ -323,11 +322,11 @@ def fit(self, X, y, sample_weight=None):
  # Parallel loop: we prefer the threading backend as the Cython code
  # for fitting the trees is internally releasing the Python GIL
  # making threading more efficient than multiprocessing in
- # that case. However, we respect any parallel_backend contexts set
- # at a higher level, since correctness does not rely on using
- # threads.
+ # that case. However, for joblib 0.12+ we respect any
+ # parallel_backend contexts set at a higher level,
+ # since correctness does not rely on using threads.
  trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
- prefer="threads")(
+ **_joblib_parallel_args(prefer='threads'))(
  delayed(_parallel_build_trees)(
  t, self, X, y, sample_weight, i, len(trees),
  verbose=self.verbose, class_weight=self.class_weight)
@@ -374,7 +373,7 @@ def feature_importances_(self):
  check_is_fitted(self, 'estimators_')

  all_importances = Parallel(n_jobs=self.n_jobs,
- prefer="threads")(
+ **_joblib_parallel_args(prefer='threads'))(
  delayed(getattr)(tree, 'feature_importances_')
  for tree in self.estimators_)

@@ -590,7 +589,8 @@ class in a leaf.
  all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
  for j in np.atleast_1d(self.n_classes_)]
  lock = threading.Lock()
- Parallel(n_jobs=n_jobs, verbose=self.verbose, require="sharedmem")(
+ Parallel(n_jobs=n_jobs, verbose=self.verbose,
+ **_joblib_parallel_args(require="sharedmem"))(
  delayed(_accumulate_prediction)(e.predict_proba, X, all_proba,
  lock)
  for e in self.estimators_)
@@ -698,7 +698,8 @@ def predict(self, X):

  # Parallel loop
  lock = threading.Lock()
- Parallel(n_jobs=n_jobs, verbose=self.verbose, require="sharedmem")(
+ Parallel(n_jobs=n_jobs, verbose=self.verbose,
+ **_joblib_parallel_args(require="sharedmem"))(
  delayed(_accumulate_prediction)(e.predict, X, [y_hat], lock)
  for e in self.estimators_)

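Note on the sklearn/ensemble/forest.py change: every `Parallel(...)` call now routes its `prefer=` and `require=` hints through `_joblib_parallel_args`, whose body is not shown in this excerpt. The sketch below is one plausible shape for such a compatibility shim, not the actual helper. It assumes joblib 0.12+ accepts `prefer` and `require` directly, and that older releases only understand `backend="threading"` or `backend="multiprocessing"`. The function name and the explicit `joblib_version` tuple parameter are hypothetical, chosen so the example runs without joblib installed.

```python
def joblib_parallel_args_sketch(joblib_version, **kwargs):
    """Map joblib 0.12+ hints to arguments an older joblib accepts.

    joblib_version is a tuple such as (0, 11); a real helper would
    presumably read it from the installed joblib instead.
    """
    if joblib_version >= (0, 12):
        return kwargs  # newer joblib understands prefer/require as-is

    unsupported = set(kwargs) - {"prefer", "require"}
    if unsupported:
        raise NotImplementedError(
            "unhandled arguments: %s" % sorted(unsupported))

    args = {}
    if "prefer" in kwargs:
        prefer = kwargs["prefer"]
        if prefer not in ("threads", "processes", None):
            raise ValueError("prefer=%r is not supported" % (prefer,))
        # A soft hint becomes a hard backend choice on old joblib.
        args["backend"] = {"threads": "threading",
                           "processes": "multiprocessing",
                           None: None}[prefer]
    if "require" in kwargs:
        require = kwargs["require"]
        if require not in ("sharedmem", None):
            raise ValueError("require=%r is not supported" % (require,))
        if require == "sharedmem":
            args["backend"] = "threading"
    return args


print(joblib_parallel_args_sketch((0, 13), prefer="threads"))
print(joblib_parallel_args_sketch((0, 11), prefer="threads"))
print(joblib_parallel_args_sketch((0, 11), require="sharedmem"))
```

Under these assumptions the fallback turns a preference into a fixed backend, which matches the revised comment in `fit` saying that `parallel_backend` contexts are respected only for joblib 0.12+.
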