ENH Major to Minor incremental enhancements to the model_selection · raghavrv/scikit-learn@19c2a78 · GitHub


Commit 19c2a78

ENH Major to Minor incremental enhancements to the model_selection
Squashed commit messages - (For reference)

Major
-----

* ENH p --> n_labels
* FIX *ShuffleSplit: all float/invalid type errors at init and int error at split
* FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings
* ENH+TST KFold: make rng to be generated at every split call for reproducibility
* FIX/MAINT KFold: make shuffle a public attr
* FIX Make CVIterableWrapper private.
* FIX reuse len_cv instead of recalculating it
* FIX Prevent adding *SearchCV estimators from the old grid_search module
* re-FIX In all_estimators: the sorting to use only the 1st item (name)
  To avoid collision between the old and the new GridSearch classes.
* FIX test_validate.py: Use 2D X (1D X is being detected as a single sample)
* MAINT validate.py --> validation.py
* MAINT make the submodules private
* MAINT Support old cv/gs/lc until 0.19
* FIX/MAINT n_splits --> get_n_splits
* FIX/TST test_logistic.py/test_ovr_multinomial_iris: pass predefined folds as an iterable
* MAINT expose BaseCrossValidator
* Update the model_selection module with changes from master
  - From scikit-learn#5161
    - MAINT remove redundant p variable
    - Add check for sparse prediction in cross_val_predict
  - From scikit-learn#5201
    - DOC improve random_state param doc
  - From scikit-learn#5190
    - LabelKFold and test
  - From scikit-learn#4583
    - LabelShuffleSplit and tests
  - From scikit-learn#5300
    - shuffle the `labels` not the `indxs` in LabelKFold + tests

Minor
-----

* ENH Make the KFold shuffling test stronger
* FIX/DOC Use the higher level model_selection module as ref
* DOC in check_cv "y : array-like, optional"
* DOC a supervised learning problem --> supervised learning problems
* DOC cross-validators --> cross-validation strategies
* DOC Correct Olivier Grisel's name ;)
* MINOR/FIX cv_indices --> kfold
* FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut
* TST/FIX imports on separate lines
* FIX use __class__ instead of classmethod
* TST/FIX import directly from model_selection
* COSMIT Relocate the random_state documentation
* COSMIT remove pass
* MAINT Remove deprecation warnings from old tests
* FIX correct import at test_split
* FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse
* FIX random state to avoid doctest failure
* TST n_splits and split wrapping of _CVIterableWrapper
* FIX/MAINT Use multilabel indicator matrix directly
* TST/DOC clarify why we conflate classes 0 and 1
* DOC add comment that this was taken from BaseEstimator
* FIX use of labels is not needed in stratified k fold
* Fix cross_validation reference
* Fix the labels param doc
1 parent ed8728a commit 19c2a78

24 files changed

+861 -417 lines changed

sklearn/cross_validation.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@
3737
"model_selection module into which all the refactored classes "
3838
"and functions are moved. Also note that the interface of the "
3939
"new CV iterators are different from that of this module. "
40-
"Refer to model_selection for more info.", DeprecationWarning)
40+
"This module will be removed in 0.19.", DeprecationWarning)
4141

4242

4343
__all__ = ['KFold',
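
The hunk above changes the text of a warning that fires when the deprecated module is imported. As a rough sketch of that pattern, the snippet below wraps the call in a hypothetical helper, `warn_deprecated_module` (an illustrative name, not part of scikit-learn), and uses an abbreviated message:

```python
import warnings


def warn_deprecated_module():
    # Hypothetical stand-in for the module-level warnings.warn call in the
    # diff; the message here is abbreviated from the real one.
    warnings.warn("This module has been deprecated in favor of the "
                  "model_selection module. "
                  "This module will be removed in 0.19.", DeprecationWarning)


# DeprecationWarning is often filtered out by default, so record warnings
# explicitly in order to observe it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated_module()

print(caught[0].category.__name__)
print(caught[0].message)
```

Naming the removal version in the message gives users a concrete deadline for migrating, which the previous wording did not.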

sklearn/feature_selection/rfe.py

Lines changed: 4 additions & 4 deletions
@@ -15,7 +15,7 @@
1515
from ..base import clone
1616
from ..base import is_classifier
1717
from ..model_selection import check_cv
18-
from ..model_selection.validate import _safe_split, _score
18+
from ..model_selection._validation import _safe_split, _score
1919
from ..metrics.scorer import check_scoring
2020
from .base import SelectorMixin
2121

@@ -447,7 +447,7 @@ def fit(self, X, y):
447447
self.estimator_.set_params(**self.estimator_params)
448448
self.estimator_.fit(self.transform(X), y)
449449

450-
# Fixing a normalization error, n is equal to len_cv - 1
451-
# here, the scores are normalized by len_cv
452-
self.grid_scores_ = scores / cv.n_splits(X, y)
450+
# Fixing a normalization error, n is equal to get_n_splits(X, y) - 1
451+
# here, the scores are normalized by get_n_splits(X, y)
452+
self.grid_scores_ = scores / cv.get_n_splits(X, y)
453453
return self
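
The rename from `n_splits(X, y)` to `get_n_splits(X, y)` belongs to the new cross-validator interface, in which the object is constructed without data and the data is passed at call time. The sketch below illustrates that interface with a hypothetical `ToyKFold` class. It is not scikit-learn's `KFold`: it assigns samples to folds by interleaving, and the per-fold "score" is a made-up placeholder.

```python
class ToyKFold:
    """Hypothetical stand-in for a new-style cross-validator."""

    def __init__(self, n_splits=3):
        # No data in the constructor; data is supplied to split().
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, labels=None):
        return self.n_splits

    def split(self, X, y=None, labels=None):
        indices = list(range(len(X)))
        for fold in range(self.n_splits):
            test = indices[fold::self.n_splits]  # interleaved assignment
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0, 1]
cv = ToyKFold(n_splits=3)

# Accumulate one placeholder score per split, then normalize by the number
# of splits, mirroring `scores / cv.get_n_splits(X, y)` in the diff.
total = 0.0
for train, test in cv.split(X, y):
    total += len(test) / len(X)
mean_score = total / cv.get_n_splits(X, y)
print(mean_score)
```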

sklearn/grid_search.py

Lines changed: 2 additions & 1 deletion
@@ -38,7 +38,8 @@
3838

3939
warnings.warn("This module has been deprecated in favor of the "
4040
"model_selection module into which all the refactored classes "
41-
"and functions are moved.", DeprecationWarning)
41+
"and functions are moved. This module will be removed in 0.19.",
42+
DeprecationWarning)
4243

4344

4445
class ParameterGrid(object):

sklearn/learning_curve.py

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@
1818

1919

2020
warnings.warn("This module has been deprecated in favor of the "
21-
"model_selection module into which all the functions are moved.",
21+
"model_selection module into which all the functions are moved."
22+
" This module will be removed in 0.19",
2223
DeprecationWarning)
2324

2425

sklearn/linear_model/coordinate_descent.py

Lines changed: 0 additions & 24 deletions
@@ -1370,7 +1370,6 @@ class ElasticNetCV(LinearModelCV, RegressorMixin):
13701370
dual gap for optimality and continues until it is smaller
13711371
than ``tol``.
13721372
1373-
<<<<<<< HEAD
13741373
cv : int, cross-validation generator or an iterable, optional
13751374
Determines the cross-validation splitting strategy.
13761375
Possible inputs for cv are:
@@ -1383,13 +1382,6 @@ class ElasticNetCV(LinearModelCV, RegressorMixin):
13831382
13841383
Refer :ref:`User Guide <cross_validation>` for the various
13851384
cross-validation strategies that can be used here.
1386-
=======
1387-
cv : integer or cross-validation generator, optional
1388-
If an integer is passed, it is the number of fold (default 3).
1389-
Specific cross-validation objects can be passed, see the
1390-
:mod:`sklearn.model_selection.split` module for the list of
1391-
possible objects.
1392-
>>>>>>> ENH introduce the model_selection module
13931385
13941386
verbose : bool or integer
13951387
Amount of verbosity.
@@ -1860,7 +1852,6 @@ class MultiTaskElasticNetCV(LinearModelCV, RegressorMixin):
18601852
dual gap for optimality and continues until it is smaller
18611853
than ``tol``.
18621854
1863-
<<<<<<< HEAD
18641855
cv : int, cross-validation generator or an iterable, optional
18651856
Determines the cross-validation splitting strategy.
18661857
Possible inputs for cv are:
@@ -1873,13 +1864,6 @@ class MultiTaskElasticNetCV(LinearModelCV, RegressorMixin):
18731864
18741865
Refer :ref:`User Guide <cross_validation>` for the various
18751866
cross-validation strategies that can be used here.
1876-
=======
1877-
cv : integer or cross-validation generator, optional
1878-
If an integer is passed, it is the number of fold (default 3).
1879-
Specific cross-validation objects can be passed, see the
1880-
:mod:`sklearn.model_selection.split` module for the list of
1881-
possible objects.
1882-
>>>>>>> ENH introduce the model_selection module
18831867
18841868
verbose : bool or integer
18851869
Amount of verbosity.
@@ -2025,7 +2009,6 @@ class MultiTaskLassoCV(LinearModelCV, RegressorMixin):
20252009
dual gap for optimality and continues until it is smaller
20262010
than ``tol``.
20272011
2028-
<<<<<<< HEAD
20292012
cv : int, cross-validation generator or an iterable, optional
20302013
Determines the cross-validation splitting strategy.
20312014
Possible inputs for cv are:
@@ -2038,13 +2021,6 @@ class MultiTaskLassoCV(LinearModelCV, RegressorMixin):
20382021
20392022
Refer :ref:`User Guide <cross_validation>` for the various
20402023
cross-validation strategies that can be used here.
2041-
=======
2042-
cv : integer or cross-validation generator, optional
2043-
If an integer is passed, it is the number of fold (default 3).
2044-
Specific cross-validation objects can be passed, see the
2045-
:mod:`sklearn.model_selection.split` module for the list of
2046-
possible objects.
2047-
>>>>>>> ENH introduce the model_selection module
20482024
20492025
verbose : bool or integer
20502026
Amount of verbosity.

sklearn/linear_model/least_angle.py

Lines changed: 1 addition & 1 deletion
@@ -1086,7 +1086,7 @@ def fit(self, X, y):
10861086
method=self.method, verbose=max(0, self.verbose - 1),
10871087
normalize=self.normalize, fit_intercept=self.fit_intercept,
10881088
max_iter=self.max_iter, eps=self.eps, positive=self.positive)
1089-
for train, test in cv)
1089+
for train, test in cv.split(X, y))
10901090
all_alphas = np.concatenate(list(zip(*cv_paths))[0])
10911091
# Unique also sorts
10921092
all_alphas = np.unique(all_alphas)
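
The change from `for train, test in cv` to `for train, test in cv.split(X, y)` means callers no longer iterate over the cross-validation object itself. The commit message also mentions a private wrapper for plain iterables of folds. The class below, `IterableCVWrapper`, is a hypothetical sketch of how such a wrapper could expose the same `split`/`get_n_splits` interface. It is an assumption for illustration, not scikit-learn's implementation.

```python
class IterableCVWrapper:
    """Hypothetical wrapper that gives a list of (train, test) pairs the
    split()/get_n_splits() interface."""

    def __init__(self, cv):
        self.cv = list(cv)  # materialize so the folds can be reused

    def get_n_splits(self, X=None, y=None, labels=None):
        return len(self.cv)

    def split(self, X=None, y=None, labels=None):
        # The data arguments are ignored because the folds are already fixed.
        for train, test in self.cv:
            yield train, test


folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
cv = IterableCVWrapper(folds)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# Callers iterate over cv.split(X, y), not over cv itself.
test_sizes = [len(test) for train, test in cv.split(X, y)]
print(test_sizes, cv.get_n_splits())
```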

sklearn/linear_model/logistic.py

Lines changed: 1 addition & 1 deletion
@@ -1308,7 +1308,7 @@ class LogisticRegressionCV(LogisticRegression, BaseEstimator,
13081308
cv : integer or cross-validation generator
13091309
The default cross-validation generator used is Stratified K-Folds.
13101310
If an integer is provided, then it is the number of folds used.
1311-
See the module :mod:`sklearn.model_selection.split` module for the
1311+
See the module :mod:`sklearn.model_selection` module for the
13121312
list of possible cross-validation objects.
13131313
13141314
penalty : str, 'l1' or 'l2'

sklearn/linear_model/tests/test_logistic.py

Lines changed: 11 additions & 4 deletions
@@ -1,4 +1,3 @@
1-
21
import numpy as np
32
import scipy.sparse as sp
43
from scipy import linalg, optimize, sparse
@@ -454,16 +453,24 @@ def test_ovr_multinomial_iris():
454453
train, target = iris.data, iris.target
455454
n_samples, n_features = train.shape
456455

457-
# Use pre-defined fold as folds generated for different y
456+
# The cv indices from stratified kfold (where stratification is done based
457+
# on the fine-grained iris classes, i.e, before the classes 0 and 1 are
458+
# conflated) is used for both clf and clf1
458459
cv = StratifiedKFold(3)
459-
clf = LogisticRegressionCV(cv=cv)
460+
precomputed_folds = list(cv.split(train, target))
461+
462+
# Train clf on the original dataset where classes 0 and 1 are separated
463+
clf = LogisticRegressionCV(cv=precomputed_folds)
460464
clf.fit(train, target)
461465

462-
clf1 = LogisticRegressionCV(cv=cv)
466+
# Conflate classes 0 and 1 and train clf1 on this modifed dataset
467+
clf1 = LogisticRegressionCV(cv=precomputed_folds)
463468
target_copy = target.copy()
464469
target_copy[target_copy == 0] = 1
465470
clf1.fit(train, target_copy)
466471

472+
# Ensure that what OvR learns for class2 is same regardless of whether
473+
# classes 0 and 1 are separated or not
467474
assert_array_almost_equal(clf.scores_[2], clf1.scores_[2])
468475
assert_array_almost_equal(clf.intercept_[2:], clf1.intercept_)
469476
assert_array_almost_equal(clf.coef_[2][np.newaxis, :], clf1.coef_)
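
The test now materializes the folds once and passes the same list to both estimators, because a stratified splitter recomputes its folds from whatever `y` it is given. The sketch below shows that effect with a hypothetical toy splitter, `stratified_folds`, which deals each class's indices round-robin into folds. It is not scikit-learn's `StratifiedKFold`, and the data is made up.

```python
from collections import defaultdict


def stratified_folds(y, n_splits):
    """Toy stratified splitter (hypothetical): deal the indices of each
    class round-robin into n_splits test folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    test_folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            test_folds[pos % n_splits].append(idx)
    all_idx = set(range(len(y)))
    return [(sorted(all_idx - set(test)), sorted(test))
            for test in test_folds]


target = [0, 0, 0, 1, 1, 2, 2, 2]
# Conflate classes 0 and 1, as the test does with target_copy.
target_copy = [1 if label == 0 else label for label in target]

# Folds computed once from the fine-grained classes and reused for both fits.
precomputed_folds = stratified_folds(target, 2)

# Recomputing from the conflated target yields different folds, so the two
# fits would no longer see the same train/test partitions.
recomputed_folds = stratified_folds(target_copy, 2)
print(precomputed_folds != recomputed_folds)
```

Passing `precomputed_folds` to both estimators keeps the partitions identical, which is what makes comparing the class-2 results meaningful.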

sklearn/metrics/scorer.py

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@
44
arbitrary score functions.
55
66
A scorer object is a callable that can be passed to
7-
:class:`sklearn.model_selection.search.GridSearchCV` or
8-
:func:`sklearn.model_selection.validation.cross_val_score` as the ``scoring``
7+
:class:`sklearn.model_selection.GridSearchCV` or
8+
:func:`sklearn.model_selection.cross_val_score` as the ``scoring``
99
parameter, to specify how a model should be evaluated.
1010
1111
The signature of the call is ``(estimator, X, y)`` where ``estimator``

sklearn/model_selection/__init__.py

Lines changed: 32 additions & 45 deletions
@@ -1,48 +1,35 @@
1-
from .split import KFold
2-
from .split import StratifiedKFold
3-
from .split import LeaveOneLabelOut
4-
from .split import LeaveOneOut
5-
from .split import LeavePLabelOut
6-
from .split import LeavePOut
7-
from .split import ShuffleSplit
8-
from .split import StratifiedShuffleSplit
9-
from .split import PredefinedSplit
10-
from .split import train_test_split
11-
from .split import check_cv
1+
from ._split import BaseCrossValidator
2+
from ._split import KFold
3+
from ._split import LabelKFold
4+
from ._split import StratifiedKFold
5+
from ._split import LeaveOneLabelOut
6+
from ._split import LeaveOneOut
7+
from ._split import LeavePLabelOut
8+
from ._split import LeavePOut
9+
from ._split import ShuffleSplit
10+
from ._split import LabelShuffleSplit
11+
from ._split import StratifiedShuffleSplit
12+
from ._split import PredefinedSplit
13+
from ._split import train_test_split
14+
from ._split import check_cv
1215

13-
from .validate import cross_val_score
14-
from .validate import cross_val_predict
15-
from .validate import learning_curve
16-
from .validate import permutation_test_score
17-
from .validate import validation_curve
16+
from ._validation import cross_val_score
17+
from ._validation import cross_val_predict
18+
from ._validation import learning_curve
19+
from ._validation import permutation_test_score
20+
from ._validation import validation_curve
1821

19-
from .search import GridSearchCV
20-
from .search import RandomizedSearchCV
21-
from .search import ParameterGrid
22-
from .search import ParameterSampler
23-
from .search import fit_grid_point
22+
from ._search import GridSearchCV
23+
from ._search import RandomizedSearchCV
24+
from ._search import ParameterGrid
25+
from ._search import ParameterSampler
26+
from ._search import fit_grid_point
2427

25-
__all__ = ('split',
26-
'validate',
27-
'search',
28-
'KFold',
29-
'StratifiedKFold',
30-
'LeaveOneLabelOut',
31-
'LeaveOneOut',
32-
'LeavePLabelOut',
33-
'LeavePOut',
34-
'ShuffleSplit',
35-
'StratifiedShuffleSplit',
36-
'PredefinedSplit',
37-
'train_test_split',
38-
'check_cv',
39-
'cross_val_score',
40-
'cross_val_predict',
41-
'permutation_test_score',
42-
'learning_curve',
43-
'validation_curve',
44-
'GridSearchCV',
45-
'ParameterGrid',
46-
'fit_grid_point',
47-
'ParameterSampler',
48-
'RandomizedSearchCV')
28+
__all__ = ('BaseCrossValidator', 'GridSearchCV', 'KFold', 'LabelKFold',
29+
'LeaveOneLabelOut', 'LeaveOneOut', 'LeavePLabelOut', 'LeavePOut',
30+
'ParameterGrid', 'ParameterSampler', 'PredefinedSplit',
31+
'RandomizedSearchCV', 'ShuffleSplit', 'LabelShuffleSplit',
32+
'StratifiedKFold', 'StratifiedShuffleSplit', 'check_cv',
33+
'cross_val_predict', 'cross_val_score', 'fit_grid_point',
34+
'learning_curve', 'permutation_test_score', 'train_test_split',
35+
'validation_curve')

sklearn/model_selection/search.py renamed to sklearn/model_selection/_search.py

Lines changed: 16 additions & 18 deletions
@@ -21,8 +21,8 @@
2121

2222
from ..base import BaseEstimator, is_classifier, clone
2323
from ..base import MetaEstimatorMixin, ChangedBehaviorWarning
24-
from .split import check_cv
25-
from .validate import _fit_and_score
24+
from ._split import check_cv
25+
from ._validation import _fit_and_score
2626
from ..externals.joblib import Parallel, delayed
2727
from ..externals import six
2828
from ..utils import check_random_state
@@ -527,7 +527,7 @@ def _fit(self, X, y, labels, parameter_iterable):
527527
'of samples (%i) than data (X: %i samples)'
528528
% (len(y), n_samples))
529529
cv = check_cv(cv, y, classifier=is_classifier(estimator))
530-
len_cv = cv.n_splits(X, y, labels)
530+
len_cv = cv.get_n_splits(X, y, labels)
531531

532532
if self.verbose > 0:
533533
if isinstance(parameter_iterable, Sized):
@@ -552,16 +552,15 @@ def _fit(self, X, y, labels, parameter_iterable):
552552

553553
# Out is a list of triplet: score, estimator, n_test_samples
554554
n_fits = len(out)
555-
n_folds = cv.n_splits(X, y, labels)
556555

557556
scores = list()
558557
grid_scores = list()
559-
for grid_start in range(0, n_fits, n_folds):
558+
for grid_start in range(0, n_fits, len_cv):
560559
n_test_samples = 0
561560
score = 0
562561
all_scores = []
563562
for this_score, this_n_test_samples, _, parameters in \
564-
out[grid_start:grid_start + n_folds]:
563+
out[grid_start:grid_start + len_cv]:
565564
all_scores.append(this_score)
566565
if self.iid:
567566
this_score *= this_n_test_samples
@@ -570,7 +569,7 @@ def _fit(self, X, y, labels, parameter_iterable):
570569
if self.iid:
571570
score /= float(n_test_samples)
572571
else:
573-
score /= float(n_folds)
572+
score /= float(len_cv)
574573
scores.append((score, parameters))
575574
# TODO: shall we also store the test_fold_sizes?
576575
grid_scores.append(_CVScoreTuple(
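
The hunks above reuse `len_cv` as the stride for walking the flat list of fit results and as the divisor when `iid` is false. The sketch below is a simplified rendering of that averaging logic. The helper name `mean_scores` and the two-element `(score, n_test_samples)` tuples are assumptions for illustration; the real `out` entries carry more fields.

```python
def mean_scores(out, len_cv, iid=True):
    """Average per-fold scores for each parameter setting.

    `out` is a flat list of (score, n_test_samples) pairs in which the
    len_cv folds of one parameter setting are contiguous.
    """
    means = []
    for grid_start in range(0, len(out), len_cv):
        score = 0.0
        n_test_samples = 0
        for this_score, this_n in out[grid_start:grid_start + len_cv]:
            if iid:
                # Weight each fold by its number of test samples.
                this_score *= this_n
                n_test_samples += this_n
            score += this_score
        if iid:
            score /= float(n_test_samples)
        else:
            score /= float(len_cv)
        means.append(score)
    return means


# Two parameter settings, two folds each (made-up numbers).
out = [(0.5, 10), (1.0, 30), (0.8, 20), (0.6, 20)]
weighted = mean_scores(out, len_cv=2, iid=True)
unweighted = mean_scores(out, len_cv=2, iid=False)
print(weighted, unweighted)
```

For the first setting the folds have unequal sizes, so the weighted mean (35/40) differs from the plain mean (0.75). For the second setting the folds are equal in size and the two agree.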
@@ -667,10 +666,10 @@ class GridSearchCV(BaseSearchCV):
667666
For integer/None inputs, ``StratifiedKFold`` is used for classification
668667
tasks, when ``y`` is binary or multiclass.
669668
670-
See the :mod:`sklearn.model_selection.split` module for the list of
671-
cross-validation generators that can be used here.
669+
See the :mod:`sklearn.model_selection` module for the list of
670+
cross-validation strategies that can be used here.
672671
673-
Also refer :ref:`cross-validation documentation <_cross_validation>`
672+
Also refer :ref:`cross-validation documentation <cross_validation>`
674673
675674
refit : boolean, default=True
676675
Refit the best estimator with the entire dataset.
@@ -680,10 +679,6 @@ class GridSearchCV(BaseSearchCV):
680679
verbose : integer
681680
Controls the verbosity: the higher, the more messages.
682681
683-
random_state : int or RandomState
684-
Pseudo random number generator state used for random uniform sampling
685-
from lists of possible values instead of scipy.stats distributions.
686-
687682
error_score : 'raise' (default) or numeric
688683
Value to assign to the score if an error occurs in estimator fitting.
689684
If set to 'raise', the error is raised. If a numeric value is given,
@@ -877,10 +872,10 @@ class RandomizedSearchCV(BaseSearchCV):
877872
For integer/None inputs, ``StratifiedKFold`` is used for classification
878873
tasks, when ``y`` is binary or multiclass.
879874
880-
See the :mod:`sklearn.model_selection.split` module for the list of
881-
cross-validation generators that can be used here.
875+
See the :mod:`sklearn.model_selection` module for the list of
876+
cross-validation strategies that can be used here.
882877
883-
Also refer :ref:`cross-validation documentation <_cross_validation>`
878+
Also refer :ref:`cross-validation documentation <cross_validation>`
884879
885880
refit : boolean, default=True
886881
Refit the best estimator with the entire dataset.
@@ -890,13 +885,16 @@ class RandomizedSearchCV(BaseSearchCV):
890885
verbose : integer
891886
Controls the verbosity: the higher, the more messages.
892887
888+
random_state : int or RandomState
889+
Pseudo random number generator state used for random uniform sampling
890+
from lists of possible values instead of scipy.stats distributions.
891+
893892
error_score : 'raise' (default) or numeric
894893
Value to assign to the score if an error occurs in estimator fitting.
895894
If set to 'raise', the error is raised. If a numeric value is given,
896895
FitFailedWarning is raised. This parameter does not affect the refit
897896
step, which will always raise the error.
898897
899-
900898
Attributes
901899
----------
902900
grid_scores_ : list of named tuples
