[MRG + 2] ENH Allow `cross_val_score`, `GridSearchCV` et al. to evaluate on multiple metrics · dmohns/scikit-learn@11c99d6

Commit 11c99d6

raghavrv authored and dmohns committed
[MRG + 2] ENH Allow cross_val_score, GridSearchCV et al. to evaluate on multiple metrics (scikit-learn#7388)
* ENH cross_val_score now supports multiple metrics
* DOCFIX permutation_test_score
* ENH validate multiple metric scorers
* ENH Move validation of multimetric scoring param out
* ENH GridSearchCV and RandomizedSearchCV now support multiple metrics
* EXA Add an example demonstrating the multiple metric in GridSearchCV
* ENH Let check_multimetric_scoring tell if its multimetric or not
* FIX For single metric name of scorer should remain 'score'
* ENH validation_curve and learning_curve now support multiple metrics
* MNT move _aggregate_score_dicts helper into _validation.py
* TST More testing/ Fixing scores to the correct values
* EXA Add cross_val_score to multimetric example
* Rename to multiple_metric_evaluation.py
* MNT Remove scaffolding
* FIX doctest imports
* FIX wrap the scorer and unwrap the score when using _score() in rfe
* TST Cleanup the tests. Test for is_multimetric too
* TST Make sure it registers as single metric when scoring is of that type
* PEP8
* Don't use dict comprehension to make it work in python2.6
* ENH/FIX/TST grid_scores_ should not be available for multimetric evaluation
* FIX+TST delegated methods NA when multimetric is enabled... TST Add general tests to GridSearchCV and RandomizedSearchCV
* ENH add option to disable delegation on multimetric scoring
* Remove old function from __all__
* flake8
* FIX revert disable_on_multimetric
* stash
* Fix incorrect rebase
* [ci skip]
* Make sure refit works as expected and remove irrelevant tests
* Allow passing standard scorers by name in multimetric scorers
* Fix example
* flake8
* Address reviews
* Fix indentation
* Ensure {'acc': 'accuracy'} and ['precision'] are valid inputs
* Test that for single metric, 'score' is a key
* Typos
* Fix incorrect rebase
* Compare multimetric grid search with multiple single metric searches
* Test X, y list and pandas input; Test multimetric for unsupervised grid search
* Fix tests; Unsupervised multimetric gs will not pass until scikit-learn#8117 is merged
* Make a plot of Precision vs ROC AUC for RandomForest varying the n_estimators
* Add example to grid_search.rst
* Use the classic tuning of C param in SVM instead of estimators in RF
* FIX Remove scoring arg in deafult scorer test
* flake8
* Search for min_samples_split in DTC; Also show f-score
* REVIEW Make check_multimetric_scoring private
* FIX Add more samples to see if 3% mismatch on 32 bit systems gets fixed
* REVIEW Plot best score; Shorten legends
* REVIEW/COSMIT multimetric --> multi-metric
* REVIEW Mark the best scores of P/R scores too
* Revert "FIX Add more samples to see if 3% mismatch on 32 bit systems gets fixed" This reverts commit ba766d9.
* ENH Use looping for iid testing
* FIX use param grid as scipy's stats dist in 0.12 do not accept seed
* ENH more looping less code; Use small non-noisy dataset
* FIX Use named arg after expanded args
* TST More testing of the refit parameter
* Test that in multimetric search refit to single metric, the delegated methods work as expected.
* Test that setting probability=False works with multimetric too
* Test refit=False gives sensible error
* COSMIT multimetric --> multi-metric
* REV Correct example doc
* COSMIT
* REVIEW Make tests stronger; Fix bugs in _check_multimetric_scorer
* REVIEW refit param: Raise for empty strings
* TST Invalid refit params
* REVIEW Use <scorer_name> alone; recall --> Recall
* REV specify when we expect scorers to not be None
* FLAKE8
* REVERT multimetrics in learning_curve and validation_curve
* REVIEW Simpler coding style
* COSMIT
* COSMIT
* REV Compress example a bit. Move comment to top
* FIX fit_grid_point's previous API must be preserved
* Flake8
* TST Use loop; Compare with single-metric
* REVIEW Use dict-comprehension instead of helper
* REVIEW Remove redundant test
* Fix tests incorrect braces
* COSMIT
* REVIEW Use regexp
* REV Simplify aggregation of score dicts
* FIX precision and accuracy test
* FIX doctest and flake8
* TST the best_* attributes multimetric with single metric
* Address @jnothman's review
* Address more comments \o/
* DOCFIXES
* Fix use the validated fit_param from fit's arguments
* Revert alpha to a lower value as before
* Using def instead of lambda
* Address @jnothman's review batch 1: Fix tests / Doc fixes
* Remove superfluous tests
* Remove more superfluous testing
* TST/FIX loop over refit and check found n_clusters
* Cosmetic touches
* Use zip instead of manually listing the keys
* Fix inverse_transform
* FIX bug in fit_grid_point; Allow only single score TST if fit_grid_point works as intended
* ENH Use only ROC-AUC and F1-score
* Fix typos and flake8; Address Andy's reviews MNT Add a comment on why we do such a transpose + some fixes
* ENH Better error messages for incorrect multimetric scoring values +... ENH Avoid exception traceback while using incorrect scoring string
* Dict keys must be of string type only
* 1. Better error message for invalid scoring 2... Internal functions return single score for single metric scoring
* Fix test failures and shuffle tests
* Avoid wrapping scorer as dict in learning_curve
* Remove doc example as asked for
* Some leftover ones
* Don't wrap scorer in validation_curve either
* Add a doc example and skip it as dict order fails doctest
* Import zip from six for python2.7 compat
* Make cross_val_score return a cv_results-like dict
* Add relevant sections to userguide
* Flake8 fixes
* Add whatsnew and fix broken links
* Use AUC and accuracy instead of f1
* Fix failing doctests cross_validation.rst
* DOC add the wrapper example for metrics that return multiple return values
* Address andy's comments
* Be less weird
* Address more of andy's comments
* Make a separate cross_validate function to return dict and a cross_val_score
* Update the docs to reflect the new cross_validate function
* Add cross_validate to toc-tree
* Add more tests on type of cross_validate return and time limits
* FIX failing doctests
* FIX ensure keys are not plural
* DOC fix
* Address some pending comments
* Remove the comment as it is irrelevant now
* Remove excess blank line
* Fix flake8 inconsistencies
* Allow fit_times to be 0 to conform with windows precision
* DOC specify how refit param is to be set in multiple metric case
* TST ensure cross_validate works for string single metrics + address @jnothman's reviews
* Doc fixes
* Remove the shape and transform parameter of _aggregate_score_dicts
* Address Joel's doc comments
* Fix broken doctest
* Fix the spurious file
* Address Andy's comments
* MNT Remove erroneous entry
* Address Andy's comments
* FIX broken links
* Update whats_new.rst missing newline
1 parent 4b0248e commit 11c99d6

File tree: 13 files changed (+1406, -237 lines)


doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -223,6 +223,7 @@ Model validation
    :toctree: generated/
    :template: function.rst
 
+   model_selection.cross_validate
    model_selection.cross_val_score
    model_selection.cross_val_predict
    model_selection.permutation_test_score

doc/modules/cross_validation.rst

Lines changed: 60 additions & 1 deletion
@@ -172,6 +172,65 @@ validation iterator instead, for instance::
 
 See :ref:`combining_estimators`.
 
+
+.. _multimetric_cross_validation:
+
+The cross_validate function and multiple metric evaluation
+----------------------------------------------------------
+
+The ``cross_validate`` function differs from ``cross_val_score`` in two ways:
+
+- It allows specifying multiple metrics for evaluation.
+
+- It returns a dict containing training scores, fit times and score times in
+  addition to the test score.
+
+For single metric evaluation, where the scoring parameter is a string,
+callable or None, the keys will be ``['test_score', 'fit_time', 'score_time']``.
+
+For multiple metric evaluation, the return value is a dict with the
+following keys:
+``['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']``
+
+``return_train_score`` is set to ``True`` by default. It adds train score keys
+for all the scorers. If train scores are not needed, this should be set to
+``False`` explicitly.
+
+The multiple metrics can be specified either as a list, tuple or set of
+predefined scorer names::
+
+    >>> from sklearn.model_selection import cross_validate
+    >>> from sklearn.metrics import recall_score
+    >>> scoring = ['precision_macro', 'recall_macro']
+    >>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
+    >>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
+    ...                         cv=5, return_train_score=False)
+    >>> sorted(scores.keys())
+    ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
+    >>> scores['test_recall_macro']  # doctest: +ELLIPSIS
+    array([ 0.96...,  1. ...,  0.96...,  0.96...,  1. ])
+
+Or as a dict mapping a scorer name to a predefined or custom scoring function::
+
+    >>> from sklearn.metrics.scorer import make_scorer
+    >>> scoring = {'prec_macro': 'precision_macro',
+    ...            'rec_micro': make_scorer(recall_score, average='macro')}
+    >>> scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
+    ...                         cv=5, return_train_score=True)
+    >>> sorted(scores.keys())  # doctest: +NORMALIZE_WHITESPACE
+    ['fit_time', 'score_time', 'test_prec_macro', 'test_rec_micro',
+     'train_prec_macro', 'train_rec_micro']
+    >>> scores['train_rec_micro']  # doctest: +ELLIPSIS
+    array([ 0.97...,  0.97...,  0.99...,  0.98...,  0.98...])
+
+Here is an example of ``cross_validate`` using a single metric::
+
+    >>> scores = cross_validate(clf, iris.data, iris.target,
+    ...                         scoring='precision_macro')
+    >>> sorted(scores.keys())
+    ['fit_time', 'score_time', 'test_score', 'train_score']
+
+
 Obtaining predictions by cross-validation
 -----------------------------------------
 
@@ -186,7 +245,7 @@ These prediction can then be used to evaluate the classifier::
     >>> from sklearn.model_selection import cross_val_predict
     >>> predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
     >>> metrics.accuracy_score(iris.target, predicted)  # doctest: +ELLIPSIS
-    0.966...
+    0.973...
 
 Note that the result of this computation may be slightly different from those
 obtained using :func:`cross_val_score` as the elements are grouped in different
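
For reference, a minimal sketch of how the dict returned by ``cross_validate`` can be inspected as a table. This snippet is not part of the commit; it only assumes that pandas is installed alongside scikit-learn:

    import pandas as pd
    from sklearn import datasets, svm
    from sklearn.model_selection import cross_validate

    iris = datasets.load_iris()
    clf = svm.SVC(kernel='linear', C=1, random_state=0)

    # Every key of the returned dict maps to an array with one entry per CV
    # split, so the dict can be viewed directly as a DataFrame.
    scores = cross_validate(clf, iris.data, iris.target, cv=5,
                            scoring=['precision_macro', 'recall_macro'])
    print(pd.DataFrame(scores))         # fit_time, score_time, test_*, train_* columns
    print(pd.DataFrame(scores).mean())  # average of each column over the 5 splits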

doc/modules/grid_search.rst

Lines changed: 25 additions & 0 deletions
@@ -84,6 +84,10 @@ evaluated and the best combination is retained.
   dataset. This is the best practice for evaluating the performance of a
   model with grid search.
 
+- See :ref:`sphx_glr_auto_examples_model_selection_plot_multi_metric_evaluation`
+  for an example of :class:`GridSearchCV` being used to evaluate multiple
+  metrics simultaneously.
+
 .. _randomized_parameter_search:
 
 Randomized Parameter Optimization
@@ -161,6 +165,27 @@ scoring function can be specified via the ``scoring`` parameter to
 specialized cross-validation tools described below.
 See :ref:`scoring_parameter` for more details.
 
+.. _multimetric_grid_search:
+
+Specifying multiple metrics for evaluation
+------------------------------------------
+
+``GridSearchCV`` and ``RandomizedSearchCV`` allow specifying multiple metrics
+for the ``scoring`` parameter.
+
+Multimetric scoring can either be specified as a list of strings of predefined
+score names or a dict mapping the scorer name to the scorer function and/or
+the predefined scorer name(s). See :ref:`multimetric_scoring` for more details.
+
+When specifying multiple metrics, the ``refit`` parameter must be set to the
+metric (string) for which the ``best_params_`` will be found and used to build
+the ``best_estimator_`` on the whole dataset. If the search should not be
+refit, set ``refit=False``. Leaving ``refit`` at its default value ``None``
+will result in an error when using multiple metrics.
+
+See :ref:`sphx_glr_auto_examples_model_selection_plot_multi_metric_evaluation`
+for an example usage.
+
 Composite estimators and parameter spaces
 -----------------------------------------
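
To make the ``refit`` requirement above concrete, here is a minimal sketch (not taken from the commit; the estimator, parameter grid and dataset are arbitrary illustration choices):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(random_state=0)
    scoring = {'AUC': 'roc_auc', 'Accuracy': 'accuracy'}

    # With multiple metrics, refit must either name the metric used to select
    # best_params_ and to refit best_estimator_ on the whole dataset ...
    gs = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]},
                      scoring=scoring, refit='AUC')
    gs.fit(X, y)
    print(gs.best_params_)                       # chosen by mean test AUC
    print(gs.cv_results_['mean_test_Accuracy'])  # other metrics are still reported

    # ... or be set to False, in which case only cv_results_ is populated and
    # best_estimator_ / best_params_ / best_score_ are not available.
    gs_no_refit = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]},
                               scoring=scoring, refit=False)
    gs_no_refit.fit(X, y)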

doc/modules/model_evaluation.rst

Lines changed: 45 additions & 0 deletions
@@ -210,6 +210,51 @@ the following two rules:
   Again, by convention higher numbers are better, so if your scorer
   returns loss, that value should be negated.
 
+.. _multimetric_scoring:
+
+Using multiple metric evaluation
+--------------------------------
+
+Scikit-learn also permits evaluation of multiple metrics in ``GridSearchCV``,
+``RandomizedSearchCV`` and ``cross_validate``.
+
+There are two ways to specify multiple scoring metrics for the ``scoring``
+parameter:
+
+- As an iterable of string metrics::
+      >>> scoring = ['accuracy', 'precision']
+
+- As a ``dict`` mapping the scorer name to the scoring function::
+      >>> from sklearn.metrics import accuracy_score
+      >>> from sklearn.metrics import make_scorer
+      >>> scoring = {'accuracy': make_scorer(accuracy_score),
+      ...            'prec': 'precision'}
+
+Note that the dict values can either be scorer functions or one of the
+predefined metric strings.
+
+Currently only those scorer functions that return a single score can be passed
+inside the dict. Scorer functions that return multiple values are not
+permitted and will require a wrapper to return a single metric::
+
+    >>> from sklearn.model_selection import cross_validate
+    >>> from sklearn.metrics import confusion_matrix
+    >>> # A sample toy binary classification dataset
+    >>> X, y = datasets.make_classification(n_classes=2, random_state=0)
+    >>> svm = LinearSVC(random_state=0)
+    >>> tp = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 0]
+    >>> tn = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 0]
+    >>> fp = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 0]
+    >>> fn = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 1]
+    >>> scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
+    ...            'fp': make_scorer(fp), 'fn': make_scorer(fn)}
+    >>> cv_results = cross_validate(svm.fit(X, y), X, y, scoring=scoring)
+    >>> # Getting the test set false positive scores
+    >>> print(cv_results['test_tp'])  # doctest: +NORMALIZE_WHITESPACE
+    [12 13 15]
+    >>> # Getting the test set false negative scores
+    >>> print(cv_results['test_fn'])  # doctest: +NORMALIZE_WHITESPACE
+    [5 4 1]
 
 .. _classification_metrics:
 
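A caveat on the committed snippet above: ``tp`` and ``tn`` both read ``confusion_matrix(y_true, y_pred)[0, 0]``, and the inline comments and index choices do not follow the usual labelling of the confusion matrix (rows are true labels, columns are predictions). A corrected wrapper, offered here as an editorial sketch rather than as part of the commit (note that it would also change the doctest outputs shown above), could be:

    from sklearn.metrics import confusion_matrix, make_scorer

    # For binary labels [0, 1], confusion_matrix(y_true, y_pred) is laid out as
    #     [[tn, fp],
    #      [fn, tp]]
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}
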
doc/whats_new.rst

Lines changed: 13 additions & 0 deletions
@@ -31,6 +31,19 @@ Changelog
 New features
 ............
 
+- :class:`model_selection.GridSearchCV` and
+  :class:`model_selection.RandomizedSearchCV` now support simultaneous
+  evaluation of multiple metrics. Refer to the
+  :ref:`multimetric_grid_search` section of the user guide for more
+  information. :issue:`7388` by `Raghav RV`_
+
+- Added the :func:`model_selection.cross_validate` function, which allows
+  evaluation of multiple metrics. This function returns a dict with more
+  useful information from cross-validation, such as the train scores, fit
+  times and score times.
+  Refer to the :ref:`multimetric_cross_validation` section of the user guide
+  for more information. :issue:`7388` by `Raghav RV`_
+
 - Added :class:`multioutput.ClassifierChain` for multi-label
   classification. By `Adam Kleczewski <adamklec>`_.

examples/model_selection/plot_multi_metric_evaluation.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+"""Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV
+
+Multiple metric parameter search can be done by setting the ``scoring``
+parameter to a list of metric scorer names or a dict mapping the scorer names
+to the scorer callables.
+
+The scores of all the scorers are available in the ``cv_results_`` dict at keys
+ending in ``'_<scorer_name>'`` (``'mean_test_precision'``,
+``'rank_test_precision'``, etc...)
+
+The ``best_estimator_``, ``best_index_``, ``best_score_`` and ``best_params_``
+correspond to the scorer (key) that is set to the ``refit`` attribute.
+"""
+
+# Author: Raghav RV <rvraghav93@gmail.com>
+# License: BSD
+
+import numpy as np
+from matplotlib import pyplot as plt
+
+from sklearn.datasets import make_hastie_10_2
+from sklearn.model_selection import GridSearchCV
+from sklearn.metrics import make_scorer
+from sklearn.metrics import accuracy_score
+from sklearn.tree import DecisionTreeClassifier
+
+print(__doc__)
+
+###############################################################################
+# Running ``GridSearchCV`` using multiple evaluation metrics
+# ----------------------------------------------------------
+#
+
+X, y = make_hastie_10_2(n_samples=8000, random_state=42)
+
+# The scorers can be either one of the predefined metric strings or a scorer
+# callable, like the one returned by make_scorer
+scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
+
+# Setting refit='AUC' refits an estimator on the whole dataset with the
+# parameter setting that has the best cross-validated AUC score.
+# That estimator is made available at ``gs.best_estimator_`` along with
+# parameters like ``gs.best_score_``, ``gs.best_params_`` and
+# ``gs.best_index_``
+gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
+                  param_grid={'min_samples_split': range(2, 403, 10)},
+                  scoring=scoring, cv=5, refit='AUC')
+gs.fit(X, y)
+results = gs.cv_results_
+
+###############################################################################
+# Plotting the result
+# -------------------
+
+plt.figure(figsize=(13, 13))
+plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
+          fontsize=16)
+
+plt.xlabel("min_samples_split")
+plt.ylabel("Score")
+plt.grid()
+
+ax = plt.axes()
+ax.set_xlim(0, 402)
+ax.set_ylim(0.73, 1)
+
+# Get the regular numpy array from the MaskedArray
+X_axis = np.array(results['param_min_samples_split'].data, dtype=float)
+
+for scorer, color in zip(sorted(scoring), ['g', 'k']):
+    for sample, style in (('train', '--'), ('test', '-')):
+        sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
+        sample_score_std = results['std_%s_%s' % (sample, scorer)]
+        ax.fill_between(X_axis, sample_score_mean - sample_score_std,
+                        sample_score_mean + sample_score_std,
+                        alpha=0.1 if sample == 'test' else 0, color=color)
+        ax.plot(X_axis, sample_score_mean, style, color=color,
+                alpha=1 if sample == 'test' else 0.7,
+                label="%s (%s)" % (scorer, sample))
+
+    best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
+    best_score = results['mean_test_%s' % scorer][best_index]
+
+    # Plot a dash-dot vertical line at the best score for that scorer marked by x
+    ax.plot([X_axis[best_index], ] * 2, [0, best_score],
+            linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)
+
+    # Annotate the best score for that scorer
+    ax.annotate("%0.2f" % best_score,
+                (X_axis[best_index], best_score + 0.005))
+
+plt.legend(loc="best")
+plt.grid('off')
+plt.show()
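
As a usage note on the example above (a sketch, not part of the committed file): with ``refit='AUC'`` the ``best_*`` attributes follow the AUC scorer, while ``cv_results_`` keeps one set of entries per scorer name given in ``scoring``. Assuming the ``gs``, ``results`` and ``scoring`` objects defined in the example:

    # Refit-related attributes refer to the scorer named by `refit` ('AUC').
    print(gs.best_params_)   # parameter setting with the best mean test AUC
    print(gs.best_score_)    # mean cross-validated AUC for that setting

    # Each scorer contributes its own set of cv_results_ keys.
    for name in sorted(scoring):
        best = results['rank_test_%s' % name].argmin()   # rank 1 == best
        print(name,
              results['params'][best],
              results['mean_test_%s' % name][best])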
