MRG: Dummy estimators #1373
Changes from all commits
@@ -0,0 +1,64 @@
.. _model_evaluation:

===================
Model evaluation
===================

.. TODO

Metrics
=======


Dummy estimators
=================

.. currentmodule:: sklearn.dummy

When doing supervised learning, a simple sanity check consists of comparing
one's estimator against simple rules of thumb.
:class:`DummyClassifier` implements three such simple strategies for
classification:

- `stratified` generates predictions at random, respecting the training
  set's class distribution,
- `most_frequent` always predicts the most frequent label in the training set,
- `uniform` generates predictions uniformly at random.

Note that with all these strategies, the `predict` method completely ignores
the input data!

To illustrate :class:`DummyClassifier`, first let's create an imbalanced
dataset::

  >>> from sklearn.datasets import load_iris
  >>> iris = load_iris()
  >>> X, y = iris.data, iris.target
  >>> y[y != 1] = -1

Next, let's compare the accuracy of `SVC` and `most_frequent`::

  >>> from sklearn.dummy import DummyClassifier
  >>> from sklearn.svm import SVC
  >>> clf = SVC(kernel='linear', C=1).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.73...
  >>> clf = DummyClassifier(strategy='most_frequent', random_state=0).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.66...

We see that `SVC` doesn't do much better than a dummy classifier. Now, let's
change the kernel::

  >>> clf = SVC(kernel='rbf', C=1).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.99...

We see that the accuracy was boosted to almost 100%.

More generally, when the accuracy of a classifier is too close to random
classification, it probably means that something went wrong: features are not
helpful, a hyperparameter is not correctly tuned, the classifier is suffering
from class imbalance, etc.
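The other strategies can be inspected in the same way. For instance, the
`stratified` strategy draws its predictions from the empirical class
distribution, which :class:`DummyClassifier` stores in the `class_prior_`
attribute after fitting. As a rough sketch continuing the example above (the
exact array formatting may differ)::

  >>> clf = DummyClassifier(strategy='stratified', random_state=0).fit(X, y)
  >>> clf.classes_
  array([-1,  1])
  >>> clf.class_prior_ # doctest: +ELLIPSIS
  array([ 0.66...,  0.33...])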
:class:`DummyRegressor` implements a simple rule of thumb for regression:
always predict the mean of the training targets.
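As a rough illustration of :class:`DummyRegressor` (a sketch based on the
class as proposed in this pull request; exact array formatting may differ)::

  >>> import numpy as np
  >>> from sklearn.dummy import DummyRegressor
  >>> X = np.array([[1.0], [2.0], [3.0], [4.0]])
  >>> y = np.array([2.0, 3.0, 5.0, 10.0])
  >>> reg = DummyRegressor().fit(X, y)
  >>> reg.y_mean_
  5.0
  >>> reg.predict(X)
  array([ 5.,  5.,  5.,  5.])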
@@ -0,0 +1,212 @@
# Author: Mathieu Blondel <mathieu@mblondel.org>
# License: BSD Style.

import numpy as np

from .base import BaseEstimator, ClassifierMixin, RegressorMixin
from .preprocessing import LabelEncoder
from .utils import check_random_state
from .utils.fixes import unique
from .utils.validation import safe_asarray


class DummyClassifier(BaseEstimator, ClassifierMixin):
    """
    DummyClassifier is a classifier that makes predictions using simple rules.

    This classifier is useful as a simple baseline to compare with other
    (real) classifiers. Do not use it for real problems.

    Parameters
    ----------
    strategy: str
        Strategy to use to generate predictions.
            * "stratified": generates predictions by respecting the training
              set's class distribution.
            * "most_frequent": always predicts the most frequent label in the
              training set.
            * "uniform": generates predictions uniformly at random.

    random_state: int seed, RandomState instance, or None (default)
        The seed of the pseudo random number generator to use.

    Attributes
    ----------
    `classes_` : array, shape = [n_classes]
        Class labels.

    `class_prior_` : array, shape = [n_classes]
        Probability of each class.
    """

    def __init__(self, strategy="stratified", random_state=None):
        self.strategy = strategy
        self.random_state = random_state

    def fit(self, X, y):
        """Fit the random classifier.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """
        if self.strategy not in ("most_frequent", "stratified", "uniform"):
            raise ValueError("Unknown strategy type.")

        self.classes_, y = unique(y, return_inverse=True)
        self.class_prior_ = np.bincount(y) / float(y.shape[0])
        return self

    def predict(self, X):
        """
        Perform classification on test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        y : array, shape = [n_samples]
            Predicted target values for X.
        """
        if not hasattr(self, "classes_"):
            raise ValueError("DummyClassifier not fitted.")

        X = safe_asarray(X)
        n_samples = X.shape[0]
        rs = check_random_state(self.random_state)

        if self.strategy == "most_frequent":
            ret = np.ones(n_samples, dtype=int) * self.class_prior_.argmax()
        elif self.strategy == "stratified":
            ret = self.predict_proba(X).argmax(axis=1)
        elif self.strategy == "uniform":
            ret = rs.randint(len(self.classes_), size=n_samples)

        return self.classes_[ret]

    def predict_proba(self, X):
        """
        Return probability estimates for the test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        P : array-like, shape = [n_samples, n_classes]
            Returns the probability of the sample for each class in
            the model, where classes are ordered arithmetically.
""" | ||
if not hasattr(self, "classes_"): | ||
raise ValueError("DummyClassifier not fitted.") | ||
|
||
X = safe_asarray(X) | ||
n_samples = X.shape[0] | ||
n_classes = len(self.classes_) | ||
rs = check_random_state(self.random_state) | ||
|
||
if self.strategy == "most_frequent": | ||
ind = np.ones(n_samples, dtype=int) * self.class_prior_.argmax() | ||
out = np.zeros((n_samples, n_classes), dtype=np.float64) | ||
out[:, ind] = 1.0 | ||
elif self.strategy == "stratified": | ||
out = rs.multinomial(1, self.class_prior_, size=n_samples) | ||
elif self.strategy == "uniform": | ||
out = np.ones((n_samples, n_classes), dtype=np.float64) | ||
out /= n_classes | ||
|
||
return out | ||
|
||
def predict_log_proba(self, X): | ||
""" | ||
Return log probability estimates for the test vectors X. | ||
|
||
Parameters | ||
---------- | ||
10000 X : {array-like, sparse matrix}, shape = [n_samples, n_features] | ||
Input vectors, where n_samples is the number of samples | ||
and n_features is the number of features. | ||
|
||
Returns | ||
------- | ||
P : array-like, shape = [n_samples, n_classes] | ||
Returns the log probability of the sample for each class in | ||
the model, where classes are ordered arithmetically. | ||
""" | ||
return np.log(self.predict_proba(X)) | ||
|
||
|
||
class DummyRegressor(BaseEstimator, RegressorMixin): | ||
""" | ||
DummyRegressor is a regressor that always predicts the mean of the training | ||
targets. | ||
|
||
This regressor is useful as a simple baseline to compare with other | ||
(real) regressors. Do not use it for real problems. | ||
|
||
Attributes | ||
---------- | ||
`y_mean_` : float | ||
Mean of the training targets. | ||
""" | ||
|
||
def fit(self, X, y): | ||
"""Fit the random regressor. | ||
|
||
Parameters | ||
---------- | ||
X : {array-like, sparse matrix}, shape = [n_samples, n_features] | ||
Training vectors, where n_samples is the number of samples | ||
and n_features is the number of features. | ||
|
||
y : array-like, shape = [n_samples] | ||
Target values. | ||
|
||
Returns | ||
------- | ||
self : object | ||
Returns self. | ||
""" | ||
self.y_mean_ = np.mean(y) | ||
return self | ||
|
||
    def predict(self, X):
        """
        Perform regression on test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        y : array, shape = [n_samples]
            Predicted target values for X.
        """
        if not hasattr(self, "y_mean_"):
            raise ValueError("DummyRegressor not fitted.")

        X = safe_asarray(X)
        n_samples = X.shape[0]

        return np.ones(n_samples) * self.y_mean_
Shouldn't this be a scalar in the binary case? That's how we handle intercept_ in the linear models.

I don't like treating the binary case differently.

BTW, I called it class_prior_ to follow MultinomialNB. Do you treat the binary case differently in it?

Neither do I, but it's a long-standing convention.

As for mimicking NB: I've been considering changing its behavior, I just never got round to it. It does some stuff differently from all the other estimators, and I feel it should be more like the other linear models.
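For context, here is a minimal sketch of the two conventions under discussion; this is hypothetical illustration code, not part of the pull request:

```python
import numpy as np

y = np.array([-1, -1, 1])  # a binary problem

# Current DummyClassifier behaviour: one prior per class, even when binary.
classes, y_ind = np.unique(y, return_inverse=True)
class_prior = np.bincount(y_ind) / float(y_ind.shape[0])
print(class_prior)        # [ 0.66666667  0.33333333]

# The intercept_-style convention discussed above would collapse the binary
# case to a single scalar, e.g. the prior of the positive class.
binary_prior = class_prior[1]
print(binary_prior)       # 0.333...
```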