MRG: Dummy estimators by mblondel · Pull Request #1373 · scikit-learn/scikit-learn · GitHub


MRG: Dummy estimators #1373


Merged: 20 commits, merged on Nov 22, 2012
5 changes: 3 additions & 2 deletions doc/model_selection.rst
Expand Up @@ -2,11 +2,12 @@

.. _model_selection:

Model Selection
-----------------------
Model selection and evaluation
------------------------------

.. toctree::

    modules/cross_validation
    modules/grid_search
    modules/pipeline
    modules/model_evaluation
23 changes: 23 additions & 0 deletions doc/modules/classes.rst
Expand Up @@ -232,6 +232,29 @@ Samples generator
decomposition.dict_learning_online
decomposition.sparse_encode

.. _dummy_ref:

:mod:`sklearn.dummy`: Dummy estimators
======================================

.. automodule:: sklearn.dummy
   :no-members:
   :no-inherited-members:

**User guide:** See the :ref:`model_evaluation` section for further details.

.. currentmodule:: sklearn

.. autosummary::
   :toctree: generated/
   :template: class.rst

   dummy.DummyClassifier
   dummy.DummyRegressor

.. autosummary::
   :toctree: generated/
   :template: function.rst

.. _ensemble_ref:

Expand Down
64 changes: 64 additions & 0 deletions doc/modules/model_evaluation.rst
@@ -0,0 +1,64 @@
.. _model_evaluation:

===================
Model evaluation
===================

.. TODO

Metrics
=======


Dummy estimators
=================

.. currentmodule:: sklearn.dummy

When doing supervised learning, a simple sanity check consists of comparing one's
estimator against simple rules of thumb.
:class:`DummyClassifier` implements three such simple strategies for classification:

- `stratified` generates random predictions that respect the training
  set's class distribution,
- `most_frequent` always predicts the most frequent label in the training set,
- `uniform` generates predictions uniformly at random.

Note that with all these strategies, the `predict` method completely ignores
the input data!
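
The three strategies can be sketched in plain Python, without NumPy. This is an illustrative stand-in rather than the code in `sklearn/dummy.py`; the helper name `dummy_predict` and its arguments are hypothetical.

```python
import random
from collections import Counter


def dummy_predict(y_train, n_samples, strategy="stratified", seed=0):
    """Return n_samples predictions without ever looking at input features."""
    rng = random.Random(seed)
    counts = Counter(y_train)
    classes = sorted(counts)  # distinct labels, in sorted order

    if strategy == "most_frequent":
        # Always predict the label seen most often during training.
        majority = max(classes, key=lambda c: counts[c])
        return [majority] * n_samples
    if strategy == "stratified":
        # Sample labels with probability equal to their training frequency.
        weights = [counts[c] for c in classes]
        return rng.choices(classes, weights=weights, k=n_samples)
    if strategy == "uniform":
        # Every class is equally likely, whatever the training distribution.
        return [rng.choice(classes) for _ in range(n_samples)]
    raise ValueError("Unknown strategy type.")


y_train = [-1] * 100 + [1] * 50  # imbalanced labels: two thirds are -1
print(dummy_predict(y_train, 5, "most_frequent"))  # [-1, -1, -1, -1, -1]
```

Only the training labels and the number of requested predictions enter the function, which is what "ignores the input data" means in practice.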

To illustrate :class:`DummyClassifier`, first let's create an imbalanced
dataset::

  >>> from sklearn.datasets import load_iris
  >>> iris = load_iris()
  >>> X, y = iris.data, iris.target
  >>> y[y != 1] = -1

Next, let's compare the accuracy of `SVC` and `most_frequent`::

  >>> from sklearn.dummy import DummyClassifier
  >>> from sklearn.svm import SVC
  >>> clf = SVC(kernel='linear', C=1).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.73...
  >>> clf = DummyClassifier(strategy='most_frequent', random_state=0).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.66...

We see that `SVC` doesn't do much better than a dummy classifier. Now, let's change
the kernel::

  >>> clf = SVC(kernel='rbf', C=1).fit(X, y)
  >>> clf.score(X, y) # doctest: +ELLIPSIS
  0.99...

We see that the accuracy was boosted to almost 100%.

More generally, when the accuracy of a classifier is too close to that of random
classification, it probably means that something went wrong: the features are
not helpful, a hyperparameter is not correctly tuned, the classifier is
suffering from class imbalance, etc.
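
The `0.66...` baseline in the example can be checked by hand: iris has 50 samples in each of its three classes, so after relabelling, 100 of the 150 targets are `-1`. A short plain-Python check follows, where the list `y` is a stand-in for the relabelled `iris.target` (an assumption made so that the snippet runs without scikit-learn).

```python
from collections import Counter

# Stand-in for iris.target after `y[y != 1] = -1`:
# 50 samples per class, so 100 become -1 and 50 stay 1.
y = [-1] * 100 + [1] * 50

# Most frequent label and how often it occurs.
label, count = Counter(y).most_common(1)[0]
baseline_accuracy = count / len(y)

print(label)                        # -1
print(round(baseline_accuracy, 4))  # 0.6667
```

A classifier that scores near this value has learned little beyond the class frequencies.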

:class:`DummyRegressor` implements a simple rule of thumb for regression:
always predict the mean of the training targets.
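
A plain-Python sketch of this baseline follows. The helper names `fit_mean`, `predict_mean` and `r2_score` are hypothetical stand-ins, not scikit-learn functions. The point is that predicting the training mean gives a coefficient of determination of exactly 0 on the training set, which is the reference a real regressor should beat.

```python
def fit_mean(y_train):
    """'Fit' by remembering the mean of the training targets."""
    return sum(y_train) / len(y_train)


def predict_mean(y_mean, n_samples):
    """Predict the stored mean for every sample, ignoring the features."""
    return [y_mean] * n_samples


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_train = [1.0, 2.0, 3.0, 6.0]
y_mean = fit_mean(y_train)                   # 3.0
y_pred = predict_mean(y_mean, len(y_train))
print(r2_score(y_train, y_pred))             # 0.0
```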
4 changes: 4 additions & 0 deletions doc/whats_new.rst
Expand Up @@ -65,6 +65,10 @@ Changelog
"hashing trick" for fast, low-memory feature extraction from string data
by `Lars Buitinck`_.

- New dummy estimators :class:`dummy.DummyClassifier` and
  :class:`dummy.DummyRegressor` by `Mathieu Blondel`_. Useful to sanity-check your
  estimators.


API changes summary
-------------------
Expand Down
212 changes: 212 additions & 0 deletions sklearn/dummy.py
@@ -0,0 +1,212 @@

# Author: Mathieu Blondel <mathieu@mblondel.org>
# License: BSD Style.

import numpy as np

from .base import BaseEstimator, ClassifierMixin, RegressorMixin
from .preprocessing import LabelEncoder
from .utils import check_random_state
from .utils.fixes import unique
from .utils.validation import safe_asarray


class DummyClassifier(BaseEstimator, ClassifierMixin):
    """
    DummyClassifier is a classifier that makes predictions using simple rules.

    This classifier is useful as a simple baseline to compare with other
    (real) classifiers. Do not use it for real problems.

    Parameters
    ----------
    strategy: str
        Strategy to use to generate predictions.
            * "stratified": generates predictions by respecting the training
              set's class distribution.
            * "most_frequent": always predicts the most frequent label in the
              training set.
            * "uniform": generates predictions uniformly at random.

    random_state: int seed, RandomState instance, or None (default)
        The seed of the pseudo random number generator to use.

    Attributes
    ----------
    `classes_` : array, shape = [n_classes]
        Class labels.

    `class_prior_` : array, shape = [n_classes]
        Probability of each class.
Member

Shouldn't this be a scalar in the binary case? That's how we handle intercept_ in the linear models.

Member Author

I don't like treating the binary case differently.

Member Author

BTW, I called it class_prior_ to follow MultinomialNB. Do you treat the binary case differently in it?

Member

Neither do I, but it's a long-standing convention.

As for mimicking NB: I've been considering changing its behavior, I just never got round to it. It does some stuff differently from all the other estimators, and I feel it should be more like the other linear models.

"""

    def __init__(self, strategy="stratified", random_state=None):
        self.strategy = strategy
        self.random_state = random_state

    def fit(self, X, y):
        """Fit the random classifier.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """
        if self.strategy not in ("most_frequent", "stratified", "uniform"):
            raise ValueError("Unknown strategy type.")

        self.classes_, y = unique(y, return_inverse=True)
        self.class_prior_ = np.bincount(y) / float(y.shape[0])
        return self

    def predict(self, X):
        """
        Perform classification on test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        y : array, shape = [n_samples]
            Predicted target values for X.
        """
        if not hasattr(self, "classes_"):
            raise ValueError("DummyClassifier not fitted.")

        X = safe_asarray(X)
        n_samples = X.shape[0]
        rs = check_random_state(self.random_state)

        if self.strategy == "most_frequent":
            ret = np.ones(n_samples, dtype=int) * self.class_prior_.argmax()
        elif self.strategy == "stratified":
            ret = self.predict_proba(X).argmax(axis=1)
        elif self.strategy == "uniform":
            ret = rs.randint(len(self.classes_), size=n_samples)

        return self.classes_[ret]

    def predict_proba(self, X):
        """
        Return probability estimates for the test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        P : array-like, shape = [n_samples, n_classes]
            Returns the probability of the sample for each class in
            the model, where classes are ordered arithmetically.
Member

This should be an n_samples array in the binary case. Same goes for predict_log_proba.

Member Author

I believe SGD is returning [n_samples, n_classes] in all cases now.

"""
        if not hasattr(self, "classes_"):
            raise ValueError("DummyClassifier not fitted.")

        X = safe_asarray(X)
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        rs = check_random_state(self.random_state)

        if self.strategy == "most_frequent":
            ind = np.ones(n_samples, dtype=int) * self.class_prior_.argmax()
            out = np.zeros((n_samples, n_classes), dtype=np.float64)
            out[:, ind] = 1.0
        elif self.strategy == "stratified":
            out = rs.multinomial(1, self.class_prior_, size=n_samples)
        elif self.strategy == "uniform":
            out = np.ones((n_samples, n_classes), dtype=np.float64)
            out /= n_classes

        return out

    def predict_log_proba(self, X):
        """
        Return log probability estimates for the test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        P : array-like, shape = [n_samples, n_classes]
            Returns the log probability of the sample for each class in
            the model, where classes are ordered arithmetically.
        """
        return np.log(self.predict_proba(X))


class DummyRegressor(BaseEstimator, RegressorMixin):
    """
    DummyRegressor is a regressor that always predicts the mean of the training
    targets.

    This regressor is useful as a simple baseline to compare with other
    (real) regressors. Do not use it for real problems.

    Attributes
    ----------
    `y_mean_` : float
        Mean of the training targets.
    """

    def fit(self, X, y):
        """Fit the dummy regressor.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object
            Returns self.
        """
        self.y_mean_ = np.mean(y)
        return self

    def predict(self, X):
        """
        Predict target values for test vectors X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples
            and n_features is the number of features.

        Returns
        -------
        y : array, shape = [n_samples]
            Predicted target values for X.
        """
        if not hasattr(self, "y_mean_"):
            raise ValueError("DummyRegressor not fitted.")

        X = safe_asarray(X)
        n_samples = X.shape[0]

        return np.ones(n_samples) * self.y_mean_
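
In the `stratified` branch of `predict_proba`, `rs.multinomial(1, self.class_prior_, size=n_samples)` draws a single trial per sample, so each row of the result is one-hot and `predict` recovers the sampled class with `argmax`. A minimal standard-library sketch of that idea follows; `one_hot_draws` is a hypothetical helper, and `random.choices` stands in for NumPy's multinomial sampler.

```python
import random


def one_hot_draws(class_prior, n_samples, seed=0):
    """Draw one class per sample and encode each draw as a one-hot row."""
    rng = random.Random(seed)
    n_classes = len(class_prior)
    picks = rng.choices(range(n_classes), weights=class_prior, k=n_samples)
    return [[1 if j == k else 0 for j in range(n_classes)] for k in picks]


rows = one_hot_draws([2 / 3, 1 / 3], n_samples=5)

# The argmax of a one-hot row is the index of its single 1.
predicted_index = [row.index(1) for row in rows]
print(rows)
print(predicted_index)
```

Because such rows contain exact zeros, taking their logarithm with NumPy, as `predict_log_proba` does, yields `-inf` entries for the classes that were not drawn.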
11 changes: 7 additions & 4 deletions sklearn/tests/test_common.py
Expand Up @@ -18,10 +18,11 @@
from sklearn.utils.testing import assert_true
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_array_almost_equal
from sklearn.utils.testing import all_estimators
from sklearn.utils.testing import set_random_state
from sklearn.utils.testing import assert_greater

import sklearn
from sklearn.utils.testing import all_estimators, set_random_state
from sklearn.utils.testing import assert_greater
from sklearn.base import clone, ClassifierMixin, RegressorMixin, \
TransformerMixin, ClusterMixin
from sklearn.utils import shuffle
Expand All @@ -41,6 +42,7 @@
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier,\
OutputCodeClassifier
from sklearn.feature_selection import RFE, RFECV, SelectKBest
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.covariance import EllipticEnvelope, EllipticEnvelop
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
Expand All @@ -55,7 +57,7 @@
dont_test = [Pipeline, FeatureUnion, GridSearchCV, SparseCoder,
EllipticEnvelope, EllipticEnvelop, DictVectorizer, LabelBinarizer,
LabelEncoder, TfidfTransformer, IsotonicRegression, OneHotEncoder,
RandomTreesEmbedding, FeatureHasher]
RandomTreesEmbedding, FeatureHasher, DummyClassifier, DummyRegressor]
meta_estimators = [BaseEnsemble, OneVsOneClassifier, OutputCodeClassifier,
OneVsRestClassifier, RFE, RFECV]

Expand Down Expand Up @@ -508,7 +510,8 @@ def test_classifiers_classes():
y_pred = clf.predict(X)
# training set performance
assert_array_equal(np.unique(y), np.unique(y_pred))
assert_greater(zero_one_score(y, y_pred), 0.78)
assert_greater(zero_one_score(y, y_pred), 0.78,
"accuracy of %s not greater than 0.78" % str(Clf))


def test_regressors_int():
Expand Down
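
The last hunk of `test_common.py` adds a failure message so that a failing run names the offending estimator. The same pattern with the standard library's `unittest` is sketched below; `ThresholdCheck` is a hypothetical helper class, and `accuracy_by_name` holds made-up illustration values, not results from the test suite.

```python
import unittest


class ThresholdCheck(unittest.TestCase):
    """Bare TestCase used only to borrow its assertion helpers."""

    def runTest(self):
        pass


check = ThresholdCheck()

# Hypothetical accuracies keyed by estimator name (illustration only).
accuracy_by_name = {"GoodClf": 0.95, "WeakClf": 0.50}

failures = []
for name, accuracy in accuracy_by_name.items():
    try:
        # The third argument is appended to the default failure text.
        check.assertGreater(
            accuracy, 0.78, "accuracy of %s not greater than 0.78" % name
        )
    except AssertionError as exc:
        failures.append(str(exc))

print(failures)
```

With the message in place, the collected failure text includes the estimator name, which is otherwise lost when one loop checks many classifiers.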