Merge pull request #1 from ogrisel/glouppe-ensemble-rebased · larsmans/scikit-learn@c3cd700
Commit c3cd700
Merge pull request #1 from ogrisel/glouppe-ensemble-rebased
2 parents: f071368 + 50d8ac2
2 files changed: +85 additions, -29 deletions

doc/modules/ensemble.rst (36 additions, 25 deletions)
@@ -34,7 +34,7 @@ Forests of randomized trees
 The ``sklearn.ensemble`` module includes two averaging algorithms based on
 randomized :ref:`decision trees <tree>`: the RandomForest algorithm and the
 Extra-Trees method. Both algorithms are perturb-and-combine techniques
-specifically designed for trees.
+specifically designed for trees::
 
     >>> from sklearn.ensemble import RandomForestClassifier
     >>> X = [[0, 0], [1, 1]]
@@ -60,39 +60,50 @@ features is used, but instead of looking for the most discriminative thresholds,
 thresholds are drawn at random for each candidate feature and the best of these
 randomly-generated thresholds is picked as the splitting rule. This usually
 allows to reduce the variance of the model a bit more, at the expense of a
-slightly greater increase in bias.
+slightly greater increase in bias::
 
     >>> from sklearn.cross_validation import cross_val_score
     >>> from sklearn.datasets import make_blobs
     >>> from sklearn.ensemble import RandomForestClassifier
     >>> from sklearn.ensemble import ExtraTreesClassifier
     >>> from sklearn.tree import DecisionTreeClassifier
-    >>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100)
-    >>> clf = DecisionTreeClassifier(max_depth=None, min_split=1)
+
+    >>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
+    ...     random_state=0)
+
+    >>> clf = DecisionTreeClassifier(max_depth=None, min_split=1,
+    ...     random_state=0)
     >>> scores = cross_val_score(clf, X, y)
-    >>> scores.mean()
-    0.97609967955403809
-    >>> clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_split=1)
+    >>> scores.mean()  # doctest: +ELLIPSIS
+    0.978...
+
+    >>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
+    ...     min_split=1, random_state=0)
     >>> scores = cross_val_score(clf, X, y)
-    >>> scores.mean()
-    0.99510028987301846
-    >>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_split=1)
+    >>> scores.mean()  # doctest: +ELLIPSIS
+    0.992...
+
+    >>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
+    ...     min_split=1, random_state=0)
     >>> scores = cross_val_score(clf, X, y)
-    >>> scores.mean()
-    1.0
-
-The main parameters to adjust when using these methods is ``n_estimators`` and
-``max_features``. The former is the number of trees in the forest. The larger
-the better, but also the longer it will take to compute. The latter is the size
-of the random subsets of features to consider when splitting a node. The lower
-the greater the reduction of variance, but also the greater the increase in
-bias. Empiricial good default values are ``max_features=M`` in random forests,
-and ``max_features=sqrt(M)`` in extra-trees (where ``M`` is the number of
-features in the data). The best results are also usually reached when setting
-``max_depth=None`` in combination with ``min_split=1`` (i.e., when fully
-developping the trees). Finally, note that bootstrap samples are used by default
-in random forests (``bootstrap=True``) while the default strategy is to use the
-original datasets for building extra-trees (``bootstrap=False``).
+    >>> scores.mean() > 0.999
+    True
+
+The main parameters to adjust when using these methods are ``n_estimators``
+and ``max_features``. The former is the number of trees in the
+forest. The larger the better, but also the longer it will take to
+compute. The latter is the size of the random subsets of features to
+consider when splitting a node. The lower the greater the reduction of
+variance, but also the greater the increase in bias. Empirical good
+default values are ``max_features=n_features`` in random forests, and
+``max_features=sqrt(n_features)`` in extra-trees (where ``n_features``
+is the number of features in the data). The best results are also
+usually reached when setting ``max_depth=None`` in combination with
+``min_split=1`` (i.e., when fully developing the trees).
+
+Finally, note that bootstrap samples are used by default in random forests
+(``bootstrap=True``) while the default strategy is to use the original
+datasets for building extra-trees (``bootstrap=False``).
 
 .. topic:: Examples:
 
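The doctest added above predates later API changes, so here is a rough sketch of the same single-tree vs. random-forest vs. extra-trees comparison against a current scikit-learn. Note the renames (``sklearn.cross_validation`` became ``sklearn.model_selection``, and ``min_split`` became ``min_samples_split``, whose modern minimum is 2); the exact scores will differ from those in the diff, and the loop structure is mine, not the documentation's.

    # Hedged adaptation of the documented comparison to the modern
    # scikit-learn API; scores will not match the historical doctest.
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
                      random_state=0)

    for clf in (
        DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                               random_state=0),
        RandomForestClassifier(n_estimators=10, max_depth=None,
                               min_samples_split=2, random_state=0),
        ExtraTreesClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0),
    ):
        # Expect the single tree to score lowest and extra-trees highest,
        # mirroring the ordering shown in the doctest above.
        scores = cross_val_score(clf, X, y)
        print(type(clf).__name__, scores.mean())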
sklearn/ensemble/forest.py (49 additions, 4 deletions)
@@ -1,6 +1,33 @@
-"""
-This module gathers forest of trees-based ensemble methods, including random
-forests and extra-trees.
+"""Forest of trees-based ensemble methods
+
+Those methods include random forests and extremely randomized trees.
+
+The module structure is the following:
+
+- The ``Forest`` base class implements a common ``fit`` method for all
+  the estimators in the module. The ``fit`` method of the base ``Forest``
+  class calls the ``fit`` method of each sub-estimator on random samples
+  (with replacement, a.k.a. bootstrap) of the training set.
+
+  The init of the sub-estimator is further delegated to the
+  ``BaseEnsemble`` constructor.
+
+- The ``ForestClassifier`` and ``ForestRegressor`` base classes further
+  implement the prediction logic by computing an average of the predicted
+  outcomes of the sub-estimators.
+
+- The ``RandomForestClassifier`` and ``RandomForestRegressor`` derived
+  classes provide the user with concrete implementations of
+  the forest ensemble method using classical, deterministic
+  ``DecisionTreeClassifier`` and ``DecisionTreeRegressor`` as default
+  sub-estimator implementation.
+
+- The ``ExtraTreesClassifier`` and ``ExtraTreesRegressor`` derived
+  classes provide the user with concrete implementations of the
+  forest ensemble method using the extremely randomized trees
+  ``ExtraTreeClassifier`` and ``ExtraTreeRegressor`` as default
+  sub-estimator implementation.
+
 """
 
 # Authors: Gilles Louppe, Brian Holt
@@ -9,7 +36,7 @@
 import numpy as np
 
 from ..base import clone
-from ..base import BaseEstimator, ClassifierMixin, RegressorMixin
+from ..base import ClassifierMixin, RegressorMixin
 from ..tree import DecisionTreeClassifier, DecisionTreeRegressor, \
                    ExtraTreeClassifier, ExtraTreeRegressor
 from ..utils import check_random_state
@@ -216,6 +243,10 @@ def predict(self, X):
 class RandomForestClassifier(ForestClassifier):
     """A random forest classifier.
 
+    A random forest is a meta estimator that fits a number of classical
+    decision trees on various sub-samples of the dataset and uses averaging
+    to improve the predictive accuracy and control over-fitting.
+
     Parameters
     ----------
     base_estimator : object, optional (default=None)
@@ -275,6 +306,10 @@ def __init__(self, base_estimator=None,
 class RandomForestRegressor(ForestRegressor):
     """A random forest regressor.
 
+    A random forest is a meta estimator that fits a number of classical
+    decision trees on various sub-samples of the dataset and uses averaging
+    to improve the predictive accuracy and control over-fitting.
+
     Parameters
     ----------
     base_estimator : object, optional (default=None)
@@ -334,6 +369,11 @@ def __init__(self, base_estimator=None,
 class ExtraTreesClassifier(ForestClassifier):
     """An extra-trees classifier.
 
+    This class implements a meta estimator that fits a number of
+    randomized decision trees (a.k.a. extra-trees) on various sub-samples
+    of the dataset and uses averaging to improve the predictive accuracy
+    and control over-fitting.
+
     Parameters
     ----------
     base_estimator : object, optional (default=None)
@@ -394,6 +434,11 @@ def __init__(self, base_estimator=None,
 class ExtraTreesRegressor(ForestRegressor):
     """An extra-trees regressor.
 
+    This class implements a meta estimator that fits a number of
+    randomized decision trees (a.k.a. extra-trees) on various sub-samples
+    of the dataset and uses averaging to improve the predictive accuracy
+    and control over-fitting.
+
     Parameters
     ----------
     base_estimator : object, optional (default=None)
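To make the structure that the new module docstring describes concrete, here is a minimal standalone sketch of a forest that fits cloned sub-estimators on bootstrap samples and averages their predicted probabilities. This is NOT the actual forest.py implementation: ``ToyForestClassifier`` is a hypothetical name, and it assumes every bootstrap sample contains all classes.

    # Minimal sketch of the fit/predict structure described by the
    # docstring above -- a toy, not scikit-learn's forest.py.
    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import check_random_state


    class ToyForestClassifier:
        """Fit sub-estimators on bootstrap samples; average their votes."""

        def __init__(self, base_estimator=None, n_estimators=10,
                     bootstrap=True, random_state=None):
            self.base_estimator = base_estimator or DecisionTreeClassifier()
            self.n_estimators = n_estimators
            self.bootstrap = bootstrap
            self.random_state = random_state

        def fit(self, X, y):
            X, y = np.asarray(X), np.asarray(y)
            rng = check_random_state(self.random_state)
            self.classes_ = np.unique(y)
            self.estimators_ = []
            for _ in range(self.n_estimators):
                tree = clone(self.base_estimator)
                if self.bootstrap:
                    # Random samples with replacement, as in random forests.
                    idx = rng.randint(0, X.shape[0], X.shape[0])
                    tree.fit(X[idx], y[idx])
                else:
                    # Extra-trees default: train on the original dataset.
                    tree.fit(X, y)
                self.estimators_.append(tree)
            return self

        def predict_proba(self, X):
            # Average the sub-estimators' predicted class probabilities,
            # as the ForestClassifier base class does per the docstring.
            # Simplification: assumes every tree saw all the classes.
            probas = [tree.predict_proba(X) for tree in self.estimators_]
            return np.mean(probas, axis=0)

        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

A quick usage check of the sketch:

    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=200, centers=3, random_state=0)
    forest = ToyForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    print(forest.predict(X[:5]))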
