8000 DOC add clarification on random forest default params (#13248) · rth/scikit-learn@ab4b4ec · GitHub

Commit ab4b4ec

abenbihi authored and glemaitre committed
DOC add clarification on random forest default params (scikit-learn#13248)
1 parent 842df6f commit ab4b4ec

File tree

1 file changed: +39 -31 lines changed


doc/modules/ensemble.rst

Lines changed: 39 additions & 31 deletions
@@ -128,16 +128,23 @@ Random Forests
 --------------

 In random forests (see :class:`RandomForestClassifier` and
-:class:`RandomForestRegressor` classes), each tree in the ensemble is
-built from a sample drawn with replacement (i.e., a bootstrap sample)
-from the training set. In addition, when splitting a node during the
-construction of the tree, the split that is chosen is no longer the
-best split among all features. Instead, the split that is picked is the
-best split among a random subset of the features. As a result of this
-randomness, the bias of the forest usually slightly increases (with
-respect to the bias of a single non-random tree) but, due to averaging,
-its variance also decreases, usually more than compensating for the
-increase in bias, hence yielding an overall better model.
+:class:`RandomForestRegressor` classes), each tree in the ensemble is built
+from a sample drawn with replacement (i.e., a bootstrap sample) from the
+training set.
+
+Furthermore, when splitting each node during the construction of a tree, the
+best split is found either from all input features or a random subset of size
+``max_features``. (See the :ref:`parameter tuning guidelines
+<random_forest_parameters>` for more details.)
+
+The purpose of these two sources of randomness is to decrease the variance of
+the forest estimator. Indeed, individual decision trees typically exhibit high
+variance and tend to overfit. The injected randomness in forests yields decision
+trees with somewhat decoupled prediction errors. By taking an average of those
+predictions, some errors can cancel out. Random forests achieve a reduced
+variance by combining diverse trees, sometimes at the cost of a slight increase
+in bias. In practice the variance reduction is often significant, hence yielding
+an overall better model.

 In contrast to the original publication [B2001]_, the scikit-learn
 implementation combines classifiers by averaging their probabilistic
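As an aside (not part of this commit), the two sources of randomness described in the added paragraph map directly onto the estimator's parameters. A minimal sketch, assuming a toy dataset from ``make_classification`` purely for illustration::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Each tree is grown on a bootstrap sample of the training set
    # (bootstrap=True by default), and every split only considers a random
    # subset of sqrt(n_features) candidate features (max_features="sqrt").
    clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                 random_state=0)
    clf.fit(X, y)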
@@ -188,30 +195,31 @@ in bias::
 :align: center
 :scale: 75%

+.. _random_forest_parameters:
+
 Parameters
 ----------

-The main parameters to adjust when using these methods is ``n_estimators``
-and ``max_features``. The former is the number of trees in the forest. The
-larger the better, but also the longer it will take to compute. In
-addition, note that results will stop getting significantly better
-beyond a critical number of trees. The latter is the size of the random
-subsets of features to consider when splitting a node. The lower the
-greater the reduction of variance, but also the greater the increase in
-bias. Empirical good default values are ``max_features=n_features``
-for regression problems, and ``max_features=sqrt(n_features)`` for
-classification tasks (where ``n_features`` is the number of features
-in the data). Good results are often achieved when setting ``max_depth=None``
-in combination with ``min_samples_split=2`` (i.e., when fully developing the
-trees). Bear in mind though that these values are usually not optimal, and
-might result in models that consume a lot of RAM. The best parameter values
-should always be cross-validated. In addition, note that in random forests,
-bootstrap samples are used by default (``bootstrap=True``)
-while the default strategy for extra-trees is to use the whole dataset
-(``bootstrap=False``).
-When using bootstrap sampling the generalization accuracy can be estimated
-on the left out or out-of-bag samples. This can be enabled by
-setting ``oob_score=True``.
+The main parameters to adjust when using these methods are ``n_estimators`` and
+``max_features``. The former is the number of trees in the forest. The larger
+the better, but also the longer it will take to compute. In addition, note that
+results will stop getting significantly better beyond a critical number of
+trees. The latter is the size of the random subsets of features to consider
+when splitting a node. The lower, the greater the reduction of variance, but
+also the greater the increase in bias. Empirically good default values are
+``max_features=None`` (always considering all features instead of a random
+subset) for regression problems, and ``max_features="sqrt"`` (using a random
+subset of size ``sqrt(n_features)``) for classification tasks (where
+``n_features`` is the number of features in the data). Good results are often
+achieved when setting ``max_depth=None`` in combination with
+``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
+though that these values are usually not optimal, and might result in models
+that consume a lot of RAM. The best parameter values should always be
+cross-validated. In addition, note that in random forests, bootstrap samples
+are used by default (``bootstrap=True``) while the default strategy for
+extra-trees is to use the whole dataset (``bootstrap=False``). When using
+bootstrap sampling the generalization accuracy can be estimated on the left out
+or out-of-bag samples. This can be enabled by setting ``oob_score=True``.

 .. note::

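As an illustrative follow-up (not part of this commit), the parameter guidance in the rewritten paragraph can be exercised on a regression task. A minimal sketch, assuming a toy dataset from ``make_regression`` purely for illustration::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

    # Suggested regression defaults: consider all features at each split
    # (max_features=None) and grow the trees fully (max_depth=None,
    # min_samples_split=2). Keeping bootstrap=True lets the out-of-bag
    # samples provide a generalization estimate via oob_score=True.
    reg = RandomForestRegressor(n_estimators=100, max_features=None,
                                max_depth=None, min_samples_split=2,
                                bootstrap=True, oob_score=True, random_state=0)
    reg.fit(X, y)
    print(reg.oob_score_)  # R^2 estimated on the out-of-bag samples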