@@ -128,16 +128,23 @@ Random Forests
--------------

In random forests (see :class:`RandomForestClassifier` and
- :class:`RandomForestRegressor` classes), each tree in the ensemble is
- built from a sample drawn with replacement (i.e., a bootstrap sample)
- from the training set. In addition, when splitting a node during the
- construction of the tree, the split that is chosen is no longer the
- best split among all features. Instead, the split that is picked is the
- best split among a random subset of the features. As a result of this
- randomness, the bias of the forest usually slightly increases (with
- respect to the bias of a single non-random tree) but, due to averaging,
- its variance also decreases, usually more than compensating for the
- increase in bias, hence yielding an overall better model.
+ :class:`RandomForestRegressor` classes), each tree in the ensemble is built
+ from a sample drawn with replacement (i.e., a bootstrap sample) from the
+ training set.
+
+ Furthermore, when splitting each node during the construction of a tree, the
+ best split is found either from all input features or a random subset of size
+ ``max_features``. (See the :ref:`parameter tuning guidelines
+ <random_forest_parameters>` for more details.)
+
+ The purpose of these two sources of randomness is to decrease the variance of
+ the forest estimator. Indeed, individual decision trees typically exhibit high
+ variance and tend to overfit. The injected randomness in forests yields
+ decision trees with somewhat decoupled prediction errors. By taking an average
+ of those predictions, some errors can cancel out. Random forests achieve a
+ reduced variance by combining diverse trees, sometimes at the cost of a slight
+ increase in bias. In practice, the variance reduction is often significant,
+ hence yielding an overall better model.

In contrast to the original publication [B2001]_, the scikit-learn
implementation combines classifiers by averaging their probabilistic
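As a rough illustration of the behaviour described in the added paragraphs above, the following sketch compares a single fully grown decision tree with a forest of trees fit on bootstrap samples and restricted to random feature subsets. The synthetic dataset, the ``max_features="sqrt"`` value and the cross-validation scoring are illustrative assumptions, not part of the documentation change::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic classification data, used here purely for illustration.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # A single, fully grown decision tree: low bias but high variance.
    tree = DecisionTreeClassifier(random_state=0)

    # A forest of such trees: each tree is fit on a bootstrap sample, and only
    # a random subset of the features (max_features) is considered at each
    # split, which decorrelates the trees' prediction errors.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    bootstrap=True, random_state=0)

    # Averaging the decorrelated trees usually reduces variance relative to
    # the single tree.
    print(cross_val_score(tree, X, y).mean())
    print(cross_val_score(forest, X, y).mean())

In a run of this kind the forest typically scores higher than the single tree, reflecting the variance reduction discussed above.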
@@ -188,30 +195,31 @@ in bias::
:align: center
:scale: 75%

+ .. _random_forest_parameters:
+
Parameters
----------

- The main parameters to adjust when using these methods is ``n_estimators``
- and ``max_features``. The former is the number of trees in the forest. The
- larger the better, but also the longer it will take to compute. In
- addition, note that results will stop getting significantly better
- beyond a critical number of trees. The latter is the size of the random
- subsets of features to consider when splitting a node. The lower the
- greater the reduction of variance, but also the greater the increase in
- bias. Empirical good default values are ``max_features=n_features``
- for regression problems, and ``max_features=sqrt(n_features)`` for
- classification tasks (where ``n_features`` is the number of features
- in the data). Good results are often achieved when setting ``max_depth=None``
- in combination with ``min_samples_split=2`` (i.e., when fully developing the
- trees). Bear in mind though that these values are usually not optimal, and
- might result in models that consume a lot of RAM. The best parameter values
- should always be cross-validated. In addition, note that in random forests,
- bootstrap samples are used by default (``bootstrap=True``)
- while the default strategy for extra-trees is to use the whole dataset
- (``bootstrap=False``).
- When using bootstrap sampling the generalization accuracy can be estimated
- on the left out or out-of-bag samples. This can be enabled by
- setting ``oob_score=True``.
+ The main parameters to adjust when using these methods are ``n_estimators``
+ and ``max_features``. The former is the number of trees in the forest. The
+ larger the better, but also the longer it will take to compute. In addition,
+ note that results will stop getting significantly better beyond a critical
+ number of trees. The latter is the size of the random subsets of features to
+ consider when splitting a node. The lower, the greater the reduction of
+ variance, but also the greater the increase in bias. Empirically, good default
+ values are ``max_features=None`` (always considering all features instead of a
+ random subset) for regression problems, and ``max_features="sqrt"`` (using a
+ random subset of size ``sqrt(n_features)``) for classification tasks (where
+ ``n_features`` is the number of features in the data). Good results are often
+ achieved when setting ``max_depth=None`` in combination with
+ ``min_samples_split=2`` (i.e., when fully developing the trees). Bear in mind
+ though that these values are usually not optimal, and might result in models
+ that consume a lot of RAM. The best parameter values should always be
+ cross-validated. In addition, note that in random forests, bootstrap samples
+ are used by default (``bootstrap=True``) while the default strategy for
+ extra-trees is to use the whole dataset (``bootstrap=False``). When using
+ bootstrap sampling, the generalization accuracy can be estimated on the left
+ out or out-of-bag samples. This can be enabled by setting ``oob_score=True``.

.. note::
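To make the tuning guidance above concrete, here is a minimal sketch of the parameters discussed, together with out-of-bag scoring. The synthetic regression data and the specific values (e.g. ``n_estimators=200``) are illustrative assumptions rather than recommendations beyond what the text states::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic regression data, for illustration only.
    X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

    # n_estimators: more trees generally help, at a higher computational cost.
    # max_features=None considers all features at each split (the regression
    # setting discussed above); max_depth=None with min_samples_split=2 grows
    # the trees fully; bootstrap=True enables out-of-bag (OOB) estimation.
    reg = RandomForestRegressor(n_estimators=200, max_features=None,
                                max_depth=None, min_samples_split=2,
                                bootstrap=True, oob_score=True, random_state=0)
    reg.fit(X, y)

    # Generalization estimate (R^2 for a regressor) computed on the samples
    # left out of each tree's bootstrap sample.
    print(reg.oob_score_)

Note that ``oob_score=True`` only makes sense together with ``bootstrap=True``, since the estimate is computed on the samples each tree did not see during fitting.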