@@ -34,7 +34,7 @@ Forests of randomized trees
The ``sklearn.ensemble`` module includes two averaging algorithms based on
randomized :ref:`decision trees <tree>`: the RandomForest algorithm and the
Extra-Trees method. Both algorithms are perturb-and-combine techniques
- specifically designed for trees.
+ specifically designed for trees::

>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
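
The hunk ends after defining ``X``; for readers following along, a minimal sketch of how such a quick-start snippet is typically completed is given below. The labels ``Y`` and the ``n_estimators=10`` setting are illustrative assumptions, not taken from this diff::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X = [[0, 0], [1, 1]]        # two toy samples with two features each
    >>> Y = [0, 1]                  # assumed toy labels, one per sample
    >>> clf = RandomForestClassifier(n_estimators=10)   # forest of 10 trees
    >>> clf = clf.fit(X, Y)         # fit the ensemble on the toy data
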
@@ -60,39 +60,50 @@ features is used, but instead of looking for the most discriminative thresholds,
thresholds are drawn at random for each candidate feature and the best of these
randomly-generated thresholds is picked as the splitting rule. This usually
allows the variance of the model to be reduced a bit more, at the expense of a
- slightly greater increase in bias.
+ slightly greater increase in bias::

>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier
- >>> X, y = make_blobs(n_samples = 10000 , n_features = 10 , centers = 100 )
- >>> clf = DecisionTreeClassifier(max_depth = None , min_split = 1 )
+
+ >>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
+ ...     random_state=0)
+
+ >>> clf = DecisionTreeClassifier(max_depth=None, min_split=1,
+ ...     random_state=0)

>>> scores = cross_val_score(clf, X, y)
- >>> scores.mean()
- 0.97609967955403809
- >>> clf = RandomForestClassifier(n_estimators = 10 , max_depth = None , min_split = 1 )
+ >>> scores.mean()  # doctest: +ELLIPSIS
+ 0.978...
+
+ >>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
+ ...     min_split=1, random_state=0)

>>> scores = cross_val_score(clf, X, y)
- >>> scores.mean()
- 0.99510028987301846
- >>> clf = ExtraTreesClassifier(n_estimators = 10 , max_depth = None , min_split = 1 )
+ >>> scores.mean()  # doctest: +ELLIPSIS
+ 0.992...
+
+ >>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
+ ...     min_split=1, random_state=0)

>>> scores = cross_val_score(clf, X, y)
- >>> scores.mean()
- 1.0
-
- The main parameters to adjust when using these methods is ``n_estimators`` and
- ``max_features``. The former is the number of trees in the forest. The larger
- the better, but also the longer it will take to compute. The latter is the size
- of the random subsets of features to consider when splitting a node. The lower
- the greater the reduction of variance, but also the greater the increase in
- bias. Empiricial good default values are ``max_features=M`` in random forests,
- and ``max_features=sqrt(M)`` in extra-trees (where ``M`` is the number of
- features in the data). The best results are also usually reached when setting
- ``max_depth=None`` in combination with ``min_split=1`` (i.e., when fully
- developping the trees). Finally, note that bootstrap samples are used by default
- in random forests (``bootstrap=True``) while the default strategy is to use the
- original datasets for building extra-trees (``bootstrap=False``).
+ >>> scores.mean() > 0.999
+ True
+
+ The main parameters to adjust when using these methods are ``n_estimators``
+ and ``max_features``. The former is the number of trees in the
+ forest. The larger the better, but also the longer it will take to
+ compute. The latter is the size of the random subsets of features to
+ consider when splitting a node. The lower, the greater the reduction of
+ variance, but also the greater the increase in bias. Empirical good
+ default values are ``max_features=n_features`` in random forests, and
+ ``max_features=sqrt(n_features)`` in extra-trees (where ``n_features``
+ is the number of features in the data). The best results are also
+ usually reached when setting ``max_depth=None`` in combination with
+ ``min_split=1`` (i.e., when fully developing the trees).
+
+ Finally, note that bootstrap samples are used by default in random forests
+ (``bootstrap=True``) while the default strategy is to use the original
+ datasets for building extra-trees (``bootstrap=False``).
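
Putting the recommendations above together, a rough sketch of these settings is shown below. It is not part of this commit: it reuses only the parameter names that appear in the text (``n_estimators``, ``max_features``, ``max_depth``, ``min_split``), and the concrete values and the integer form of ``max_features`` are assumptions::

    >>> import numpy as np
    >>> from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    >>> n_features = 10                           # e.g. the make_blobs data above
    >>> rf = RandomForestClassifier(n_estimators=10,             # more trees: better, but slower
    ...                             max_features=n_features,     # all features, as suggested above
    ...                             max_depth=None, min_split=1)
    >>> et = ExtraTreesClassifier(n_estimators=10,
    ...                           max_features=int(np.sqrt(n_features)),  # sqrt(n_features)
    ...                           max_depth=None, min_split=1)

``bootstrap`` is left at the defaults described above (``True`` for random forests, ``False`` for extra-trees).
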
.. topic:: Examples: