Deprecate min_samples_leaf and min_weight_fraction_leaf (#11870) · scikit-learn/scikit-learn@2fe58e5 · GitHub

Commit 2fe58e5

jnothman authored and rth committed
Deprecate min_samples_leaf and min_weight_fraction_leaf (#11870)
1 parent ac41ccf commit 2fe58e5

File tree

13 files changed (+327, -164 lines)


doc/modules/ensemble.rst

Lines changed: 2 additions & 4 deletions
@@ -218,7 +218,7 @@ setting ``oob_score=True``.
 The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
 where :math:`M` is the number of trees and :math:`N` is the number of samples.
 In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``min_samples_leaf``, ``max_leaf_nodes`` and ``max_depth``.
+``min_samples_split``, ``max_leaf_nodes`` and ``max_depth``.
 
 Parallelization
 ---------------
@@ -393,9 +393,7 @@ The number of weak learners is controlled by the parameter ``n_estimators``. The
 the final combination. By default, weak learners are decision stumps. Different
 weak learners can be specified through the ``base_estimator`` parameter.
 The main parameters to tune to obtain good results are ``n_estimators`` and
-the complexity of the base estimators (e.g., its depth ``max_depth`` or
-minimum required number of samples at a leaf ``min_samples_leaf`` in case of
-decision trees).
+the complexity of the base estimators (e.g., its depth ``max_depth``).
 
 .. topic:: Examples:
 
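As a hedged illustration of the revised guidance, a forest's size could be limited with the remaining parameters roughly as follows; the dataset, estimator choice and values are arbitrary placeholders, not taken from the docs:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to make the snippet self-contained.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The parameters the updated docs still recommend for reducing model size.
forest = RandomForestClassifier(n_estimators=100,
                                min_samples_split=10,  # samples required to split a node
                                max_leaf_nodes=64,     # cap on leaves per tree
                                max_depth=8,           # cap on tree depth
                                random_state=0)
forest.fit(X, y)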
doc/modules/tree.rst

Lines changed: 7 additions & 18 deletions
@@ -330,29 +330,18 @@ Tips on practical use
   for each additional level the tree grows to. Use ``max_depth`` to control
   the size of the tree to prevent overfitting.
 
-* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
-  samples at a leaf node. A very small number will usually mean the tree
-  will overfit, whereas a large number will prevent the tree from learning
-  the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
-  varies greatly, a float number can be used as percentage in these two parameters.
-  The main difference between the two is that ``min_samples_leaf`` guarantees
-  a minimum number of samples in a leaf, while ``min_samples_split`` can
-  create arbitrary small leaves, though ``min_samples_split`` is more common
-  in the literature.
+* Use ``min_samples_split`` to control the number of samples at a leaf node.
+  A very small number will usually mean the tree will overfit, whereas a
+  large number will prevent the tree from learning the data. If the sample
+  size varies greatly, a float number can be used as percentage in this
+  parameter. Note that ``min_samples_split`` can create arbitrarily
+  small leaves.
 
 * Balance your dataset before training to prevent the tree from being biased
   toward the classes that are dominant. Class balancing can be done by
   sampling an equal number of samples from each class, or preferably by
   normalizing the sum of the sample weights (``sample_weight``) for each
-  class to the same value. Also note that weight-based pre-pruning criteria,
-  such as ``min_weight_fraction_leaf``, will then be less biased toward
-  dominant classes than criteria that are not aware of the sample weights,
-  like ``min_samples_leaf``.
-
-* If the samples are weighted, it will be easier to optimize the tree
-  structure using weight-based pre-pruning criterion such as
-  ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least
-  a fraction of the overall sum of the sample weights.
+  class to the same value.
 
 * All decision trees use ``np.float32`` arrays internally.
   If training data is not in this format, a copy of the dataset will be made.

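A minimal sketch of the tips retained above, assuming scikit-learn's documented behaviour that an integer ``min_samples_split`` is an absolute count while a float is a fraction of the training set; the data and the per-class weight normalisation are illustrative only:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (90% / 10%), chosen only for illustration.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Integer: absolute number of samples needed to split an internal node.
tree_abs = DecisionTreeClassifier(min_samples_split=20).fit(X, y)
# Float: fraction of the training set, useful when sample size varies greatly.
tree_frac = DecisionTreeClassifier(min_samples_split=0.05).fit(X, y)

# Balance classes by normalising the sum of sample weights per class.
sample_weight = np.ones_like(y, dtype=float)
for cls in np.unique(y):
    sample_weight[y == cls] /= np.sum(y == cls)
tree_balanced = DecisionTreeClassifier(max_depth=5).fit(X, y, sample_weight=sample_weight)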
doc/whats_new/v0.20.rst

Lines changed: 13 additions & 0 deletions
@@ -325,6 +325,12 @@ Support for Python 3.3 has been officially dropped.
   while mask does not allow this functionality.
   :issue:`9524` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  tree-based ensembles are deprecated and will be removed (fixed to 1 and 0
+  respectively) in version 0.22. These parameters were not effective for
+  regularization and at worst would produce bad splits. :issue:`10773` by
+  :user:`Bob Chen <lasagnaman>` and `Joel Nothman`_.
+
 - |Fix| :class:`ensemble.BaseBagging` where one could not deterministically
   reproduce ``fit`` result using the object attributes when ``random_state``
   is set. :issue:`9723` by :user:`Guillaume Lemaitre <glemaitre>`.
@@ -1005,6 +1011,13 @@ Support for Python 3.3 has been officially dropped.
   considered all samples to be of equal weight importance.
   :issue:`11464` by :user:`John Stott <JohnStott>`.
 
+- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
+  :class:`tree.DecisionTreeClassifier` and :class:`tree.DecisionTreeRegressor`
+  are deprecated and will be removed (fixed to 1 and 0 respectively) in version
+  0.22. These parameters were not effective for regularization and at worst
+  would produce bad splits. :issue:`10773` by :user:`Bob Chen <lasagnaman>`
+  and `Joel Nothman`_.
+
 
 :mod:`sklearn.utils`
 ....................

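Assuming the usual scikit-learn deprecation pattern (a DeprecationWarning emitted when a deprecated parameter is set to a non-default value, typically at fit time), a hypothetical sketch of how the change would surface to users:

import warnings
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Setting the deprecated parameter should keep working until 0.22,
# but is expected to trigger a deprecation warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)

for w in caught:
    print(w.category.__name__, w.message)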
examples/ensemble/plot_adaboost_hastie_10_2.py

Lines changed: 2 additions & 2 deletions
@@ -43,11 +43,11 @@
 X_test, y_test = X[2000:], y[2000:]
 X_train, y_train = X[:2000], y[:2000]
 
-dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
+dt_stump = DecisionTreeClassifier(max_depth=1)
 dt_stump.fit(X_train, y_train)
 dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)
 
-dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
+dt = DecisionTreeClassifier(max_depth=9)
 dt.fit(X_train, y_train)
 dt_err = 1.0 - dt.score(X_test, y_test)

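In the full example script the stump defined above is handed to AdaBoost; a hedged reminder of that usage (the boosting parameters shown here are illustrative, not part of this diff):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

dt_stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(base_estimator=dt_stump,
                         n_estimators=400,     # number of boosting rounds
                         learning_rate=1.0,
                         algorithm='SAMME.R')
# ada.fit(X_train, y_train) would follow, as in the example script.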
examples/ensemble/plot_gradient_boosting_oob.py

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@
 
 # Fit classifier with out-of-bag estimates
 params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
-          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
+          'learning_rate': 0.01, 'random_state': 3}
 clf = ensemble.GradientBoostingClassifier(**params)
 
 clf.fit(X_train, y_train)

examples/ensemble/plot_gradient_boosting_quantile.py

Lines changed: 1 addition & 2 deletions
@@ -41,8 +41,7 @@ def f(x):
 
 clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                 n_estimators=250, max_depth=3,
-                                learning_rate=.1, min_samples_leaf=9,
-                                min_samples_split=9)
+                                learning_rate=.1, min_samples_split=9)
 
 clf.fit(X, y)

examples/model_selection/plot_randomized_search.py

Lines changed: 0 additions & 2 deletions
@@ -55,7 +55,6 @@ def report(results, n_top=3):
 param_dist = {"max_depth": [3, None],
               "max_features": sp_randint(1, 11),
               "min_samples_split": sp_randint(2, 11),
-              "min_samples_leaf": sp_randint(1, 11),
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

@@ -74,7 +73,6 @@ def report(results, n_top=3):
 param_grid = {"max_depth": [3, None],
               "max_features": [1, 3, 10],
               "min_samples_split": [2, 3, 10],
-              "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

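For context, a sketch of how the trimmed param_dist is consumed by RandomizedSearchCV, mirroring the example script's structure; the dataset, cv and iteration count below are placeholders:

from scipy.stats import randint as sp_randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(n_estimators=20)

param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Sample 20 candidate settings at random and keep the best by CV score.
search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=20, cv=5)
search.fit(X, y)
print(search.best_params_)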
0 commit comments
