MNT Revert the deprecation of min_samples_leaf and min_weight_fraction_leaf (#11998) · scikit-learn/scikit-learn@79f5d14 · GitHub
Commit 79f5d14

jnothman authored and rth committed
MNT Revert the deprecation of min_samples_leaf and min_weight_fraction_leaf (#11998)
1 parent 121dd5a commit 79f5d14

12 files changed: +229 -349 lines changed

doc/modules/ensemble.rst

Lines changed: 3 additions & 2 deletions
@@ -218,7 +218,7 @@ setting ``oob_score=True``.
 The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
 where :math:`M` is the number of trees and :math:`N` is the number of samples.
 In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``max_leaf_nodes`` and ``max_depth``.
+``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and ``min_samples_leaf``.

 Parallelization
 ---------------
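
For context, a minimal sketch (not part of this commit) of how the size-controlling parameters named above could be set on a forest. The toy dataset and the specific values are illustrative assumptions only:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Constrain tree growth to keep the fitted model small; looser settings
# (deeper trees, smaller leaves) grow the model roughly as O(M * N * log(N)).
forest = RandomForestRegressor(
    n_estimators=100,
    min_samples_split=10,   # require at least 10 samples to consider a split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    max_leaf_nodes=64,      # cap the number of leaves per tree
    max_depth=12,           # cap tree depth
    random_state=0,
).fit(X, y)
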
@@ -393,7 +393,8 @@ The number of weak learners is controlled by the parameter ``n_estimators``. The
 the final combination. By default, weak learners are decision stumps. Different
 weak learners can be specified through the ``base_estimator`` parameter.
 The main parameters to tune to obtain good results are ``n_estimators`` and
-the complexity of the base estimators (e.g., its depth ``max_depth``).
+the complexity of the base estimators (e.g., its depth ``max_depth`` or
+minimum required number of samples to consider a split ``min_samples_split``).

 .. topic:: Examples:
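
Again as a sketch only (not part of this commit): tuning ``n_estimators`` together with the complexity of the base estimator for AdaBoost. The dataset and parameter values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A slightly deeper base estimator with a higher min_samples_split trades
# per-learner complexity against the number of boosting rounds.
clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=2, min_samples_split=20),
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
).fit(X, y)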

doc/modules/tree.rst

Lines changed: 20 additions & 7 deletions
@@ -330,18 +330,31 @@ Tips on practical use
     for each additional level the tree grows to. Use ``max_depth`` to control
     the size of the tree to prevent overfitting.

-  * Use ``min_samples_split`` to control the number of samples at a leaf node.
-    A very small number will usually mean the tree will overfit, whereas a
-    large number will prevent the tree from learning the data. If the sample
-    size varies greatly, a float number can be used as percentage in this
-    parameter. Note that ``min_samples_split`` can create arbitrarily
-    small leaves.
+  * Use ``min_samples_split`` or ``min_samples_leaf`` to ensure that multiple
+    samples inform every decision in the tree, by controlling which splits will
+    be considered. A very small number will usually mean the tree will overfit,
+    whereas a large number will prevent the tree from learning the data. Try
+    ``min_samples_leaf=5`` as an initial value. If the sample size varies
+    greatly, a float number can be used as percentage in these two parameters.
+    While ``min_samples_split`` can create arbitrarily small leaves,
+    ``min_samples_leaf`` guarantees that each leaf has a minimum size, avoiding
+    low-variance, over-fit leaf nodes in regression problems. For
+    classification with few classes, ``min_samples_leaf=1`` is often the best
+    choice.

   * Balance your dataset before training to prevent the tree from being biased
     toward the classes that are dominant. Class balancing can be done by
     sampling an equal number of samples from each class, or preferably by
     normalizing the sum of the sample weights (``sample_weight``) for each
-    class to the same value.
+    class to the same value. Also note that weight-based pre-pruning criteria,
+    such as ``min_weight_fraction_leaf``, will then be less biased toward
+    dominant classes than criteria that are not aware of the sample weights,
+    like ``min_samples_leaf``.
+
+  * If the samples are weighted, it will be easier to optimize the tree
+    structure using weight-based pre-pruning criterion such as
+    ``min_weight_fraction_leaf``, which ensure that leaf nodes contain at least
+    a fraction of the overall sum of the sample weights.

   * All decision trees use ``np.float32`` arrays internally.
     If training data is not in this format, a copy of the dataset will be made.
doc/whats_new/v0.20.rst

Lines changed: 0 additions & 13 deletions
@@ -343,12 +343,6 @@ Support for Python 3.3 has been officially dropped.
   while mask does not allow this functionality.
   :issue:`9524` by :user:`Guillaume Lemaitre <glemaitre>`.

-- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
-  tree-based ensembles are deprecated and will be removed (fixed to 1 and 0
-  respectively) in version 0.22. These parameters were not effective for
-  regularization and at worst would produce bad splits. :issue:`10773` by
-  :user:`Bob Chen <lasagnaman>` and `Joel Nothman`_.
-
 - |Fix| :class:`ensemble.BaseBagging` where one could not deterministically
   reproduce ``fit`` result using the object attributes when ``random_state``
   is set. :issue:`9723` by :user:`Guillaume Lemaitre <glemaitre>`.

@@ -1035,13 +1029,6 @@ Support for Python 3.3 has been officially dropped.
   considered all samples to be of equal weight importance.
   :issue:`11464` by :user:`John Stott <JohnStott>`.

-- |API| The parameters ``min_samples_leaf`` and ``min_weight_fraction_leaf`` in
-  :class:`tree.DecisionTreeClassifier` and :class:`tree.DecisionTreeRegressor`
-  are deprecated and will be removed (fixed to 1 and 0 respectively) in version
-  0.22. These parameters were not effective for regularization and at worst
-  would produce bad splits. :issue:`10773` by :user:`Bob Chen <lasagnaman>`
-  and `Joel Nothman`_.
-

 :mod:`sklearn.utils`
 ....................

examples/ensemble/plot_adaboost_hastie_10_2.py

Lines changed: 2 additions & 2 deletions
@@ -43,11 +43,11 @@
 X_test, y_test = X[2000:], y[2000:]
 X_train, y_train = X[:2000], y[:2000]

-dt_stump = DecisionTreeClassifier(max_depth=1)
+dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
 dt_stump.fit(X_train, y_train)
 dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

-dt = DecisionTreeClassifier(max_depth=9)
+dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
 dt.fit(X_train, y_train)
 dt_err = 1.0 - dt.score(X_test, y_test)

examples/ensemble/plot_gradient_boosting_oob.py

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@

 # Fit classifier with out-of-bag estimates
 params = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
-          'learning_rate': 0.01, 'random_state': 3}
+          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
 clf = ensemble.GradientBoostingClassifier(**params)

 clf.fit(X_train, y_train)

examples/ensemble/plot_gradient_boosting_quantile.py

Lines changed: 2 additions & 1 deletion
@@ -41,7 +41,8 @@ def f(x):

 clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                 n_estimators=250, max_depth=3,
-                                learning_rate=.1, min_samples_split=9)
+                                learning_rate=.1, min_samples_leaf=9,
+                                min_samples_split=9)

 clf.fit(X, y)

0 commit comments
