[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided · Sundrique/scikit-learn@aba6564
Commit aba6564

nelson-liu authored and Sundrique committed
[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (scikit-learn#7301)
* fix min_weight_fraction_leaf when sample_weights is None
* fix flake8 error
* remove added newline and unnecessary assignment
* remove max bc it's implemented in cython and add interaction test
* edit weight calculation formula and add test to check equality
* remove test that sees if two parameter build the same tree
* reword min_weight_fraction_leaf docstring
* clarify uniform weight in forest docstrings
* update docstrings for all classes
* add what's new entry
* move whatsnew entry to bug fixes and explain previous behavior
1 parent 0550133 commit aba6564

File tree: 5 files changed (+138, -21 lines)


doc/whats_new.rst

Lines changed: 7 additions & 0 deletions
@@ -24,6 +24,13 @@ Enhancements
 Bug fixes
 .........
 
+    - The ``min_weight_fraction_leaf`` parameter of tree-based classifiers and
+      regressors now assumes uniform sample weights by default if the
+      ``sample_weight`` argument is not passed to the ``fit`` function.
+      Previously, the parameter was silently ignored. (`#7301
+      <https://github.com/scikit-learn/scikit-learn/pull/7301>`_) by `Nelson
+      Liu`_.
+
 .. _changes_0_18:
 
 Version 0.18
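To illustrate the entry above outside the diff: a minimal sketch (not part of the commit) of the behaviour users now get, assuming the bundled iris dataset and an illustrative threshold of 0.2; leaves are located via the public ``tree_`` attribute.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# fit() is called without sample_weight, so every sample counts as weight 1
# and each leaf must hold at least 0.2 * 150 = 30 samples; before this fix
# the constraint was silently ignored in this case.
clf = DecisionTreeClassifier(min_weight_fraction_leaf=0.2, random_state=0)
clf.fit(X, y)

leaves = clf.tree_.children_left == -1            # leaf nodes have no children
leaf_sizes = clf.tree_.n_node_samples[leaves]
print(leaf_sizes.min() >= 0.2 * X.shape[0])       # True with the fix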

sklearn/ensemble/forest.py

Lines changed: 15 additions & 10 deletions
@@ -807,8 +807,9 @@ class RandomForestClassifier(ForestClassifier):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1018,8 +1019,9 @@ class RandomForestRegressor(ForestRegressor):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1189,8 +1191,9 @@ class ExtraTreesClassifier(ForestClassifier):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1399,8 +1402,9 @@ class ExtraTreesRegressor(ForestRegressor):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow trees with ``max_leaf_nodes`` in best-first fashion.
@@ -1556,8 +1560,9 @@ class RandomTreesEmbedding(BaseForest):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow trees with ``max_leaf_nodes`` in best-first fashion.
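The same wording now appears in every forest class touched above. As a hedged usage sketch (not from the diff; estimator settings are illustrative), the constraint can be checked per tree through ``weighted_n_node_samples``, since omitting ``sample_weight`` gives every sample unit weight:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

est = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1,
                             random_state=0)
est.fit(X, y)  # no sample_weight: uniform weights are assumed

for tree in est.estimators_:
    t = tree.tree_
    leaf_weights = t.weighted_n_node_samples[t.children_left == -1]
    # each leaf carries at least 10% of the total weight seen by that tree
    assert leaf_weights.min() >= 0.1 * t.weighted_n_node_samples[0]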

sklearn/ensemble/gradient_boosting.py

Lines changed: 6 additions & 4 deletions
@@ -1330,8 +1330,9 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     subsample : float, optional (default=1.0)
         The fraction of samples to be used for fitting the individual base
@@ -1698,8 +1699,9 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     subsample : float, optional (default=1.0)
         The fraction of samples to be used for fitting the individual base
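For the gradient boosting classes the clarified sentence implies that omitting ``sample_weight`` and passing explicit unit weights describe the same constraint. A hedged sanity-check sketch (not part of the commit; synthetic data and parameters are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

params = dict(n_estimators=50, min_weight_fraction_leaf=0.05, random_state=0)
gbr_default = GradientBoostingRegressor(**params).fit(X, y)
gbr_uniform = GradientBoostingRegressor(**params).fit(
    X, y, sample_weight=np.ones(X.shape[0]))

# With this fix both fits should enforce the same leaf-weight threshold
# (0.05 * 200 = 10), so the two models are expected to coincide.
print(np.allclose(gbr_default.predict(X), gbr_uniform.predict(X)))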

sklearn/tree/tests/test_tree.py

Lines changed: 100 additions & 0 deletions
@@ -670,6 +670,30 @@ def check_min_weight_fraction_leaf(name, datasets, sparse=False):
             "min_weight_fraction_leaf={1}".format(
                 name, est.min_weight_fraction_leaf))
 
+    # test case with no weights passed in
+    total_weight = X.shape[0]
+
+    for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 6)):
+        est = TreeEstimator(min_weight_fraction_leaf=frac,
+                            max_leaf_nodes=max_leaf_nodes,
+                            random_state=0)
+        est.fit(X, y)
+
+        if sparse:
+            out = est.tree_.apply(X.tocsr())
+        else:
+            out = est.tree_.apply(X)
+
+        node_weights = np.bincount(out)
+        # drop inner nodes
+        leaf_weights = node_weights[node_weights != 0]
+        assert_greater_equal(
+            np.min(leaf_weights),
+            total_weight * est.min_weight_fraction_leaf,
+            "Failed with {0} "
+            "min_weight_fraction_leaf={1}".format(
+                name, est.min_weight_fraction_leaf))
+
 
 def test_min_weight_fraction_leaf():
     # Check on dense input
@@ -681,6 +705,82 @@ def test_min_weight_fraction_leaf():
         yield check_min_weight_fraction_leaf, name, "multilabel", True
 
 
+def check_min_weight_fraction_leaf_with_min_samples_leaf(name, datasets,
+                                                         sparse=False):
+    """Test the interaction between min_weight_fraction_leaf and min_samples_leaf
+    when sample_weights is not provided in fit."""
+    if sparse:
+        X = DATASETS[datasets]["X_sparse"].astype(np.float32)
+    else:
+        X = DATASETS[datasets]["X"].astype(np.float32)
+    y = DATASETS[datasets]["y"]
+
+    total_weight = X.shape[0]
+    TreeEstimator = ALL_TREES[name]
+    for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 3)):
+        # test integer min_samples_leaf
+        est = TreeEstimator(min_weight_fraction_leaf=frac,
+                            max_leaf_nodes=max_leaf_nodes,
+                            min_samples_leaf=5,
+                            random_state=0)
+        est.fit(X, y)
+
+        if sparse:
+            out = est.tree_.apply(X.tocsr())
+        else:
+            out = est.tree_.apply(X)
+
+        node_weights = np.bincount(out)
+        # drop inner nodes
+        leaf_weights = node_weights[node_weights != 0]
+        assert_greater_equal(
+            np.min(leaf_weights),
+            max((total_weight *
+                 est.min_weight_fraction_leaf), 5),
+            "Failed with {0} "
+            "min_weight_fraction_leaf={1}, "
+            "min_samples_leaf={2}".format(name,
+                                          est.min_weight_fraction_leaf,
+                                          est.min_samples_leaf))
+    for max_leaf_nodes, frac in product((None, 1000), np.linspace(0, 0.5, 3)):
+        # test float min_samples_leaf
+        est = TreeEstimator(min_weight_fraction_leaf=frac,
+                            max_leaf_nodes=max_leaf_nodes,
+                            min_samples_leaf=.1,
+                            random_state=0)
+        est.fit(X, y)
+
+        if sparse:
+            out = est.tree_.apply(X.tocsr())
+        else:
+            out = est.tree_.apply(X)
+
+        node_weights = np.bincount(out)
+        # drop inner nodes
+        leaf_weights = node_weights[node_weights != 0]
+        assert_greater_equal(
+            np.min(leaf_weights),
+            max((total_weight * est.min_weight_fraction_leaf),
+                (total_weight * est.min_samples_leaf)),
+            "Failed with {0} "
+            "min_weight_fraction_leaf={1}, "
+            "min_samples_leaf={2}".format(name,
+                                          est.min_weight_fraction_leaf,
+                                          est.min_samples_leaf))
+
+
+def test_min_weight_fraction_leaf_with_min_samples_leaf():
+    # Check on dense input
+    for name in ALL_TREES:
+        yield (check_min_weight_fraction_leaf_with_min_samples_leaf,
+               name, "iris")
+
+    # Check on sparse input
+    for name in SPARSE_TREES:
+        yield (check_min_weight_fraction_leaf_with_min_samples_leaf,
+               name, "multilabel", True)
+
+
 def test_min_impurity_split():
     # test if min_impurity_split creates leaves with impurity
     # [0, min_impurity_split) when min_samples_leaf = 1 and
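The added interaction test asserts the stricter of the two leaf constraints. As a quick illustration of the bound it computes (illustrative numbers, not taken from the diff):

# With no sample_weight, total_weight equals the number of samples.
n_samples = 150
frac = 0.2                # min_weight_fraction_leaf

# integer min_samples_leaf: bound is max(weight fraction, absolute count)
bound_int = max(n_samples * frac, 5)                     # 30.0

# float min_samples_leaf: interpreted as a fraction of n_samples
bound_float = max(n_samples * frac, n_samples * 0.1)     # 30.0

print(bound_int, bound_float)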

sklearn/tree/tree.py

Lines changed: 10 additions & 7 deletions
@@ -301,11 +301,12 @@ def fit(self, X, y, sample_weight=None, check_input=True,
                 sample_weight = expanded_class_weight
 
         # Set min_weight_leaf from min_weight_fraction_leaf
-        if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+        if sample_weight is None:
             min_weight_leaf = (self.min_weight_fraction_leaf *
-                               np.sum(sample_weight))
+                               n_samples)
         else:
-            min_weight_leaf = 0.
+            min_weight_leaf = (self.min_weight_fraction_leaf *
+                               np.sum(sample_weight))
 
         if self.min_impurity_split < 0.:
             raise ValueError("min_impurity_split must be greater than or equal "
@@ -592,8 +593,9 @@ class DecisionTreeClassifier(BaseDecisionTree, ClassifierMixin):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow a tree with ``max_leaf_nodes`` in best-first fashion.
@@ -862,8 +864,9 @@ class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
            Added float values for percentages.
 
     min_weight_fraction_leaf : float, optional (default=0.)
-        The minimum weighted fraction of the input samples required to be at a
-        leaf node.
+        The minimum weighted fraction of the sum total of weights (of all
+        the input samples) required to be at a leaf node. Samples have
+        equal weight when sample_weight is not provided.
 
     max_leaf_nodes : int or None, optional (default=None)
         Grow a tree with ``max_leaf_nodes`` in best-first fashion.
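The first hunk above is the actual fix: ``min_weight_leaf`` is now derived from ``n_samples`` whenever no weights are given. A standalone sketch of that branch (the helper name is ours, not scikit-learn API):

import numpy as np

def compute_min_weight_leaf(min_weight_fraction_leaf, n_samples,
                            sample_weight=None):
    # mirrors the branch added in BaseDecisionTree.fit by this commit
    if sample_weight is None:
        # uniform weights: the fraction applies to the number of samples
        return min_weight_fraction_leaf * n_samples
    # explicit weights: the fraction applies to their sum
    return min_weight_fraction_leaf * np.sum(sample_weight)

print(compute_min_weight_leaf(0.1, 100))                     # 10.0
print(compute_min_weight_leaf(0.1, 100, np.full(100, 2.0)))  # 20.0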
