Duplicate logic in Decision Tree Building #7338
Comments
Additionally, there's a separate issue of how …

Input from @glouppe would be most welcome.

Doesn't setting …

I suspect that check is redundant and never used, as long as the splitters correctly ensure that no split results in …

Indeed, the main issue is that given two trees with only …

I would hope the condition there regarding `min_weight_leaf` is never …
Well, this is the code I'm running: …

In `est_weight`, there are definitely cases where … In `est_sample`, … Regardless, it shows that these two if statements are not executing the same thing at all, which leads to (incorrect) growth of different trees.
Opened a PR at #7340 so you can see the changes I am proposing.
But aren't these cases eliminated by the Splitter checks on …

No, the …

Yes, but if one is marked as a leaf node that has `weighted_n_node_samples` < …

I will admit I'm not familiar enough with the code to know when splits are …

If things are working as I suspect, then this condition regarding …
Yeah, that makes sense. I feel like it is logical for us to expect them to be identical, though (in the sense that I think this would be the correct behavior)? Considering … I'd think that building two separate trees, one with a set …

Their effect as regularisers is identical, but …
Fair enough, I'll remove the test that checks whether the two parameters have the same effect in #7301, since the trees are different due to randomization. Thanks for the clarity @jnothman

The clarity came from collaboration. Thank you.

The random state is only used for tie breaking. Are the results also different for data with no ties?
Here are visualizations of the following trees:

```python
from itertools import product

from sklearn.tree import ExtraTreeRegressor, export_graphviz

# X and y are the "multilabel" test set in test_tree.py;
# assert_tree_equal and `name` also come from sklearn/tree/tests/test_tree.py.
for max_leaf_nodes, frac in product([None], [.2]):
    # test that min_samples_leaf and min_weight_fraction_leaf
    # yield the same results when sample_weight is None.
    est_weight = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                    min_weight_fraction_leaf=frac,
                                    random_state=0)
    est_sample = ExtraTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                    min_samples_leaf=frac,
                                    random_state=0)
    est_weight.fit(X, y)
    est_sample.fit(X, y)
    export_graphviz(est_weight, "weight.dot")
    export_graphviz(est_sample, "sample.dot")
    assert_tree_equal(est_weight.tree_, est_sample.tree_,
                      "{0} with equal values of min_samples_leaf "
                      "and min_weight_fraction_leaf "
                      "are unequal with parameter value "
                      "{1}".format(name, frac))
```

The output: the two rendered trees, `weight.dot` and `sample.dot` (images not reproduced here).
Is that tie breaking? That's not obvious to me.

No, I don't think there are any ties...

This question is still open, right?

Closed in #7441
Description
I've been working on #7301, and I found some seemingly duplicate logic in the decision tree code. At the least, it's something that requires more documentation; at the most, it's a bug.
So I was poking around, trying to figure out why I couldn't get equal trees for uniform sample weights when I set `min_weight_fraction_leaf` and `min_samples_leaf` to the same float value. I saw that the two built trees actually had radically different values of `min_samples_split`, despite not setting this value at all. The code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L229) sets `min_samples_split` to `max(min_samples_split, 2 * min_samples_leaf)`. This isn't documented at all, and should be if we decide to keep this statement.

Now, this logic makes sense to me: you can't make a split if there's no way to split the node such that each leaf has the minimum number of samples. However, isn't this logic already encapsulated when we're building the tree? Namely, I'm thinking of the `is_leaf` conditional, which determines whether or not to split a node (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L219):
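Roughly, the builder's leaf check combines the depth, sample-count, and weight constraints; the following is a paraphrased Python sketch of that condition (from memory, not an exact quote of `_tree.pyx`):

```python
# Paraphrased sketch of the leaf test in the depth-first tree builder
# (sklearn/tree/_tree.pyx, circa 2016); names follow the Cython source,
# but this is not an exact quote.
def node_is_leaf(depth, n_node_samples, weighted_n_node_samples,
                 max_depth, min_samples_split, min_samples_leaf, min_weight_leaf):
    return (depth >= max_depth
            or n_node_samples < min_samples_split
            or n_node_samples < 2 * min_samples_leaf
            or weighted_n_node_samples < min_weight_leaf)
```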
I think the statement `(n_node_samples < 2 * min_samples_leaf)` has the same effect as explicitly setting `min_samples_split` to the max of itself and `2 * min_samples_leaf`. Indeed, removing the `max` call does not cause any tests to fail.

Should we remove the explicit call to `max`? It seems like a piece of duplicate logic, but maybe I'm missing something; a quick sanity check of the equivalence is sketched below. ping @glouppe @raghavrv @jmschrei / anyone else.
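For what it's worth, here is a minimal sanity check of the claimed equivalence (my own illustration, not code from the library): for integer-valued parameters, `n >= max(a, b)` holds exactly when `n >= a` and `n >= b`, so folding `2 * min_samples_leaf` into `min_samples_split` up front rejects exactly the same splits as testing the two conditions separately in the builder.

```python
# Illustration only (not scikit-learn code): the split-eligibility test written
# two ways, to show the max() call duplicates the builder's separate checks.

def splittable_folded(n_node_samples, min_samples_split, min_samples_leaf):
    # min_samples_split pre-folded as in tree.py: max(min_samples_split, 2 * min_samples_leaf)
    effective_min_split = max(min_samples_split, 2 * min_samples_leaf)
    return n_node_samples >= effective_min_split

def splittable_separate(n_node_samples, min_samples_split, min_samples_leaf):
    # the two independent checks, as in the builder's leaf condition
    return (n_node_samples >= min_samples_split
            and n_node_samples >= 2 * min_samples_leaf)

# n >= max(a, b) iff (n >= a and n >= b), so the two formulations always agree.
assert all(
    splittable_folded(n, split, leaf) == splittable_separate(n, split, leaf)
    for n in range(30)
    for split in range(2, 12)
    for leaf in range(1, 8)
)
```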