10000 docs state min_samples_leaf reduces size of the tree · Issue #10773 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

docs state min_samples_leaf reduces size of the tree #10773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amueller opened this issue Mar 7, 2018 · 14 comments
Closed

docs state min_samples_leaf reduces size of the tree #10773

amueller opened this issue Mar 7, 2018 · 14 comments
Labels
Bug Documentation Easy Well-defined and straightforward way to resolve help wanted

Comments

@amueller
Copy link
Member
amueller commented Mar 7, 2018

The docs state that min_samples_leaf reduces the size of the tree. This is wrong. The size of the tree is not influenced by min_samples_leaf. See the discussion at #8399.

I would actually vote to deprecate the parameter, empirically 1 is always best.

I would have like to have a parameter that actually implements what I meant when I proposed min_samples_leaf but I'm not sure it's worth it. It would make for nicer trees than min_samples_split imho but I don't have the data to proof it ;)

@amueller amueller added Bug Easy Well-defined and straightforward way to resolve Documentation help wanted labels Mar 7, 2018
@jnothman
Copy link
Member
jnothman commented Mar 7, 2018 via email

@FarahSaeed
Copy link
Contributor

Hi ... Can I give this a shot if deprecation is intended?

@jnothman
Copy link
Member
jnothman commented Mar 10, 2018 via email

@amueller
Copy link
Member Author

I think deprecation is a good step.

I'd kind of like to introduce the "correct" parameter, which is what I intended and which is what's implemented in R's ranger (random forests) and C4.5. But I'm not sure how to best do that and if it's worth it. I think it would be strictly better than min_samples_split.

@amueller
Copy link
Member Author

@jnothman it "regularizes" but it doesn't reduce the size of the tree.

@jnothman
Copy link
Member
jnothman commented Mar 13, 2018 via email

@gamazeps
Copy link
Contributor

@FarahSaeed Are you still working on this ? If not I'd like to take a look at this :)

@FarahSaeed
Copy link
Contributor

@gamazeps How would you mitigate this issue?

@amueller
Copy link
Member Author
amueller commented Apr 6, 2018

deprecate parameter and remove it from the docs, I think.

The other option would be to actually implement the thing that is implemented in other libraries. That would change behavior in a backward-incompatible way, but only if min_samples_split>1 and in that case I would consider it a bugfix.
Though that would be kind of interesting for @janvanrijn because his paper uses the current implementation very explicitly (and shows that it's not useful).

Copy link
Contributor
janvanrijn commented Apr 6, 2018

FFR the paper: https://arxiv.org/abs/1710.04725

according to the functional anova analysis this hyperparameter is the most important over 100 datasets (functional anova does not take into account defaults), but further analysis shows that this hyperparameter should often kept left to default.

Probably also interesting for @mfeurer , as autosklearn tries to optimize this parameter and might waste resources with that.

(updated, wrong link)

@FarahSaeed
Copy link
Contributor

@gamazeps you can try this any time... good luck with your attempt..

@lasagnaman
Copy link

I will take a stab at this!

@lasagnaman
Copy link

Is it also appropriate to add a note to docstrings for BaseDecisionTree and BaseGradientBoosting? currently those docstrings are just 1 liner "this is an abstract class, please use derived classes".

@rth
Copy link
Member
rth commented Aug 23, 2018

Fixed in #11870

@rth rth closed this as completed Aug 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Documentation Easy Well-defined and straightforward way to resolve help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
0