[MRG] Bagging meta-estimator #2375
Conversation
For the example it would be very interesting to try the …
That might be a good idea, but the digits dataset is actually quite small. It takes less than a second to train an SVC on it; at that scale, I'd rather not draw any conclusions if one appears faster than the other.
You can nudge it as done in the RBM example to make it both larger and harder.
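For reference, a rough sketch of that kind of nudging on the digits dataset; the `nudge` helper is made up for illustration, and the shifting mechanism may differ from the one used in the RBM example:

```python
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import load_digits

def nudge(X, y):
    # Append four copies of each 8x8 digit image shifted by one pixel
    # (up, down, left, right), making the dataset five times larger
    # and somewhat harder.
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    images = X.reshape(-1, 8, 8)
    X_new = [X]
    for dy, dx in shifts:
        shifted = np.array([shift(img, (dy, dx), mode="constant") for img in images])
        X_new.append(shifted.reshape(X.shape[0], -1))
    return np.concatenate(X_new), np.tile(y, len(shifts) + 1)

X, y = load_digits(return_X_y=True)
X_big, y_big = nudge(X, y)
```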
In another direction, I was thinking about a figure like the ones I had done in my paper (see http://orbi.ulg.ac.be/bitstream/2268/130099/1/glouppe12.pdf, page 11): it can be used to show the effect of … Another great example would be a bias-variance decomposition of the error, illustrating what happens when base estimators are averaged together. (No matter what we choose here, such an example should be in our documentation anyway, in my opinion...)
+1 as well.
I have got a working example of the bias-variance decomposition of the mean squared error of a single estimator versus bagging. It still needs some work and documentation, but here is how it renders on a toy 1d regression problem. The first plot displays the function to predict, the predictions of single estimators over several instances of the problem and the mean prediction. The second plot is a decomposition of the mean square error at point
In particular, one can see from the lower plot (compare the solid green line and the dashed green line), or from the script output, that bagging mainly affects - and reduces - the variance part of the mean squared error.
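For readers following along, here is a minimal, self-contained sketch of how such a decomposition can be estimated empirically; it is not the actual example script, and the target function, constants and estimator choices are purely illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

def f(x):
    # Illustrative noise-free target function for a 1d regression problem.
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

rng = np.random.RandomState(0)
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)
n_repeat, n_train, noise_std = 50, 80, 0.1

for name, est in [("Tree", DecisionTreeRegressor()),
                  ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]:
    # Fit the same estimator on n_repeat independently drawn training sets
    # and record its predictions at each test point.
    y_predict = np.empty((len(X_test), n_repeat))
    for i in range(n_repeat):
        X_train = rng.uniform(-5, 5, size=(n_train, 1))
        y_train = f(X_train).ravel() + rng.normal(0, noise_std, n_train)
        y_predict[:, i] = est.fit(X_train, y_train).predict(X_test)
    # Bias^2: squared gap between the mean prediction and the true function.
    y_bias2 = (f(X_test).ravel() - y_predict.mean(axis=1)) ** 2
    # Variance: spread of the predictions across training sets.
    y_var = y_predict.var(axis=1)
    print("{0}: {1:.4f} (bias^2) + {2:.4f} (var)".format(
        name, y_bias2.mean(), y_var.mean()))
```

Bagging should leave the bias term roughly unchanged while shrinking the variance term, which is the point the plots above illustrate.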
Nice plot, but I can't discern the different curves. What do you think of breaking it into three plots (mse, bias^2, variance) using the same scale? Would it be interesting to add some noise?
The plot is not up to date, see the next commits :) I'll refresh this when I'm done.
Here is an updated version of the example. See the explanations in the docstring for details.
I think this makes quite a nice example overall, illustrating both the bias-variance decomposition and the benefits of bagging. What do you think? @ogrisel @arjoly
Very nice!
It is also quite interesting to explore other base estimators (KNN, SVR, etc) :)
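Swapping in other base estimators is indeed just a matter of changing the first constructor argument; a short sketch (the hyperparameters shown are illustrative):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Bagged ensembles built from two different base estimators.
bagged_knn = BaggingRegressor(KNeighborsRegressor(), n_estimators=10)
bagged_svr = BaggingRegressor(SVR(), n_estimators=10)
```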
estimators. The larger the variance, the more sensitive are the predictions for
`x` to small changes in the training set. The bias term corresponds to the
difference between the average prediction of the estimator (in cyan) and the
best possible model (in dark blue). On this problem, we can thus observe than
than => that?
The plot is a lot nicer!!!
Yes, and the GBRT model as well. Although this problem might be too easy to emphasize the interest of bagging GBRT models.
y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
y_var = np.var(y_predict, axis=1)

print("{0}: {1} (error) = {2} (bias^2) + {3} (var) + {4} (noise)".format(
You can use `{1:.4f}` to limit the precision to 4 decimal places and make the output easier to read.
Thanks, I was looking for that. Still not used to this Python 3 way of formatting :-)
It's been there for quite some time, at least since Python 2.6.
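A tiny illustration of the suggested format specifier (the values below are made up):

```python
err, bias2, var, noise = 0.123456, 0.045678, 0.067890, 0.009888
print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) + {3:.4f} (var) + {4:.4f} (noise)".format(
    "Tree", err, bias2, var, noise))
# Tree: 0.1235 (error) = 0.0457 (bias^2) + 0.0679 (var) + 0.0099 (noise)
```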
Before merging I would really like to have support for sparse … I think it's worth doing, though (with tests).
Me don't like sparse formats. I agree though, I'll look at this later.
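For context, a rough sketch of the kind of input handling sparse support would involve; this is a hypothetical helper, not the code that was eventually merged:

```python
import numpy as np
import scipy.sparse as sp

def check_input(X):
    # Accept sparse input by converting to CSR, which supports the row and
    # column indexing needed to draw random subsets of samples and features;
    # fall back to a dense array otherwise.
    if sp.issparse(X):
        return X.tocsr()
    return np.asarray(X)
```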
- If float, then draw `max_features * X.shape[1]` features.

bootstrap : boolean, optional (default=False)
    Whether instances are drawn with replacement.
instances => samples
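For context, a short usage sketch of the parameters discussed in this hunk (the hyperparameter values are illustrative, and defaults may differ from the snippet under review):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each base tree is trained on 50% of the samples, drawn with replacement
# (bootstrap=True), and sees a random 50% subset of the features.
clf = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,
                        max_samples=0.5,
                        max_features=0.5,
                        bootstrap=True)
```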
This PR is already pretty large (around 1300 additions). I would prefer to keep this feature …
In regression, the expected mean squared error of an estimator can be
decomposed in terms of bias, variance and noise. On average over dataset
instances LS of the regression problem, the bias term measures the average
LS?
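For reference, the decomposition this hunk refers to (assuming y = f(x) + noise with noise variance sigma^2, and with LS presumably denoting the learning set the estimator is fit on, as in the paper linked earlier) can be written as:

```latex
\mathbb{E}_{LS}\big[(y - \hat{f}_{LS}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(f(x) - \mathbb{E}_{LS}[\hat{f}_{LS}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{LS}\big[(\hat{f}_{LS}(x) - \mathbb{E}_{LS}[\hat{f}_{LS}(x)])^2\big]}_{\text{variance}}
```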
Let's merge this beast!!! +1
Thanks for your review Arnaud! @ogrisel Shall we merge this or wait for someone else's review?
I'll try to review tonight. @glouppe can you post the generated figures here?
construction procedure and then making an ensemble out of it. In many cases,
bagging methods constitute a very simple way to improve with respect to a
single model, without making it necessary to adapt the underlying base
algorithm.
For the noobs, it might be useful to state explicitly that bagging should be used with strong learners and that it reduces overfitting (and maybe to contrast it with boosting in this sense).
estimators_features))


def _partition_estimators(ensemble):
Should this go in `ensemble/base.py`?
Done
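For readers unfamiliar with the helper, a rough sketch of what such a partitioning function typically does (splitting `n_estimators` as evenly as possible across `n_jobs` workers); this is illustrative, not the exact code under review:

```python
import numpy as np

def partition_estimators(n_estimators, n_jobs):
    # Spread n_estimators across at most n_jobs workers as evenly as possible
    # and return per-job counts plus the start index of each slice.
    n_jobs = min(n_jobs, n_estimators)
    counts = np.full(n_jobs, n_estimators // n_jobs, dtype=int)
    counts[:n_estimators % n_jobs] += 1
    starts = np.concatenate([[0], np.cumsum(counts)])
    return n_jobs, counts.tolist(), starts.tolist()

# partition_estimators(10, 3) -> (3, [4, 3, 3], [0, 4, 7, 10])
```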
LGTM!
Any final words @larsmans? LGTM too.
All tests pass on my box. Merged by hand after extensive rebase.
Great! Thanks all!
Great :-) !!! 🍻
Great! Thank you all for the reviews :)
@larsmans By the way, did you have to squash everything into a single commit? :s
It's one feature, so you get one commit for it ;) Seriously: this was the easiest way to get rid of the duplicate and typo commits.
Nice work!
Meh, why is life so hard? ; ; (joking)
Git history in #2198 was messed up, so I made a new pull request. Sorry for the noise...
TODO: