Fitting additional estimators for ensemble methods #1585
This is definitely a feature we want. The question is: what would be the best way to implement it (in terms of API)? What do you think the scenario / code would look like where a user wants this? There is a slightly related function in SGD. I'd like to get this feature while adding as little API and as few new names as possible ;) |
Btw, I wouldn't hash the features and labels to check that the same data is passed. |
I would like to train a small number of sub-estimators at a time (and wait a relatively short time), then test on my cross-validation set, and if my cross-validation error is still falling, continue training. As opposed to training a large number of sub-estimators and waiting a long time (several hours for me). That was my motivation. I can understand being hesitant about adding another instance method. I thought it might be worthwhile to add another optional parameter to fit(), but I saw this quote on the contributing page.
So I wasn't sure that would be a good idea. Would def fit(self, X, y, n_estimators=self.n_estimators) be acceptable? I agree that adding an n_estimators parameter to the prediction method is nice, but I think you'll agree that it solves a different problem. For my problem, performing grid search over n_estimators isn't really an option because it takes so long. |
Until we agree on a proper interface to do that, you could use the following hack:
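For instance, something along these lines (just a sketch; it relies on the internal estimators_ list, so use at your own risk):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 5), rng.randint(0, 2, 100)

# fit a first forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# fit additional trees separately (different random_state so the new trees
# differ from the old ones) and graft them onto the first forest
extra = RandomForestClassifier(n_estimators=100, random_state=1)
extra.fit(X_train, y_train)

forest.estimators_ += extra.estimators_
forest.n_estimators = len(forest.estimators_)
```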
|
Note that this only works for RandomForest and ExtraTrees. The same trick cannot be used with Gradient Boosting. |
See #1626. Would early stopping be an acceptable solution to you? |
@amueller I share the same opinion as @glouppe here #1626 (comment). I like early stopping but it doesn't resolve this in my opinion. |
Ok. Then we should look for a solution that allows for early stopping and adding additional estimators. |
Thinking about it a bit more, I think the warm_start approach used in the linear models could also work here. |
I like this suggestion. What do other people think? |
Just to clarify, what exactly would happen to n_estimators in that case? |
Good question. I also thought about that ;) Actually, you would want to change that, right? You could change it afterwards by setting it to a new value before refitting. |
Sorry for joining the discussion so late. I agree that we need such functionality; however, I'm not sure a dedicated method is the best way. I'd rather propose a monitor (callback) API. Using such an API one could implement not only early stopping but also custom reporting (e.g. interactively plotting the training vs. testing score) and snapshotting (every X iterations, dump the estimator object and copy it to some location; this is great if you are running on EC2 spot instances or some other unreliable hardware ;-). Even with such a monitor API, though, the question remains how to grow an already fitted ensemble.
Personally, I'd prefer something like fit_more over fit_extend as a name. |
To me, fit_more really corresponds to the partial_fit that we have in SGDClassifier and friends. |
@pprett I think there should be an easy way to do easy things. A monitor API is very flexible, but actually you want to do early stopping every time you use an estimator, right? So there should be no need to write a callback to do that. Also, it must be compatible with GridSearchCV. |
I don't think so. In partial_fit, each call is expected to be given a new batch of data. In this case we want to change the number of sub-estimators but might want to reuse exactly the same data at each call. For a similar reason ElasticNet has a warm_start parameter. I agree that the monitor API would be very useful in general (for dealing with snapshotting, early stopping and such) but it would not solve the issue of growing the number of sub-estimators in an interactive manner. We could also have:
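Something along these lines (just a sketch; passing n_estimators to fit this way is hypothetical):
```python
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: the same training data as in the first call to fit
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# hypothetical: grow the already fitted forest to 500 trees on the same data
clf.fit(X_train, y_train, n_estimators=500)
```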
Or even to grow by 110% (10% more estimators):
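Again hypothetical: a float could be interpreted as a relative increase over the current number of estimators:
```python
clf.fit(X_train, y_train, n_estimators=1.1)  # 10% more estimators
```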
|
hum, I didn't look too much into the warm start API that we have currently. There is no central documentation for that, right? |
@ogrisel I'd have to have a look at the SGD implementation to see the details, but what is the difference in what actually happens between warm starts and partial_fit? I think we agree on the point of same / changing data. |
|
I think the main difference is the semantics: the main idea behind warm_start is that when you fit the estimator again, it can reuse the previously fitted weights as initialization, so it might converge more quickly, whereas partial_fit incrementally updates the model on each new batch of data.
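Roughly, with SGDClassifier (where both notions exist; just a sketch):
```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.rand(50, 3), rng.randint(0, 2, 50)
X2, y2 = rng.rand(50, 3), rng.randint(0, 2, 50)

# warm_start: each call to fit runs a full optimization on the data it is
# given, but starts from the previously fitted coefficients
clf = SGDClassifier(warm_start=True)
clf.fit(X1, y1)
clf.fit(X2, y2)  # refits on X2 only, initialized from the X1 solution

# partial_fit: each call performs one incremental update on a new batch,
# so the batches together play the role of a single training set
clf2 = SGDClassifier()
clf2.partial_fit(X1, y1, classes=np.array([0, 1]))
clf2.partial_fit(X2, y2)
```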
Disclaimer: This example might be pedantic because the differences in terms of the resulting models may be small.
Peter Prettenhofer |
I agree with @pprett's analysis. |
I don't know what to make of @pprett's analysis. In the case of linear models, the estimator will converge to the same result, even when the warm start gets different data than the original fit. If we "warm started" ensembles / trees, that would not be the case. At the moment, "warm start" refers to an optimization procedure, and there is none in tree-based methods. On the other hand, subsequent calls to partial_fit on batches lead to the same model as training on the whole data. Again, this is different from the tree/ensemble case. I feel this goes back to my argument that this is more of a path algorithm than anything else ;) |
So I see two possible solutions: make sure warm_start is always called with the same data, so that adding estimators would just be warm starting; or add a dedicated method for growing an already fitted ensemble. |
Why so? Let the user decide how and what he / she wants to use warm_start for. |
http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.ElasticNet.html
I agree that giving the motivation would be helpful, for instance in this case: "This is useful to efficiently compute a regularization path of ElasticNet models, as done by the :func:`enet_path` function." |
I thought the argument was about semantics. I think semantics are defined by giving the user some guarantee of what will happen. That way the user doesn't need to know all the details of the algorithm. If there is no guarantee, then I don't see how there can be common semantics. |
So what about the following:
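A sketch (it assumes the ensembles get a warm_start flag analogous to the linear models):
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X, y = rng.rand(200, 4), rng.randint(0, 2, 200)

clf = GradientBoostingClassifier(n_estimators=100)
clf.fit(X, y)

# turn on warm starting, raise n_estimators, and refit on the same data;
# only the 100 additional stages would be fitted
clf.set_params(warm_start=True, n_estimators=200)
clf.fit(X, y)
```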
Is that an acceptable usage pattern? |
Or do you want these as parameters to fit? |
Let's revive the discussion. In #1044 @GaelVaroquaux said he still prefers warm_start. |
What I dislike about using warm_start is that in the linear models it refers to warm-starting an optimization procedure, and there is no such procedure in the tree-based ensembles. |
how would the signature of such a method look? |
so it would take arbitrary parameters in addition to X and y? Personally, I'm in favor of a dedicated method, though I'm not sure about the name fit_extend. |
I am also not very happy with the name fit_extend. If we go for a new method, we should find a name that makes its semantics obvious. From a very practical point of view, I like solutions that add as little as possible to the API. (I have no strong opinion yet, these remarks simply reflect what I think at the moment.) |
I am not completely against adding a function, but I wouldn't like it to be too specific to the ensembles. Consider the following hypothetical situation (maybe not so realistic): after fitting, you also want to change some other parameter, say the max_depth of the sub-estimators, before fitting additional ones. Should the new method support that, too? I guess there is a trade-off between generality and explicitness. |
@GaelVaroquaux The contract with warm_start in the linear models is that you get the same result as a cold fit, just (hopefully) faster. That would not hold here. |
Thinking about it again, maybe there is room for a new method, which we could also use to implement #1626. So imho we should either do that or go with warm_start, but not both. |
Would we want to be defensive and check that the same data is passed as in the previous call to fit? |
-1 on defensive. I'd rather document it well and let the user decide what to do.
|
I would also be against defensive. I was just wondering if adding the function really solved an issue or if we just added another way to do warm starts. Both have the same defensive / not-defensive problem, right? |
My apologies if I'm simply repeating what has already been said. But it seems like you could split estimators into two classes: those that freeze parameters once they are fit (ensembles, DTs), and those that don't (linear models). By that I mean that with warm_start you won't refit the first n sub-estimators of an ensemble or the existing splits in a decision tree. The lack of being able to reach anywhere in the parameter space with warm_start for ensembles and DTs makes me think that an instance method would be more appropriate. If an instance method is chosen, does it need to be more general, as @amueller noted? If at some point someone wanted the ability to increase the max_depth of the sub-estimators, that could also be handled by letting such a method accept arbitrary parameter updates. For what it's worth, I would also be against defensive. As @GaelVaroquaux pointed out earlier, it provides a sub-sampling strategy, for instance, if your training data doesn't fit in main memory. |
After some thought, I think we should see the bigger picture here. In the near future, I would like to implement generic meta-ensembles that could combine any kind of estimators together. What I would rather see is a "combination" mechanism that takes as input a list of (fitted) estimators and produces a meta-estimator combining them all. In practice, I think we can achieve that without adding any new function to our API. For example, one could simply pass such a list of fitted estimators to the constructor of the meta-ensemble. In terms of API, one could (roughly) implement such ensembles in the following way: a) Bagging:
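(hypothetical sketch; Bagging here simply averages the fitted estimators it is given:)
```python
# clf1, clf2, clf3 are already fitted estimators, possibly built on
# different machines; the constructor name and signature are hypothetical
ensemble = Bagging(estimators=[clf1, clf2, clf3])
ensemble.predict(X_test)  # averaged prediction of the three
```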
b) Stacking:
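(again hypothetical; a second-level estimator is trained on the predictions of the fitted first-level estimators:)
```python
from sklearn.linear_model import LogisticRegression

# forest and svm are already fitted estimators
ensemble = Stacking(estimators=[forest, svm], combiner=LogisticRegression())
ensemble.fit(X_stack, y_stack)  # fits only the combiner, on held-out data
```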
c) Forest:
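(hypothetical; a forest is rebuilt from the trees of several fitted forests:)
```python
# forest1 and forest2 are already fitted forests
forest = Forest(estimators=forest1.estimators_ + forest2.estimators_)
```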
Also, in such a framework, computation of an ensemble could easily be distributed over several machines: build your estimators; pickle them; then recombine them into one single meta-estimator. One could even wrap that interface into a MapReduce cluster, without digging into our implementation at all! What do you think? I am aware this is only relevant to some kinds of ensembles though. For instance, GBRT and AdaBoost are (in my opinion) more suited to either warm_start or a fit_extend-like method. |
Just to be clear, to extend a forest, one would do something like:
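(continuing with the hypothetical Forest constructor from above:)
```python
from sklearn.ensemble import RandomForestClassifier

# forest is an already fitted forest; X, y are the same training data
# fit 100 extra trees and merge them into the existing forest
extra = RandomForestClassifier(n_estimators=100).fit(X, y)
forest = Forest(estimators=forest.estimators_ + extra.estimators_)
```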
|
What is the motivation of that interface? I am totally with you in supporting more ensemble methods. I just feel it is quite awkward to have a different interface for GBRT and random forests. I don't really see the motivation for that. If the main motivation is to distribute embarrassingly parallel jobs, then I think we should attack this by implementing a more powerful parallelization. Doing it the way you described seems pretty manual and hacky. Basically I feel your proposal just solves a very special case and leaves most cases unsolved. |
Well ok... I just feel that extending boosted-like ensembles and average-like ensembles are quite different things. |
What is the use-case for your interface except parallelization? Or better: in what use cases do you need a different interface for boosted ensembles and bagging? |
The use case is when you want to combine several estimators together. It is natural for average-like ensembles, but makes no sense in boosted ensembles. In that perspective, I see "extending an estimator" as "combining" it with more base estimators. |
So the setting is that you have trained some bagging estimators and want to combine them together, right? |
In case of Stacking, the estimators might be completely different (say you want to merge forests with SVMs). (Indirectly, this could also be used to implement subsampling strategies or for monitoring the fitting process.) |
I'm not sure I get the stacking example. I would have imagined that if we had a stacking interface, you could specify one estimator as the base estimator and another as the one on top. |
As I see it, the point of stacking is to combine the predictions of estimators of different nature. The more diverse they are, often the better. |
Ok, so the base estimators would be different. But then we could also build this into the interface for stacking, right? |
Resolved with #2570 |
@glouppe You're right, I forgot I had written this for any ensemble. But really I just wanted it for GBRT :) so in my haste, I decided this issue was resolved. If you like you can reopen it and close it when you are done, it doesn't matter to me. |
I would like to propose an additional instance method on the ensemble estimators to fit additional sub-estimators. I kluged up an implementation for gradient boosting that appears to work through my limited testing. I was thinking the signature would be something like fit_extend(X, y, n_estimators), where n_estimators is the number of additional sub-estimators to fit and
self.n_estimators += n_estimators
is updated accordingly. I don't think fit_extend is a particularly great name, so I'd welcome other suggestions. Perhaps we would want to hash the features and labels when fit() is called so we can check that the same features and labels are provided to this function. If people think this would be a useful addition, I would be willing to put together a PR; it seems like it should be straightforward to implement and add tests/docs for.
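A rough sketch of the intended usage (the signature is only a proposal):
```python
from sklearn.ensemble import GradientBoostingClassifier

# X, y: the training data
clf = GradientBoostingClassifier(n_estimators=100)
clf.fit(X, y)                          # fit the first 100 stages
clf.fit_extend(X, y, n_estimators=50)  # proposed: fit 50 additional stages
clf.n_estimators                       # would now be 150
```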