
Fitting additional estimators for ensemble methods #1585


Closed
jwkvam opened this issue Jan 16, 2013 · 73 comments

@jwkvam (Contributor) commented Jan 16, 2013

I would like to propose an additional instance method on the ensemble estimators to fit additional sub-estimators. I kludged up an implementation for gradient boosting that appears to work in my limited testing. I was thinking the signature would be something like

def fit_extend(self, X, y, n_estimators):

where self.n_estimators would be updated as self.n_estimators += n_estimators. I don't think fit_extend is a particularly great name, so I'd welcome other suggestions. Perhaps we would want to hash the features and labels when fit() is called so we can check that the same features and labels are provided to this function.

If people think this would be a useful addition I would be willing to put together a PR, it seems like it should be straightforward to implement and add tests/docs for.

@amueller (Member)

This is definitely a feature we want. The question is: what would be the best way to implement it (in terms of API)?
There is something slightly similar in the adaboost pr: #522. That implements predicting with a subset of the estimators, which is also very helpful.

What do you think the scenario / code looks like where a user wants fit_extend? It is probably most useful in an interactive setting, right?

There is a slightly related function in SGD, partial_fit. That is actually for online learning, though, so it gets different data.

I'd like to get this feature while adding as little API and as few names as possible ;)

@amueller (Member)

Btw, I wouldn't hash X and y. I don't see a reason to force the user to provide the same input data.

jwkvam closed this as completed Jan 25, 2013
jwkvam reopened this Jan 25, 2013
@jwkvam (Contributor, Author) commented Jan 25, 2013

I would like to train a small number of sub-estimators at a time (and wait a relatively short time), then test on my cross-validation set; if my cross-validation error is still falling, I can continue training. This is as opposed to training a large number of sub-estimators and waiting a long time (several hours for me). That was my motivation.

I can understand being hesitant about adding another instance method. I thought it might be worthwhile to add another optional parameter to fit() but I saw this quote on the contributing page.

fit parameters should be restricted to directly data dependent variables

So I wasn't sure that would be a good idea. Would

def fit(self, X, y, n_estimators=None):  # None would mean: use self.n_estimators

be acceptable? Then if n_estimators > self.n_estimators, we would train the difference as additional estimators.

I agree that adding an n_estimators parameter to the prediction method is nice, but I think you'll agree that it solves a different problem. For my problem, performing grid search over n_estimators isn't really an option because it takes so long.
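For concreteness, the loop I have in mind would look roughly like this (a sketch: fit_extend is the hypothetical method proposed above, and X_train, y_train, X_valid, y_valid are assumed to be an existing train/validation split):

from sklearn.ensemble import GradientBoostingClassifier

est = GradientBoostingClassifier(n_estimators=100)
est.fit(X_train, y_train)
best = est.score(X_valid, y_valid)

while True:
    # hypothetical method: fit 100 additional stages on the same data
    est.fit_extend(X_train, y_train, n_estimators=100)
    score = est.score(X_valid, y_valid)
    if score <= best:
        break  # validation score stopped improving
    best = score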

@glouppe (Contributor) commented Jan 25, 2013

Until we agree on a proper interface to do that, you could use the following hack:

from sklearn.ensemble import RandomForestClassifier

# Train a forest of 10 trees
clf1 = RandomForestClassifier(n_estimators=10)
clf1.fit(X, y)

# Train a second, independent forest of 10 trees
clf2 = RandomForestClassifier(n_estimators=10)
clf2.fit(X, y)

# Extend clf1 with the trees of clf2
clf1.estimators_.extend(clf2.estimators_)
clf1.n_estimators += clf2.n_estimators

# clf1 now counts 20 trees

@glouppe (Contributor) commented Jan 25, 2013

Note that this only works for RandomForest and ExtraTrees. The same trick cannot be used with Gradient Boosting.

@amueller (Member)

See #1626. Would early stopping be an acceptable solution for you?

@jwkvam (Contributor, Author) commented Jan 29, 2013

@amueller I share the same opinion as @glouppe here #1626 (comment). I like early stopping but it doesn't resolve this in my opinion.

@amueller (Member)

Ok. Then we should look for a solution that allows for early stopping and adding additional estimators.

@amueller (Member)

Thinking about it a bit more, I think the partial_fit method would be the right interface. In SGD you can call partial_fit either with the same data or new data and it keeps on learning. The difference is that in SGD, if you manually iterate over batches, you get the original algorithm out. For ensembles, that would not be true. You would need to use the whole data on each call to partial_fit.
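For contrast, the existing SGD usage looks like this (a minimal sketch; batches is assumed to be an iterable of (X, y) chunks and y the full label vector):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.unique(y)  # all classes must be declared on the first call
for X_batch, y_batch in batches:
    clf.partial_fit(X_batch, y_batch, classes=classes)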

@GaelVaroquaux (Member)

Thinking about it a bit more, I think the partial_fit method would be the right interface.

I like this suggestion. What do other people think?

@glouppe (Contributor) commented Jan 30, 2013

Just to clarify, what exactly would happen in partial_fit in the case of ensembles? Would that add n_estimators more estimators, where n_estimators is the parameter value from the constructor? (Or could we change that value?)

@amueller (Member)

Good question. I also thought about that ;) Actually, you would want to change that, right? You could change it afterwards with set_params, but that feels awkward :-/

@pprett (Member) commented Jan 30, 2013

Sorry for joining the discussion so late.

I agree that we need such functionality; however, I'm not sure fit_extend is the best solution to the problem @jwkvam describes. In order to do early stopping, the user has to write code that repeatedly calls fit_extend and then checks the CV error.

I'd rather propose the monitor fit parameter that we discussed in the past: est.fit(X, y, monitor=some_callable), where some_callable is called after each iteration and is passed the complete state of the estimator. The callable could also return a value indicating whether or not training should proceed.

Using such an API one could implement not only early stopping but also custom reporting (e.g. interactively plotting the training vs. testing score) and snapshotting (every X iterations, dump the estimator object and copy it to some location; this is great if you are running on EC2 spot instances or other unreliable hardware ;-)
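A minimal sketch of what such a monitor could look like (the monitor keyword and its signature are part of the proposal, not an existing API; the iteration index and estimator arguments are assumptions):

import pickle

def snapshot_monitor(i, est):
    # dump the estimator every 100 iterations
    if i % 100 == 0:
        with open('snapshot_%d.pkl' % i, 'wb') as f:
            pickle.dump(est, f)
    return True  # True = keep training

# proposed usage:
# est.fit(X, y, monitor=snapshot_monitor)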

Even with such a monitor API, however, I think there would be a need for an API to fit more estimators once the model has been fitted (i.e. fit_extend) - often one trains a model and finds through introspection that it would probably have been better to run more iterations. Existing estimators use the warm_start parameter to implement such functionality (e.g. see linear_model.ElasticNet) - here is the docstring of the parameter::

warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

Personally, I'd prefer fit_extend (or fit_more) over warm_start - warm_start is quite implicit - you have to::

est = GradientBoostingRegressor(n_estimators=1000)
est.fit(X, y)

# now we want to fit more estimators to ``est`` 
# if you forget warm_start=True you nuke your previous estimators - quite implicit
est.fit(X, y, n_estimators=2000, warm_start=True)

# alternatively - more explicit
est.fit_more(X, y, n_estimators=1000)

@GaelVaroquaux (Member)

alternatively - more explicit

est.fit_more(X, y, n_estimators=1000)

To me, fit_more really corresponds to the partial_fit that we have in other estimators.

@amueller (Member)

@pprett I think there should be an easy way to do easy things. A monitor API is very flexible, but you actually want early stopping every time you use an estimator, right? So there should be no need to write a callback to do that. Also, it must be compatible with GridSearchCV.

@ogrisel (Member) commented Jan 30, 2013

To me, fit_more really corresponds to the partial_fit that we have in other estimators.

I don't think so. In partial_fit, "partial" stands for partial access to the data: you expect that the data does not fit in memory at once so you fit with one chunk at a time and update the model incrementally while scanning through the data.

In this case we want to change the number of sub estimators but might want to reuse exactly the same data at each call.

For a similar reason, ElasticNet has a warm_start constructor param instead of a partial_fit method, and SGDClassifier has both a warm_start param and a partial_fit method: they serve different purposes.

I agree that the monitor API would be very useful in general (for dealing with snapshoting, early stopping and such) but would not solve the issue of growing the number of sub-estimators in an interactive manner.

We could also have:

est.fit(X, y, n_additional_estimators=1, warm_start=True)

Or even to grow to 110% (10% more estimators):

est.fit(X, y, additional_estimators=0.1, warm_start=True)

@amueller (Member)

Hmm, I didn't look too much into the warm_start API that we have currently. There is no central documentation for that, right?
We should really think about the organization of the docs. We got quite a few comments on that in the survey :-/

@amueller (Member)

@ogrisel I'd have to have a look at the SGD implementation to see the details, but what is the difference in what actually happens between warm starts and partial_fit? I think we agree on the point of same / changing data.
Does warm_start do several epochs while partial_fit does not? That would make sense to me, and then we should probably keep them separate.
If we already have the warm_start API, we should definitely "just" implement that for the ensemble estimators.

@ogrisel (Member) commented Jan 30, 2013

warm_start just prevents fit from forgetting about the previous state (assuming that the inner state of the model will likely make it converge faster to the solution of the new call with the new hyperparameters).

@pprett (Member) commented Jan 30, 2013


I think the main difference is the semantics: the main idea behind warm_start is to converge more quickly - but no matter what value warm_start has, you get the same solution!
partial_fit, on the other hand, changes the underlying model. Consider the following example:

# the intended use-case for warm_start: faster convergence

clf = SGDClassifier(n_epochs=10)
clf.fit(X, y)

clf2 = clone(clf)
clf3 = SGDClassifier(n_epochs=10)

clf2.fit(X, y, warm_start=True)
clf3.fit(X, y)

# clf2 and clf3 should converge to the same solution - but since clf2
# can reuse the fitted weights from clf it might converge more quickly
# (under the hood, SGDClassifier.fit resets the "training" state of
# the estimator - the adaptive learning rate for sgd)

# now partial fit
clf = SGDClassifier(n_epochs=10)
clf.partial_fit(X, y, classes)
# training has not completed yet; the "training" state (adaptive learning rate) is stored

clf.partial_fit(X, y)  # resume with previous learning rate

Disclaimer: This example might be pedantic because the difference in terms of the learned weights is minimal - but conceptually they are IMHO totally different things...



@ogrisel (Member) commented Jan 30, 2013

The warm_start API was initially introduced to allow faster computation of a series of identical linear models when following a path of regularizers alpha. This is somewhat similar to iteratively growing the number of sub-estimators in a boosted ensemble model, so we could decide to reuse warm_start to address that use case as well; but if this API proves cumbersome for boosted models, it might be better to rethink it now that we have an additional use case.

@ogrisel (Member) commented Jan 30, 2013

I agree with @pprett's analysis.

@amueller (Member)

I don't know what to make of @pprett's analysis.

In the case of linear models, the estimator will converge to the same result, even when the warm start gets different data than the original fit. If we "warm started" ensembles / trees, that would not be the case.
We could try to ensure that the data provided when warm starting is the same as the original.

At the moment, "warm start" refers to an optimization procedure, of which there is none in tree-based methods, while partial_fit retains all of the state of the estimator and just keeps on fitting.

On the other hand, subsequent calls to partial_fit on batches lead to the same model as training on the whole data.

Again, this is different from the tree/ensemble case. I feel this goes back to my argument that this is more of a path algorithm than anything else ;)

@amueller (Member)

So I see two possible solutions: make sure warm_start is always called with the same data; then adding estimators would be warm starting. If not, we need a third way to refit a given model.
Where are the docs for that currently, btw? ;)

@ogrisel (Member) commented Jan 30, 2013

make sure warm-start is always called with the same data.

Why so? Let the user decide how and for what they want to use warm_start.

@ogrisel (Member) commented Jan 30, 2013

Where are the docs for that currently, btw ;)

http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.ElasticNet.html

warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

I agree that giving motivation would be helpful, for instance in this case:

"This is useful to efficiently compute a regularization path of ElasticNet models as done by the :func:enet_path function".

@amueller (Member)

I thought the argument was about semantics. I think semantics are defined by giving the user some guarantee of what will happen; that way the user doesn't need to know all the details of the algorithm.
I thought the guarantee of warm_start was "warm_start doesn't change the result", while the guarantee of partial_fit was "iterating over batches doesn't change the result".

If there is no guarantee, then I don't see how there can be common semantics.

@amueller (Member) commented Feb 1, 2013

So what about

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)
print(clf.score(X, y))
clf.set_params(warm_start=True, n_estimators=20)
clf.fit(X, y)  # would fit 10 more trees, keeping the first 10

Is that an acceptable usage pattern?

@amueller (Member) commented Feb 1, 2013

Or do you want these as parameters to fit? In SGD, warm_start is an __init__ parameter according to the docs.

@amueller (Member) commented Feb 8, 2013

Let's revive the discussion. In #1044 @GaelVaroquaux said he still prefers partial_fit.
Currently, I think warm_start is more in the right direction, but I don't have a strong opinion. @ogrisel @pprett @glouppe @larsmans, what is your opinion on the usage pattern I posted above? Or would you like another interface using warm_start or partial_fit?

@GaelVaroquaux (Member)

Currently, I think warm_start is more in the right direction, but I don't have a strong opinion.

What I dislike about using warm_start is that currently the contract with scikit-learn estimators is that you can call fit and get a valid/useful answer regardless of the history of the object. It may go faster or slower, but it's somewhat foolproof. If you pass different data to an ensemble estimator and use warm_start to fit more estimators, you will get nonsense. I am worried about having to write 'defensive' code to avoid such problems.

@pprett (Member) commented Feb 8, 2013

How would partial_fit work in our setting - is this correct::

est = GradientBoostingRegressor(n_estimators=1000)
est.fit(X, y)
...
est.partial_fit(X, y, n_estimators=1000)  # train another 1000

So would it take arbitrary fit_params, or just n_estimators?

Personally, I'm in favor of fit_more, since the use-case that our current partial_fit serves is quite different and fit_more is more explicit.

@glouppe (Contributor) commented Feb 8, 2013

I am also not very happy with the name partial_fit in the case of ensembles. From my point of view, that name suggests that it will build some estimators out of the total number requested in the constructor, but not more.

If we go for warm_start, then what would be the specification? You set n_estimators in the constructor and each call to fit appends n_estimators more estimators? Just like @amueller did above? Well, I am not against that pattern, but it nevertheless does not seem very intuitive to me.

From a very practical point of view, I like fit_more. It is explicit. No explanation required. However, it adds another function to our API...

(I have no strong opinion yet, these remarks simply reflect what I think at the moment)

@amueller (Member) commented Feb 8, 2013

I am not completely against adding a function, but I wouldn't like it to be too specific to the ensembles.
I really do see a connection to the path algorithms, so I think sharing an interface would be nice.

Consider the following hypothetical situation (maybe not so realistic):
You fitted an ensemble but now you see that you underfit and want to make your trees deeper (let's say we implemented that). This would be another example of path-like behavior. Would you also do that via fit_more? Or add a fit_deeper function?

I guess there is a trade-off between generality and explicitness.

@amueller (Member) commented Feb 8, 2013

@GaelVaroquaux The contract with partial_fit is IMHO that if you iterate over the data in batches, you will get the same result out. That will definitely not be the case if used here. So by design we would break the contract?!

@amueller (Member)

Thinking about it again, maybe there is room for a new method which we could use to implement #1626.
I wouldn't mind calling it fit_more, but in the sense of "do some more fitting along the parameter path", not in the sense of "fit additional estimators in the ensemble".

So IMHO we should either do warm_start (+ maybe defensive programming) or add another method that we can generally use to fit along a parameter path.

@amueller (Member)

Would fit_more then be defensive or not? ;)

@glouppe (Contributor) commented Feb 11, 2013

-1 on defensive. I'd rather document it well and let the user decide what is good for them.


@amueller (Member)

I would also be against defensive. I was just wondering if adding the function really solved an issue or if we just added another way to do warm starts. Both have the same defensive / not-defensive problem, right?

@jwkvam (Contributor, Author) commented Feb 12, 2013

My apologies if I'm simply repeating what has already been said. But it seems like you could split estimators into two classes: those that freeze parameters once they are fit (ensembles, DTs), and those that don't (linear models). By that I mean that with warm_start you won't refit the first n sub-estimators of an ensemble or the existing splits in a decision tree. Not being able to reach everywhere in the parameter space with warm_start for ensembles and DTs makes me think that an instance method would be more appropriate.

If an instance method is chosen, does it need to be more general, as @amueller noted? If at some point someone wanted the ability to increase the max_depth of the sub-estimators, could that also be handled with fit_more()?

For what it's worth, I would also be against defensive checks. As @GaelVaroquaux pointed out earlier, it provides a sub-sampling strategy, for instance if your training data doesn't fit in main memory.

@glouppe (Contributor) commented Feb 13, 2013

After some thought, I think we should see the bigger picture here. In the near future, I would like to implement generic meta-ensembles that can combine any kind of estimators together. What I would rather see is a "combination" mechanism that takes as input a list of (fitted) estimators and produces a meta-estimator combining them all.

In practice, I think we can achieve that without adding any new function to our API. For example, one could simply pass such a list of fitted estimators to the constructor of the meta-ensemble.

In terms of API, one could (roughly) implement such ensembles in the following way:

a) Bagging:

  • constructor: base_estimator (optional), n_estimators (>=0), a list L of fitted estimators (optional).
  • fit: extend L with n_estimators new instances of base_estimator fitted over (bootstrap copies of) the training samples. If no base estimator is given, then it is equivalent to combining the estimators in L.

b) Stacking:

  • constructor: base_estimator (optional), n_estimators (>=0), a list L of fitted estimators (optional).
  • fit: extend L with n_estimators new instances of base_estimator fitted on bootstrap samples, then refit a model over the predictions of the estimators.

c) Forest:

  • constructor: base_estimator (optional), n_estimators (>=0), a list L of fitted estimators or a forest (optional).
  • fit: extend L with n_estimators new instances of base_estimator fitted over the training samples. Here we could also check whether the estimators in L are forests or decision trees. Forests would be flattened in order to put all trees on the same level.

Also, in such a framework, computation of an ensemble could easily be distributed over several machines: build your estimators; pickle them; then recombine them into one single meta-estimator. One could even wrap that interface into a MapReduce cluster, without digging into our implementation at all!

What do you think? I am aware this is only relevant to some kinds of ensembles, though. For instance, GBRT and AdaBoost are (in my opinion) more suited to either warm_start or partial_fit.

@glouppe (Contributor) commented Feb 13, 2013

Just to be clear, to extend a forest, one would do something like:

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X, y)
forest_extended = RandomForestClassifier(n_estimators=100, L=forest)  # L: the proposed constructor parameter
forest_extended.fit(X, y)  # now counts 200 trees

@amueller (Member)

What is the motivation of that interface? I am totally with you in supporting more ensemble methods. I just feel it is quite awkward to have a different interface for GBRT and random forest. I don't really see the motivation for that.

If the main motivation is to distribute embarrassingly parallel jobs, then I think we should attack this by implementing more powerful parallelization. Doing it the way you described seems pretty manual and hacky.

Basically I feel your proposal just solves a very special case and leaves most cases unsolved.

@glouppe (Contributor) commented Feb 13, 2013

Well, ok... I just feel that extending boosted-like ensembles and extending average-like ensembles are quite different things.

@amueller (Member)

What is the use-case for your interface except parallelization? Or better: in what use cases do you need a different interface for boosted ensembles and bagging?

@glouppe (Contributor) commented Feb 13, 2013

The use case is when you want to combine several estimators together. It is natural for average-like ensembles, but makes no sense for boosted ensembles. From that perspective, I see "extending an estimator" as "combining" it with more base estimators.

@amueller (Member)

So the setting is that you have trained some bagging estimators and want to combine them together, right?
In which setting do you want to do that except for parallelization? It is not so clear to me but maybe I'm overlooking something obvious.

@glouppe (Contributor) commented Feb 13, 2013

In the case of Stacking, the estimators might be completely different (say you want to merge forests with SVMs).

(Indirectly, this could also be used to implement subsampling strategies or for monitoring the fitting process.)

@amueller (Member)

I'm not sure I get the stacking example. I would have imagined that if we had a stacking interface, you could specify one estimator as the base estimator and another as the one on top.

@glouppe (Contributor) commented Feb 13, 2013

As I see it, the point of stacking is to combine the predictions of estimators of different nature. The more diverse they are, often the better.

@amueller (Member)

Ok, so the base estimators would be different. But then we could also build this into the interface for stacking, right?

@jwkvam (Contributor, Author) commented Jan 9, 2014

Resolved with #2570

@glouppe (Contributor) commented Jan 9, 2014

@jwkvam We recently agreed in #2570 to implement this feature using the warm_start parameter. It is now implemented in GBRT. I'll try to update the forests with the same mechanism before the release.
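For reference, the warm_start pattern adopted in #2570 follows the usage sketched earlier in this thread (a minimal example, assuming X and y are already defined):

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=100, warm_start=True)
est.fit(X, y)

est.set_params(n_estimators=200)
est.fit(X, y)  # fits 100 additional stages, keeping the first 100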

@jwkvam (Contributor, Author) commented Jan 9, 2014

@glouppe You're right, I forgot I had written this for any ensemble. But really I just wanted it for GBRT :) so in my haste, I decided this issue was resolved. If you like you can reopen it and close it when you are done, it doesn't matter to me.
