Problem with ducktyping and meta-estimators #1805

Closed
jnothman opened this issue Mar 23, 2013 · 19 comments

@jnothman
Member

In a number of places, sklearn controls flow according to the existence of some method on an estimator. For example: *SearchCV.score checks for score on the estimator; Scorer and multiclass functions check for decision_function; hasattr is also used for validation in AdaBoostClassifier.fit, multiclass._check_estimator and Pipeline, and for testing in test_common.

Meta-estimators such as *SearchCV, Pipeline, RFECV, etc. should respond to such hasattrs in agreement with their underlying estimators (or else the hasattr approach should be avoided).

This is possible by implementing such methods as a property that returns the corresponding method from the sub-estimator (or a closure around it), or raises AttributeError if the sub-estimator is found lacking (see #1801). hasattr would then function correctly. Caveats: the code would be less straightforward in some cases, and help()/pydoc won't show the methods as methods (with an argument list, etc.), though the property's docstring will still show.
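As a rough sketch of that pattern (illustrative names only, not scikit-learn's actual implementation), a delegating property makes hasattr on the wrapper agree with the wrapped estimator:

```python
# Minimal sketch of property-based delegation; names are made up for
# illustration and this is not the code proposed in #1801.
class Delegator:
    def __init__(self, estimator):
        self.estimator = estimator

    @property
    def score(self):
        # Accessing .score raises AttributeError when the sub-estimator
        # lacks it, so hasattr(Delegator(est), "score") mirrors est.
        return self.estimator.score


class WithScore:
    def score(self, X, y):
        return 1.0


class WithoutScore:
    pass


print(hasattr(Delegator(WithScore()), "score"))     # True
print(hasattr(Delegator(WithoutScore()), "score"))  # False
```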

@larsmans
Member

An alternative (which would break backward compat) is to split Pipeline etc. into classes PipelineClassifier, PipelineRegressor. That would mean a lot of work, but in the end it may turn out to be much easier to maintain.

@jnothman
Member Author

I don't think that distinction separates estimators with/without score or decision_function... does it?

And it does nothing for *SearchCV etc.

@amueller
Member

Yeah, I think we should stick with the ducktyping. Are there currently any failures?

@jnothman
Member Author
jnothman commented Apr 2, 2013

I agree we should currently stick with ducktyping, but we need to use something like properties to ensure the wrapping estimator reflects the underlying estimator's capabilities, or else the ducktyping is broken.

One can come up with examples that fail on the current release, such as calling GridSearch(Pipeline(...)).fit(...).score(...) where the final estimator in the pipeline lacks a score method, but there are currently no tests.

@GaelVaroquaux
Member

I agree we should currently stick with ducktyping, but we need to use
something like properties to ensure the wrapping estimator reflects the
underlying estimator's capabilities, or else the ducktyping is broken.

I am OK with properties here: adaptation is a valid use case for
properties, IMHO.

They should be well documented in the class docstring, though.

Thanks for bugging us about these core problems!

@amueller
Member
amueller commented Apr 2, 2013

Basically we need to fix Pipeline, right?
I thought we threw AttributeError there, but apparently we don't.
The GridSearch* should be good now, after your fix, correct?

About the *CV objects: I don't like them anyway ;)

@GaelVaroquaux
Member

About the *CV objects: I don't like them anyway ;)

Their code is horrendous, but they are very useful. I agree that they need
a refactor.

@amueller
Member
amueller commented Apr 2, 2013

My main problem is that they don't work together with GridSearchCV, so they are useless in Pipelines.

@GaelVaroquaux
Member

My main problem is that they don't work together with GridSearchCV, so
they are useless in Pipelines.

Yes, we have a problem with a notion of out-of-bag validation. With the
excellent work that was done on the scorer objects, I think that this
pattern could be implemented cleanly with CV=1 and a scorer object that
carries the validation data.
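As a hedged sketch of that idea (the class below is hypothetical, not an existing scikit-learn API): a scorer is just a callable taking (estimator, X, y), so it can carry its own held-out data and ignore whatever split the search passes in:

```python
# Hypothetical sketch: a scorer object that carries its own validation data.
class HeldOutScorer:
    def __init__(self, X_val, y_val):
        self.X_val = X_val
        self.y_val = y_val

    def __call__(self, estimator, X, y=None):
        # Ignore the data passed in by the search; score on the held-out set.
        return estimator.score(self.X_val, self.y_val)
```

Since the cv parameter of the search objects also accepts a pre-computed iterable of (train, test) index splits, a single predefined split plus a scorer like this would approximate the pattern described above.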

@amueller
Member
amueller commented Apr 2, 2013

Have you looked at what I wrote in #1626?

Letting the Scorer handle the validation would be a possibility, but GridSearchCV would still need to know which groups of parameters it has to pass on and how to interpret the returned scores.

@jnothman
Member Author
jnothman commented Apr 2, 2013

Basically we need to fix Pipeline, right?
I thought we threw AttributeError there, but apparently we don't.

You can't correctly throw AttributeError without either setting the attribute to exist upon availability or using magic methods (__getattr__) or attribute descriptors like property. By simply implementing methods, Pipeline cannot possibly be doing the right thing for this case.
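For illustration, the __getattr__ route mentioned here would look roughly like this (a sketch with made-up names, not what was ultimately merged):

```python
# Sketch: delegate unknown attribute lookups to the wrapped estimator, so a
# missing method raises AttributeError coming from the sub-estimator itself.
class GetattrDelegator:
    def __init__(self, estimator):
        self.estimator = estimator

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the wrapper fails.
        return getattr(self.estimator, name)
```

The trade-off is that __getattr__ forwards everything, whereas a property (or a per-method descriptor) exposes only an explicit, documented set of pass-through methods.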

But I don't know if Pipeline is the only issue (after my fix to *SearchCV in #1801); I don't know the code-base well enough. There is also a more general question here: making sure there is a clear list of methods that meta-estimators should consider supporting for pass-through operations.

I assume your complaints above are about things like RFECV, which stem from the inability to use *SearchCV in settings where previous results can dynamically inform future results without (fully) re-fitting a model. I think the cases where high efficiency is possible are more varied than #1626 suggests; I think the simplest involves having an ordering or topology of the relevant parameters so that low-order parameter values may sometimes be changed without refitting, with parallelisation occurring across folds and high-order parameters; and the matter is confused somewhat by the existing parameter get/set API.

@amueller
Member
amueller commented Apr 3, 2013

Could you elaborate on the last comments? How would you change the parameter API to support changing parameters without refitting?
I think RFECV is the exception here, rather than the rule, btw.

@amueller
Member
amueller commented Apr 3, 2013

One problem with the "no refitting required" idea might be that it needs to know the "worst-case parameter" beforehand, i.e. the setting that needs the most computation. This might not be known in advance when doing random sampling. While doing random parameter sampling on a CV object alone makes no sense, I think it does make sense in a pipeline.

@amueller
Member
amueller commented Apr 3, 2013

@jnothman you are obviously right that when using hasattr we can't correctly throw an error. But I think we started using try ... except (AttributeError, NotImplementedError) at some point. Not in GridSearchCV, apparently.
There was another issue with hasattr that could probably also be solved with properties: SVC and SGDClassifier implement predict_proba only under certain parameter settings (for SGDClassifier, only when loss='log'). I think that was the reason for the try ... except.
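For reference, the try/except style mentioned above looks roughly like this (the exact behaviour of predict_proba under a hinge loss has varied between releases, raising either NotImplementedError or AttributeError, so both are caught):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Hinge loss: the model provides no probability estimates.
clf = SGDClassifier(loss="hinge").fit(X, y)

try:
    scores = clf.predict_proba(X)
except (AttributeError, NotImplementedError):
    # Fall back to the decision function when probabilities are unavailable.
    scores = clf.decision_function(X)
```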

@jnothman
Member Author
jnothman commented Apr 3, 2013

[The topic of changing parameters without refitting isn't really within scope here, and I keep coming up with refinements to each solution I think of. But let's say we didn't want it to apply only to *SearchCV, so that, for instance, SelectKBest knew it didn't need to refit for a change of k. Perhaps the cleanest approach would involve calling a method est.needs_refit(k=new_value) before calling est.set_params(k=new_value). Ideally we should design the search path to maximise the frequency of needs_refit returning False. Without trial and error (which may be somewhat reasonable), the estimators then need to provide information about what the caller should expect needs_refit to return: perhaps a set of parameter names for which refitting will never be required; but this won't suffice when there are value-ordering or parameter-ordering constraints, a simple example being a Pipeline where parameter modifications upstream affect downstream, so any changes should be made from the last estimator to the first. So unless we went for a heuristic/random search over needs_refit, estimators themselves would need to either specify an order of parameters and values, or provide a way to order a given parameter grid.]
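To make that concrete (a purely hypothetical sketch: needs_refit does not exist in scikit-learn, and the helper below is made up), the intended call sequence would be roughly:

```python
# Hypothetical: `needs_refit` is the method proposed in the comment above,
# not an existing scikit-learn API.
def update_params(est, X, y, **new_params):
    refit_required = est.needs_refit(**new_params)  # hypothetical query
    est.set_params(**new_params)
    if refit_required:
        est.fit(X, y)
    # e.g. a SelectKBest-like estimator could answer False for a change of
    # k, since the per-feature scores it ranks by do not depend on k.
    return est
```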

I think the cases you're talking about in the last comment, @amueller, present exactly the issue. Do we want to be able to use hasattr or is except (AttributeError, NotImplementedError) the right approach to use everywhere ducktyping is needed? I get the impression it's the former, in which case we should be using properties, and testing that they work.

@amueller
Member
amueller commented Apr 3, 2013

Agreed :) [second part]

@amueller
Member

This is a core problem that we need to solve, but we haven't discussed your fix enough to be able to merge it :-/

jnothman added commits to jnothman/scikit-learn that referenced this issue Feb 5, 2014
@jnothman
Member Author

#3484 is more-or-less a duplicate

@jnothman
Member Author

Yay! Thanks to all involved in finally closing this!

On 27 December 2014 at 23:59, Lars notifications@github.com wrote:

Closed #1805 via 8e3f21d.
