Computational engine plugin API with a "generic API" #24826
Conversation
This renames the methods an engine has to/can implement along the pattern of `pre_X`, `X`, `post_X` (X = fit, transform, ...). It also changes the engine loading mechanism to return all engines in a user-defined order, so that we can try each engine in turn to see which one supports the given input case.
Thoughts on the API of estimator engines, written in code: https://gist.github.com/betatim/1d8d5dffbbded4c01ddaf3aa400c708d WDYT?
Looking at this change, I'm wondering if there has been discussion with other projects (like SciPy) that have thought about the same problem? I see @betatim did try to respond on the forum when SPEC 2 was first envisioned (sorry you never got a response there, Tim!). Still, I think we should figure out these interfaces more broadly than just sklearn and SciPy. Here's an earlier comment of mine on @rgommers's proposal, discussing various use-cases. Here's a list of features I thought may be needed for skimage to adopt a similar approach. If we can design a good cross-project API, we won't need to refactor it 5 years down the line!
Thanks for the interest @stefanv! IIRC the proposed ideas in other places (that I've seen) are all based around deciding which backend to use from the types of a function's arguments. Do you know if there is a proposal that doesn't (exclusively) use types? As I see it, for sklearn the decision mechanism has to be more complex, because a backend might only be compatible with a certain set of hyper-parameters, or use some other logic that is "not just" based on the types of the arguments. Thoughts from reading your comment on the scikit-image repo:
To your two questions my answer is: yes :D I think that installing a plugin is enough "opt-in" for users, if there is a runtime way to disable them. Defaulting to "if it is installed, it is activated" means tracing which plugin is used when becomes even more important. It also makes it harder to determine what the order of plugins is. Explicit activation is a bit clunky IMHO because you are asking users to say "yes, activate" twice (installing and then activating). However, it makes it easy to determine the order: you ask the user to specify it (maybe based on the order in which they activate them).

I floated the idea of using an existing plugin tool like pluggy (the system behind pytest plugins) as a way to get good/reasonable answers to some of the questions around "infrastructure". However, there was not a lot of enthusiasm for this in sklearn (the preference is to do everything from scratch). I still think that using an existing system with a lot of users is "not a bad choice" (akin to "Python is the second best language for everything").
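For illustration, a minimal sketch of what plugin discovery could look like with pluggy. The project name `sklearn_engines`, the hook name, and the entry-point group are hypothetical, not part of this PR:

```python
import pluggy

# Hypothetical hook namespace; pluggy matches hook specifications
# with implementations via this project name.
hookspec = pluggy.HookspecMarker("sklearn_engines")
hookimpl = pluggy.HookimplMarker("sklearn_engines")


class EngineSpec:
    @hookspec
    def engine_classes(self):
        """Return a list of engine classes provided by this plugin."""


pm = pluggy.PluginManager("sklearn_engines")
pm.add_hookspecs(EngineSpec)
# Discover third-party plugins that registered the (hypothetical)
# "sklearn_engines" entry point in their packaging metadata.
pm.load_setuptools_entrypoints("sklearn_engines")

# Each plugin's hook returns a list; flatten them in plugin order.
all_engines = [cls for result in pm.hook.engine_classes() for cls in result]
```

This would answer the "infrastructure" questions (registration, ordering, enabling/disabling) with pluggy's existing mechanisms instead of a from-scratch loader.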
This allows us to raise an exception when an estimator is used with a different engine than the one used to fit it.
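A minimal sketch of how such a guard might look; the attribute name `_fitted_engine_class_`, the `_select_engine` helper, and the surrounding method are hypothetical, not code from this PR:

```python
def _get_engine(self, X):
    # Hypothetical guard: remember which engine class produced the
    # fitted state and refuse to mix it with a different engine.
    engine = self._select_engine(X)  # hypothetical engine lookup
    fitted_with = getattr(self, "_fitted_engine_class_", None)
    if fitted_with is not None and fitted_with is not type(engine):
        raise RuntimeError(
            f"{type(self).__name__} was fitted with engine "
            f"{fitted_with.__name__!r} but is now being used with "
            f"{type(engine).__name__!r}."
        )
    return engine
```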
I can see the point, but I'm concerned about bad things that can happen here if a user does not care enough about what plugins are installed (and bad practices of environment management are common). Maybe the "if it is installed, it is activated" default could be true only for an ordered white list of trusted plugins?
Personally I would wait until we have more practical cases to define the kind of plugin features we're looking for, and to be able to say whether existing tools can answer those needs or not. Also, the current plugin implementation we propose is short and simple; at this point it can be iterated on, and replaced by something else, without too much worry I think.
Thanks for this proposal, Tim!
I think the two ideas that you propose in this PR, that is, to cite part of this PR's description:
> The core idea of the "unified API" [...]
> [...] `get_engine_class` is now `get_engine_classes`.
could be proposed independently.
I appreciate both.
However, I fear that the first one can't easily be made compatible with other current design constraints (e.g. input validation) and future design constraints (e.g. the callback API).
Still, I believe we might be able to find a way through all those constraints with some design discussions.
I have done a small pass, but this PR's base diff is already quite large. Perhaps #24497 can be split into several pieces targeting a feature branch? What do you think @ogrisel and @fcharras?
```python
check_is_fitted(self)
engine = self._get_engine(X)
if hasattr(engine, "pre_transform"):
    X = engine.pre_transform(X)
return self._transform(X, engine)
```

```python
def _transform(self, X, engine):
    """Guts of transform method; no input validation."""
    X_ = engine.transform(X)
    if hasattr(engine, "post_transform"):
        engine.post_transform(X_)
```
Currently `pre_transform` is done in `transform` while `post_transform` is done in `_transform`. Hence, are we sure that `transform` is the only function which calls `_transform`, so that `pre_transform` will always be called? Must it always be called? Similarly, `post_transform` is currently always called, but must it be?
My instinct was to not impose a specific engine API for now, to keep per-estimator freedom to design the best way to make each one extendable on a case-by-case basis, and to reduce friction for the first few engines. Later, once we have implemented engines for a bunch of very heterogeneous estimators (e.g. 3 or 4), we can decide whether their designs are redundant enough to refactor them into a unified engine API.
This looks good. We could factorize the […] I agree this warrants that all engines implement the `accepts` method. My only worry is that […]
Closing in favour of #24497
Reference Issues/PRs
This PR builds on #24497. It makes two additions: (1) a "unified engine API" and (2) iterating through a list of engines to find one that wants to handle the request.
xref #22438
What does this implement/fix? Explain your changes.
The difference from #24497 is that this PR switches to a "unified engine API". This is mostly a change to how the existing methods of the engine are named. The goal of having a unified engine API is that we can re-use it for other estimators. Maybe there is a need for more than one engine API, but it seems like a good idea to start with one instead of starting from a "bespoke for every estimator" point. One of the great things about scikit-learn is that all estimators, classifiers, regressors, etc. follow a very well defined API. It makes sense to carry that good thing on.
The core idea of the "unified API" is that for every public method of an estimator the engine has to implement one corresponding method and can implement up to three. The methods are `pre_X`, `X` (mandatory) and `post_X` (X = `fit`, `predict`, `transform`, etc.). I'm not sure yet we need the three-stage process, but it seems like a good first attempt that allows engines to reuse code.

The other change in this PR is that `get_engine_class` is now `get_engine_classes`. It returns all matching engine classes in the order the user configured. This allows us to try them in turn until we find one that wants to handle the request. The last engine is the default engine, which will handle all requests (as a fallback). Instead of using `raise NotImplemented`, a new method is added to engines: `accepts(X, y, sample_weights)`. It will be called after instantiating the engine class and should return `True` or `False` to indicate whether the engine wants to handle this case. At this point the engine has access to the estimator as well as the data; I think this is all the information it needs to make a decision.
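To make this concrete, here is a minimal sketch of what an engine and the selection loop could look like under this description. The class name, the engine constructor signature, and the `"kmeans"` argument to `get_engine_classes` are assumptions for illustration, not the final API:

```python
class DefaultKMeansEngine:
    """Illustrative engine following the pre_X / X / post_X pattern."""

    def __init__(self, estimator):
        # The engine sees the estimator, so accepts() can inspect
        # hyper-parameters, not just the types of X and y.
        self.estimator = estimator

    def accepts(self, X, y=None, sample_weights=None):
        # The default engine is the fallback: it handles every request.
        return True

    def pre_transform(self, X):  # optional
        return X

    def transform(self, X):  # mandatory
        ...

    def post_transform(self, X_):  # optional
        return X_


def _get_engine(self, X, y=None, sample_weights=None):
    # Instantiate each configured engine class in user-defined order
    # and return the first one that accepts the request; the default
    # engine comes last and accepts everything. get_engine_classes is
    # the loader described above (its signature is assumed here).
    for engine_class in get_engine_classes("kmeans"):
        engine = engine_class(self)
        if engine.accepts(X, y, sample_weights):
            return engine
```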
Any other comments?

I will attempt to switch `KNeighborsClassifier` to have an engine as well. Then we'd have two different estimators using engines.

Something I've not worked out yet is what arguments should be passed to `pre_X` and `post_X`, and what exactly they should return. If you have input I'd be happy to hear it.