Computational engine plugin API with a "generic API" by betatim · Pull Request #24826 · scikit-learn/scikit-learn · GitHub

Computational engine plugin API with a "generic API" #24826


Closed
wants to merge 51 commits

Conversation

betatim (Member) commented Nov 3, 2022

Reference Issues/PRs

This PR builds on #24497. It makes two additions: (1) a "unified engine API" and (2) iterating through a list of engines to find one that wants to handle the request.

xref #22438

What does this implement/fix? Explain your changes.

The difference from #24497 is that it switches to a "unified engine API". This is mostly a change to how the existing methods of the engine are named. The goal of having a unified engine API is that we can re-use it for other estimators. Maybe there is a need for more than one engine API, but it seems like a good idea to start with one instead of starting from a "bespoke API for every estimator" point. One of the great things about scikit-learn is the fact that all estimators, classifiers, regressors, etc. follow a very well defined API. It makes sense to carry that good thing over to engines.

The core idea of the "unified API" is that for every public method of an estimator the engine has to implement one corresponding method and can implement up to three. The methods are pre_X, X (mandatory) and post_X (X = fit, predict, transform, etc.). I'm not sure yet that we need the three-stage process, but it seems like a good first attempt that allows engines to reuse code.
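
To make the contract concrete, here is a minimal sketch of an engine for a fit method, assuming an engine is constructed with the estimator it serves; the class name, constructor signature and placeholder computation are illustrative, not part of this PR:

    import numpy as np

    class ExampleKMeansEngine:
        """Hypothetical engine following the pre_X / X / post_X pattern."""

        def __init__(self, estimator):
            self.estimator = estimator

        def pre_fit(self, X, y=None, sample_weight=None):
            # Optional stage: convert the inputs, e.g. from numpy arrays to
            # specialised device arrays. Here we just coerce to float64.
            return np.asarray(X, dtype=np.float64), y, sample_weight

        def fit(self, X, y=None, sample_weight=None):
            # Mandatory "actual" stage: do the computation and set fitted
            # attributes on the estimator. Returns nothing because the
            # public fit returns self. (Placeholder computation below.)
            self.estimator.cluster_centers_ = X[: self.estimator.n_clusters]

        def post_fit(self, X, y=None, sample_weight=None):
            # Optional stage: e.g. convert fitted attributes back to numpy.
            pass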

The other change in this PR is that get_engine_class is now get_engine_classes. It returns all matching engine classes in the order the user configured them. This allows us to try them in turn until we find one that wants to handle the request. The last engine is the default engine, which handles all requests as a fallback. Instead of using raise NotImplemented, a new method is added to engines: accepts(X, y, sample_weights). It is called after instantiating the engine class and should return True or False to indicate whether the engine wants to handle this case. At this point the engine has access to the estimator as well as the data, which I think is all the information it needs to make a decision.
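
For illustration, the negotiation could look roughly like this (a sketch under the assumption that get_engine_classes takes an engine name and yields classes in the user-configured order, with the default engine last):

    def _get_engine(estimator, X, y=None, sample_weight=None):
        # Try each engine class in the user-configured order; the first
        # one whose accepts() returns True handles the request.
        for engine_class in get_engine_classes("kmeans"):
            engine = engine_class(estimator)
            if engine.accepts(X, y, sample_weight):
                return engine
        # Unreachable in practice: the default engine accepts everything.
        raise RuntimeError("No engine accepted the input.")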

Any other comments?

I will attempt to switch KNeighborsClassifier to have an engine as well. Then we'd have two different estimators using engines.

Something I've not worked out yet is what arguments should be passed to pre_X and post_X, and what exactly they should return. If you have input I'd be happy to hear it.

fcharras and others added 12 commits September 28, 2022 12:18
This renames the methods an engine has to/can implement along the
pattern of `pre_X`, `X`, `post_X` (X=fit, transform, ..).

Changes the engine loading mechanism to return all engines in a
user-defined order so that we can try each engine in turn to see which
one supports the given input case.
betatim (Member, Author) commented Nov 8, 2022

Thoughts on the API of estimator engines:

  • for every public method of an estimator an engine can implement up to three methods: pre_fit (optional), fit (required) and post_fit (optional), using fit as the example public method; for kneighbors the methods would be pre_kneighbors, kneighbors and post_kneighbors, and so on.
  • an engine has to implement the non-prefixed method. In the example above that is fit; pre_fit and post_fit are optional.
  • the required method is called the "actual" method (better name welcome).
  • the pre, post and "actual" methods take the same arguments as the public estimator method they correspond to.
  • the pre_ method is expected to return the arguments passed to it and is allowed to modify them. For example, it might convert them from numpy arrays to specialised device arrays.
  • if the public method of an estimator returns self then the engine method should not return anything. Otherwise the "actual" engine method should return the value to be returned to the caller of the public estimator method.
  • an engine's post_ method may modify the estimator object.

Written in code https://gist.github.com/betatim/1d8d5dffbbded4c01ddaf3aa400c708d
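
For illustration, the host side of this contract might look like the following (a hedged sketch, not the gist's exact code):

    def fit(self, X, y=None, sample_weight=None):
        engine = self._get_engine(X, y, sample_weight)
        if hasattr(engine, "pre_fit"):
            # pre_fit returns the (possibly converted) arguments.
            X, y, sample_weight = engine.pre_fit(X, y, sample_weight)
        # The "actual" method returns nothing because fit returns self.
        engine.fit(X, y, sample_weight)
        if hasattr(engine, "post_fit"):
            # post_fit may modify the estimator, e.g. convert fitted
            # attributes back to numpy arrays.
            engine.post_fit(X, y, sample_weight)
        return self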

WDYT?

stefanv (Contributor) commented Nov 8, 2022

Looking at this change, I'm wondering if there has been discussion with other projects (like SciPy) that have thought about the same problem? I see @betatim did try to respond on the forum when SPEC 2 was first envisioned (sorry you never got a response there, Tim!). Still, I think we should figure out these interfaces more broadly than just sklearn and SciPy.

Here's an earlier comment of mine on @rgommers's proposal, discussing various use-cases. Here's a list of features I thought may be needed for skimage to adopt a similar approach.

If we can design a good cross-project API, we won't need to refactor it 5 years down the line!

betatim (Member, Author) commented Nov 9, 2022

Thanks for the interest @stefanv!

IIRC, the ideas proposed elsewhere (that I've seen) all decide which backend to use based on the types of the arguments of a function. Do you know if there is a proposal that doesn't (exclusively) use types?

As I see it, for sklearn the decision mechanism has to be more complex, because a backend might only be compatible with a certain set of hyper-parameters, or might use some other logic that is based on more than just the types of the arguments.

Thoughts from reading your comment on the scikit-image repo:

  1. I agree with your "attribute 1": knowing which backend was selected is important for users, especially if you fall back to the default backend. Another thing to ponder for sklearn is the case where a particular backend was used when fitting an estimator but a different backend is selected when calling predict (at the very least we need to be able to give the user a good error message). I don't have a good idea on how to implement this yet.
  2. "all functions must be overridable" - all plugin systems I've seen rely on the host explicitly marking where a plugin can "hook" in. I think this is a sensible requirement, but might not be what you are looking for? If so, can you elaborate on your thoughts of why? This PR is my current best idea for how to reduce the amount of hassle on the host's side by having a "universal" API, instead of having a bespoke/per estimator API for plugins.
  3. Can you explain what you mean by your "attribute 3"? My brain isn't big enough to understand it :-/

To your two questions my answer is: yes :D I think that installing a plugin is enough "opt-in" for users, as long as there is a runtime way to disable plugins; default to "if it is installed, it is activated". This makes tracing which plugin is used, and when, even more important. It also makes it harder to determine the order of the plugins. Explicit activation is a bit clunky IMHO because you are asking users to say "yes, activate" twice (installing and then activating). However, it makes it easy to determine the order: you ask the user to specify it (maybe based on the order in which they activate the plugins).
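
To sketch what "installed means activated" could look like in practice, here is one way to do discovery via package entry points; the group name sklearn_engines is hypothetical:

    from importlib.metadata import entry_points

    def installed_engine_classes():
        # Every installed package that declares an entry point in this
        # (hypothetical) group is discovered, and therefore activated.
        # Uses the Python 3.10+ entry_points(group=...) signature.
        for ep in entry_points(group="sklearn_engines"):
            yield ep.load()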

I floated the idea of using an existing plugin tool like pluggy (the system behind pytest plugins) as a way to get good/reasonable answers to some of the questions around infrastructure. However, there was not a lot of enthusiasm for this in sklearn (the preference was to do everything from scratch). I still think using an existing system that has a lot of users is "not a bad choice" (akin to "Python is the second best language for everything").
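
For reference, a pluggy-based variant could look like the sketch below; the project name and the hook are made up, but the pluggy calls are its real API:

    import pluggy

    hookspec = pluggy.HookspecMarker("sklearn")
    hookimpl = pluggy.HookimplMarker("sklearn")

    class EngineSpec:
        @hookspec(firstresult=True)
        def kmeans_fit(self, estimator, X):
            """Hypothetical hook: the first plugin returning a non-None
            result handles the call."""

    pm = pluggy.PluginManager("sklearn")
    pm.add_hookspecs(EngineSpec)
    # Third-party engines would register themselves, e.g. via
    # pm.register(plugin_module) or setuptools entry points.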

This allows us to raise an exception when an estimator is used with a
different engine than the one used to fit it.
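
A minimal sketch of such a check, assuming fit records the selected engine class on the estimator (the attribute and method names here are made up):

    def _check_same_engine(self, engine):
        # Assumes fit stored type(engine) in self._engine_class_.
        fitted_with = getattr(self, "_engine_class_", None)
        if fitted_with is not None and type(engine) is not fitted_with:
            raise RuntimeError(
                f"This estimator was fitted with {fitted_with.__name__} "
                f"but {type(engine).__name__} was selected now."
            )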
@betatim betatim marked this pull request as ready for review November 15, 2022 09:06
@betatim betatim changed the title [Draft] Engine plugin API with a "unified API" Computational engine plugin API with a "generic API" Nov 15, 2022
fcharras (Contributor) commented:

Defaulting to "if it is installed, it is activated".

I can see the point, but I'm concerned about the bad things that can happen if users do not pay attention to which plugins are installed (and bad environment-management practices are common). Maybe the "if it is installed, it is activated" default could hold only for an ordered allowlist of trusted plugins?
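
A sketch of that suggestion: filter the discovered engines through an ordered allowlist (the names and the engine_name attribute are made up):

    TRUSTED_ENGINES = ["engine_a", "engine_b", "default"]

    def active_engine_classes(discovered):
        # Keep only engines on the allowlist, in the allowlist's order.
        by_name = {cls.engine_name: cls for cls in discovered}
        return [by_name[name] for name in TRUSTED_ENGINES if name in by_name]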

using an existing plugin tool like pluggy

Personally I would wait until we have more practical cases that define the kind of plugin features we're looking for, so that we can say whether existing tools answer those needs or not. Also, the current plugin implementation we propose is short and simple; at this point it can be iterated on, or replaced by something else, without too much worry I think.

jjerphan (Member) left a comment:

Thanks for this proposal, Tim!

I think the two ideas that you propose in this PR, to cite parts of this PR description:

"The core idea of the "unified API" […]"

"[…] get_engine_class is now get_engine_classes."

should be proposed independently.

I appreciate both.

However, I fear that the first idea can't easily be made compatible with current design constraints (e.g. input validation) and future design constraints (e.g. the callback API).

Still, I believe we might be able to find a way through all those constraints with some design discussion.

I have done a small pass, but the diff of this PR's base is already quite large. Could #24497 be split into several pieces targeting a feature branch? What do you think @ogrisel and @fcharras?

Comment on lines +1711 to +1721
        check_is_fitted(self)
        engine = self._get_engine(X)
        if hasattr(engine, "pre_transform"):
            X = engine.pre_transform(X)
        return self._transform(X, engine)

    def _transform(self, X, engine):
        """Guts of transform method; no input validation."""
        X_ = engine.transform(X)
        if hasattr(engine, "post_transform"):
            engine.post_transform(X_)

Currently pre_transform is done in transform while post_transform is done in _transform.

Hence, are we sure that transform is the only function that calls _transform, so that pre_transform will always be called? Must it always be called? Similarly, post_transform is currently always called, but must it be?

ogrisel (Member) commented Dec 1, 2022

About the generic Engine API

My instinct was not to impose a specific engine API for now, to keep per-estimator freedom to design the best way to make estimators extendable on a case-by-case basis, and to reduce friction for developing the first few engines.

And later, once we have implemented engines for a bunch of very heterogeneous estimators (e.g. 3 or 4), we can start to decide whether or not those designs are redundant enough to refactor them to follow a unified engine API.

The other change in this PR is that get_engine_class is now get_engine_classes. It returns all matching engine classes in the order the user configured them. This allows us to try them in turn until we find one that wants to handle the request. The last engine is the default engine, which handles all requests as a fallback. Instead of using raise NotImplemented, a new method is added to engines: accepts(X, y, sample_weights). It is called after instantiating the engine class and should return True or False to indicate whether the engine wants to handle this case. At this point the engine has access to the estimator as well as the data, which I think is all the information it needs to make a decision.

This looks good. We could factor the _get_engine negotiation method out into a mixin class so that engine-enabled estimators do not have to reimplement it each time.
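
A sketch of such a mixin, with illustrative names (the PR itself does not contain this class):

    class EngineAwareMixin:
        # Each estimator sets this to its engine namespace, e.g. "kmeans".
        _engine_name = None

        def _get_engine(self, X, y=None, sample_weight=None):
            # Same negotiation loop for every engine-enabled estimator;
            # the default engine at the end of the list accepts everything.
            for engine_class in get_engine_classes(self._engine_name):
                engine = engine_class(self)
                if engine.accepts(X, y, sample_weight):
                    return engine

    class KMeans(EngineAwareMixin):
        _engine_name = "kmeans"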

I agree this warrants that all engines implement the accepts method.

My only worry is that accepts and fit-time, engine-specific data validation can do redundant work. But maybe we can start this way and see later if we need to refactor how engine negotiation and fit-time data validation work together.

betatim (Member, Author) commented Dec 13, 2022

Closing in favour of #24497

betatim closed this Dec 13, 2022