Computational engine plugin API with a "generic API" by betatim · Pull Request #24826 · scikit-learn/scikit-learn · GitHub

Computational engine plugin API with a "generic API" #24826


Closed
wants to merge 51 commits

Conversation

betatim (Member) commented Nov 3, 2022

Reference Issues/PRs

This PR builds on #24497. It makes two additions: (1) a "unified engine API" and (2) iterating through a list of engines to find one that wants to handle the request.

xref #22438

What does this implement/fix? Explain your changes.

The difference from #24497 is that it switches to a "unified engine API". This is mostly a change to how the existing methods of the engine are named. The goal of having a unified engine API is that we can re-use it for other estimators. Maybe there is a need for more than one engine API, but it seems like a good idea to start with one instead of starting from a "bespoke API for every estimator" point. One of the great things about scikit-learn is the fact that all estimators, classifiers, regressors, etc. follow a very well defined API. It makes sense to carry that good thing over to engines.

The core idea of the "unified API" is that for every public method of an estimator the engine has to implement one corresponding method and can implement up to three. The methods are pre_X, X (mandatory) and post_X (X = fit, predict, transform, etc.). I'm not sure yet that we need the three-stage process, but it seems like a good first attempt that allows engines to reuse code.
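
To make the contract concrete, here is a minimal sketch of an engine for a fit method, assuming an engine is constructed with the estimator it serves; the class name, constructor signature and placeholder computation are illustrative, not part of this PR:

    import numpy as np

    class ExampleKMeansEngine:
        """Hypothetical engine following the pre_X / X / post_X pattern."""

        def __init__(self, estimator):
            self.estimator = estimator

        def pre_fit(self, X, y=None, sample_weight=None):
            # Optional stage: convert the inputs, e.g. from numpy arrays to
            # specialised device arrays. Here we just coerce to float64.
            return np.asarray(X, dtype=np.float64), y, sample_weight

        def fit(self, X, y=None, sample_weight=None):
            # Mandatory "actual" stage: do the computation and set fitted
            # attributes on the estimator. Returns nothing because the
            # public fit returns self. (Placeholder computation below.)
            self.estimator.cluster_centers_ = X[: self.estimator.n_clusters]

        def post_fit(self, X, y=None, sample_weight=None):
            # Optional stage: e.g. convert fitted attributes back to numpy.
            pass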

The other change in this PR is that get_engine_class is now get_engine_classes. It returns all matching engine classes in the order the user configured them. This allows us to try them in turn until we find one that wants to handle the request. The last engine is the default engine, which handles all requests as a fallback. Instead of using raise NotImplemented, a new method is added to engines: accepts(X, y, sample_weights). It is called after instantiating the engine class and should return True or False to indicate whether the engine wants to handle this case. At this point the engine has access to the estimator as well as the data, which I think is all the information it needs to make a decision.
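
For illustration, the negotiation could look roughly like this (a sketch under the assumption that get_engine_classes takes an engine name and yields classes in the user-configured order, with the default engine last):

    def _get_engine(estimator, X, y=None, sample_weight=None):
        # Try each engine class in the user-configured order; the first
        # one whose accepts() returns True handles the request.
        for engine_class in get_engine_classes("kmeans"):
            engine = engine_class(estimator)
            if engine.accepts(X, y, sample_weight):
                return engine
        # Unreachable in practice: the default engine accepts everything.
        raise RuntimeError("No engine accepted the input.")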

Any other comments?

I will attempt to switch KNeighborsClassifier to have an engine as well. Then we'd have two different estimators using engines.

Something I've not worked out yet is what arguments should be passed to pre_X and post_X, and what exactly they should return. If you have input I'd be happy to hear it.

fcharras and others added 12 commits September 28, 2022 12:18
This renames the methods an engine has to/can implement along the
pattern of `pre_X`, `X`, `post_X` (X=fit, transform, ..).

Changes the engine loading mechanism to return all engines in a
user-defined order so that we can try each engine in turn to see which
one supports the given input case.
betatim (Member, Author) commented Nov 8, 2022

Thoughts on the API of estimator engines:

  • for every public method of an estimator an engine can implement up to three methods: pre_fit (optional), fit (required) and post_fit (optional), using fit as the example public method; for kneighbors the methods would be pre_kneighbors, kneighbors and post_kneighbors, and so on.
  • an engine has to implement the non-prefixed method. In the example above that is fit; pre_fit and post_fit are optional.
  • the required method is called the "actual" method (better name welcome).
  • the pre, post and "actual" methods take the same arguments as the public estimator method they correspond to.
  • the pre_ method is expected to return the arguments passed to it and is allowed to modify them. For example, it might convert them from numpy arrays to specialised device arrays.
  • if the public method of an estimator returns self then the engine method should not return anything. Otherwise the "actual" engine method should return the value to be returned to the caller of the public estimator method.
  • an engine's post_ method may modify the estimator object.

Written in code https://gist.github.com/betatim/1d8d5dffbbded4c01ddaf3aa400c708d
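
For illustration, the host side of this contract might look like the following (a hedged sketch, not the gist's exact code):

    def fit(self, X, y=None, sample_weight=None):
        engine = self._get_engine(X, y, sample_weight)
        if hasattr(engine, "pre_fit"):
            # pre_fit returns the (possibly converted) arguments.
            X, y, sample_weight = engine.pre_fit(X, y, sample_weight)
        # The "actual" method returns nothing because fit returns self.
        engine.fit(X, y, sample_weight)
        if hasattr(engine, "post_fit"):
            # post_fit may modify the estimator, e.g. convert fitted
            # attributes back to numpy arrays.
            engine.post_fit(X, y, sample_weight)
        return self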

WDYT?

stefanv (Contributor) commented Nov 8, 2022

Looking at this change, I'm wondering if there has been discussion with other projects (like SciPy) that have thought about the same problem? I see @betatim did try to respond on the forum when SPEC 2 was first envisioned (sorry you never got a response there, Tim!). Still, I think we should figure out these interfaces more broadly than just sklearn and SciPy.

Here's an earlier comment of mine on @rgommers's proposal, discussing various use-cases. Here's a list of features I thought may be needed for skimage to adopt a similar approach.

If we can design a good cross-project API, we won't need to refactor it 5 years down the line!

betatim (Member, Author) commented Nov 9, 2022

Thanks for the interest @stefanv!

IIRC, the ideas proposed elsewhere (that I've seen) all decide which backend to use based on the types of the arguments of a function. Do you know if there is a proposal that doesn't (exclusively) use types?

As I see it, for sklearn the decision mechanism has to be more complex, because a backend might only be compatible with a certain set of hyper-parameters, or might use some other logic that is based on more than just the types of the arguments.

Thoughts from reading your comment on the scikit-image repo:

  1. I agree with your "attribute 1": knowing which backend was selected is important for users, especially if you fall back to the default backend. Another thing to ponder for sklearn is the case where a particular backend was used when fitting an estimator but a different backend is selected when calling predict (at the very least we need to be able to give the user a good error message). I don't have a good idea on how to implement this yet.
  2. "all functions must be overridable" - all plugin systems I've seen rely on the host explicitly marking where a plugin can "hook" in. I think this is a sensible requirement, but might not be what you are looking for? If so, can you elaborate on your thoughts of why? This PR is my current best idea for how to reduce the amount of hassle on the host's side by having a "universal" API, instead of having a bespoke/per estimator API for plugins.
  3. Can you explain what you mean by your "attribute 3"? My brain isn't big enough to understand it :-/

To your two questions my answer is: yes :D I think that installing a plugin is enough "opt-in" for users, as long as there is a runtime way to disable plugins; default to "if it is installed, it is activated". This makes tracing which plugin is used, and when, even more important. It also makes it harder to determine the order of the plugins. Explicit activation is a bit clunky IMHO because you are asking users to say "yes, activate" twice (installing and then activating). However, it makes it easy to determine the order: you ask the user to specify it (maybe based on the order in which they activate the plugins).
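
To sketch what "installed means activated" could look like in practice, here is one way to do discovery via package entry points; the group name sklearn_engines is hypothetical:

    from importlib.metadata import entry_points

    def installed_engine_classes():
        # Every installed package that declares an entry point in this
        # (hypothetical) group is discovered, and therefore activated.
        # Uses the Python 3.10+ entry_points(group=...) signature.
        for ep in entry_points(group="sklearn_engines"):
            yield ep.load()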

I floated the idea of using an existing plugin tool like pluggy (the system behind pytest plugins) as a way to get good/reasonable answers to some of the questions around infrastructure. However, there was not a lot of enthusiasm for this in sklearn (the preference was to do everything from scratch). I still think using an existing system that has a lot of users is "not a bad choice" (akin to "Python is the second best language for everything").
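
For reference, a pluggy-based variant could look like the sketch below; the project name and the hook are made up, but the pluggy calls are its real API:

    import pluggy

    hookspec = pluggy.HookspecMarker("sklearn")
    hookimpl = pluggy.HookimplMarker("sklearn")

    class EngineSpec:
        @hookspec(firstresult=True)
        def kmeans_fit(self, estimator, X):
            """Hypothetical hook: the first plugin returning a non-None
            result handles the call."""

    pm = pluggy.PluginManager("sklearn")
    pm.add_hookspecs(EngineSpec)
    # Third-party engines would register themselves, e.g. via
    # pm.register(plugin_module) or setuptools entry points.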

This allows us to raise an exception when an estimator is used with a
different engine than the one used to fit it.
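
A minimal sketch of such a check, assuming fit records the selected engine class on the estimator (the attribute and method names here are made up):

    def _check_same_engine(self, engine):
        # Assumes fit stored type(engine) in self._engine_class_.
        fitted_with = getattr(self, "_engine_class_", None)
        if fitted_with is not None and type(engine) is not fitted_with:
            raise RuntimeError(
                f"This estimator was fitted with {fitted_with.__name__} "
                f"but {type(engine).__name__} was selected now."
            )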
@betatim betatim marked this pull request as ready for review November 15, 2022 09:06
@betatim betatim changed the title [Draft] Engine plugin API with a "unified API" Computational engine plugin API with a "generic API" Nov 15, 2022
fcharras (Contributor) commented:

Defaulting to "if it is installed, it is activated".

I can see the point, but I'm concerned about the bad things that can happen if users do not pay attention to which plugins are installed (and bad environment-management practices are common). Maybe the "if it is installed, it is activated" default could hold only for an ordered allowlist of trusted plugins?
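
A sketch of that suggestion: filter the discovered engines through an ordered allowlist (the names and the engine_name attribute are made up):

    TRUSTED_ENGINES = ["engine_a", "engine_b", "default"]

    def active_engine_classes(discovered):
        # Keep only engines on the allowlist, in the allowlist's order.
        by_name = {cls.engine_name: cls for cls in discovered}
        return [by_name[name] for name in TRUSTED_ENGINES if name in by_name]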

using an existing plugin tool like pluggy

Personally I would wait until we have more practical cases that define the kind of plugin features we're looking for, so that we can say whether existing tools answer those needs or not. Also, the current plugin implementation we propose is short and simple; at this point it can be iterated on, or replaced by something else, without too much worry I think.

jjerphan (Member) left a comment:

Thanks for this proposal, Tim!

I think the two ideas that you propose in this PR, to cite parts of this PR description:

"The core idea of the "unified API" […]"

"[…] get_engine_class is now get_engine_classes."

should be proposed independently.

I appreciate both.

However, I fear that the first idea can't easily be made compatible with current design constraints (e.g. input validation) and future design constraints (e.g. the callback API).

Still, I believe we might be able to find a way through all those constraints with some design discussion.

I have done a small pass, but the diff of this PR's base is already quite large. Could #24497 be split into several pieces targeting a feature branch? What do you think @ogrisel and @fcharras?

Comment on lines +1711 to +1721
        check_is_fitted(self)
        engine = self._get_engine(X)
        if hasattr(engine, "pre_transform"):
            X = engine.pre_transform(X)
        return self._transform(X, engine)

    def _transform(self, X, engine):
        """Guts of transform method; no input validation."""
        X_ = engine.transform(X)
        if hasattr(engine, "post_transform"):
            engine.post_transform(X_)

Currently pre_transform is done in transform while post_transform is done in _transform.

Hence, are we sure that transform is the only function that calls _transform, so that pre_transform will always be called? Must it always be called? Similarly, post_transform is currently always called, but must it be?

ogrisel (Member) commented Dec 1, 2022

About the generic Engine API

My instinct was not to impose a specific engine API for now, to keep per-estimator freedom to design the best way to make estimators extendable on a case-by-case basis, and to reduce friction for developing the first few engines.

And later, once we have implemented engines for a bunch of very heterogeneous estimators (e.g. 3 or 4), we can start to decide whether or not those designs are redundant enough to refactor them to follow a unified engine API.

The other change in this PR is that get_engine_class is now get_engine_classes. It returns all matching engine classes in the order the user configured them. This allows us to try them in turn until we find one that wants to handle the request. The last engine is the default engine, which handles all requests as a fallback. Instead of using raise NotImplemented, a new method is added to engines: accepts(X, y, sample_weights). It is called after instantiating the engine class and should return True or False to indicate whether the engine wants to handle this case. At this point the engine has access to the estimator as well as the data, which I think is all the information it needs to make a decision.

This looks good. We could factor the _get_engine negotiation method out into a mixin class so that engine-enabled estimators do not have to reimplement it each time.
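
A sketch of such a mixin, with illustrative names (the PR itself does not contain this class):

    class EngineAwareMixin:
        # Each estimator sets this to its engine namespace, e.g. "kmeans".
        _engine_name = None

        def _get_engine(self, X, y=None, sample_weight=None):
            # Same negotiation loop for every engine-enabled estimator;
            # the default engine at the end of the list accepts everything.
            for engine_class in get_engine_classes(self._engine_name):
                engine = engine_class(self)
                if engine.accepts(X, y, sample_weight):
                    return engine

    class KMeans(EngineAwareMixin):
        _engine_name = "kmeans"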

I agree this warrants that all engines implement the accepts method.

My only worry is that accepts and fit-time, engine-specific data validation can do redundant work. But maybe we can start this way and see later if we need to refactor how engine negotiation and fit-time data validation work together.

betatim (Member, Author) commented Dec 13, 2022

Closing in favour of #24497

betatim closed this Dec 13, 2022