Implement estimator tags #6599


Closed
amueller opened this issue Mar 28, 2016 · 47 comments · Fixed by #8022

@amueller (Member) commented Mar 28, 2016

Currently the common tests hard-code many things, like support for multi-output, requiring positive input, or not allowing specific kinds of predictions.
That's bad design, and it's also a big problem for third-party packages that need to add to the conditions in the common tests (you have to add to a hard-coded list of estimators with a certain property).
See scikit-learn-contrib/py-earth#96 for an example.

Currently we have a very poor mechanism for distinguishing classifiers and regressors for similar purposes (and also for deciding the default cross-validation strategy): the estimator_type attribute. That allows only a single tag (classifier, transformer, regressor).

I think we should deprecate estimator_type and instead add a more flexible estimator_properties dictionary.
This would allow us to programmatically encode assumptions of the algorithms (like vectorizers taking non-numeric data or naive Bayes requiring non-negative data) as well as clean up our act in the tests.
People wanting to add to scikit-learn-contrib, and those working in auto-sklearn-like settings (e.g. tpot), will appreciate that ;)

List of tags that I am coming up with

  • supports sparse data
  • positive data only
  • supports missing data
  • semi-supervised
  • multi-output only
  • multi-label support
  • multi-output regression
  • multi-label multi-output
  • 1d input only
  • multi-class support (or maybe "no multi-class support"?)
  • needs fitting (or maybe "stateless"? though the GP doesn't need fitting but is not stateless)
  • input dtype / dtype conversions?
  • sparse matrix formats / conversions
  • deterministic?
  • label transformation (not for data)
  • special input format? like for CountVectorizer and DictVectorizer? Or maybe we want a field "supported inputs" that lists ndarray, sparse formats, list, strings, dicts?
  • required parameters?
  • integer input / categorical input supported?
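
A rough sketch of what such an estimator_properties / tags dictionary could look like, with conservative defaults on a base class and per-estimator overrides (all key names below are purely illustrative, not a settled API):

class BaseEstimator(object):
    def _get_tags(self):
        # Conservative defaults; subclasses override whatever differs.
        return {'sparse_support': False,
                'positive_data_only': False,
                'missing_data_support': False,
                'multioutput': False,
                'multilabel': False,
                'stateless': False,
                'deterministic': True,
                'valid_inputs': ['ndarray']}

class MultinomialNBLike(BaseEstimator):
    def _get_tags(self):
        tags = super(MultinomialNBLike, self)._get_tags()
        # Naive Bayes needs non-negative input but handles sparse data.
        tags.update({'positive_data_only': True,
                     'sparse_support': True,
                     'multilabel': True})
        return tags

The common tests (and tools like auto-sklearn or tpot) could then query these keys instead of maintaining hard-coded lists of estimators.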

cc @GaelVaroquaux @ogrisel @mblondel @jnothman @mfeurer @rhiever

@amueller amueller added the API label Mar 28, 2016
@amueller amueller added this to the 0.18 milestone Mar 28, 2016
@mblondel (Member) commented

+1 Now that we have the contrib, estimator tags are becoming very important.

For the specific case of check_estimator, estimator_properties could also be an argument of the function.
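
Purely illustrative sketch of that suggestion; neither the estimator_properties argument nor the estimator below exist, this only shows the shape of the proposed call:

from sklearn.utils.estimator_checks import check_estimator

# Hypothetical: tell the common checks about a third-party estimator's
# properties directly, instead of relying on hard-coded lists.
check_estimator(MyThirdPartyEstimator,
                estimator_properties={'positive_data_only': True,
                                      'sparse_support': False})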

@mfeurer (Contributor) commented Mar 30, 2016

This is what we're using in auto-sklearn for multinomial naive Bayes (NB just as an example).

# DENSE, SPARSE, SIGNED_DATA and PREDICTIONS are constants defined in auto-sklearn.
@staticmethod
def get_properties(dataset_properties=None):
    return {'shortname': 'MultinomialNB',
            'name': 'Multinomial Naive Bayes classifier',
            'handles_regression': False,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': True,
            'is_deterministic': True,
            'input': (DENSE, SPARSE, SIGNED_DATA),
            'output': (PREDICTIONS,)}

With this information, we are able to automatically construct all valid pipelines we can configure with auto-sklearn for a given dataset. What we don't have yet, but might be of interest, is a preferred data structure and dtype as input, as well as the output dtype, to prevent unnecessary memory copies.

If this is what you're looking for, we (the developers of auto-sklearn) would be very interested and can help on this, or even work on this, depending on the amount of work and how this feature should finally look.

cc @KEggensperger who coded a lot of properties and pipeline creation in auto-sklearn

@jnothman (Member) commented

I think the auto-sklearn stuff is a good start, but we need to be careful to define the scope of which properties are and are not useful to annotate. Obviously, things that affect test applicability are worth annotating somehow, but working out what exact vocabulary describes the (in)applicability of tests is non-trivial. It would, for instance, be much easier in some senses to have a way of saying "don't run test x for this estimator", but that would need to be updated as new tests are added, and it would not be useful for the meta-model inspection used by the likes of auto-sklearn. What we include that is not relevant to testing exclusions needs to be limited, IMO.

@raghavrv (Member) commented May 4, 2016

Just noting that if we can collect all the names of the fit attributes for an estimator in that 'tag', check_is_fitted could be simplified and refactored into the base estimator!
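
A minimal sketch of that idea, assuming a hypothetical 'fit_attributes' tag that lists the attributes an estimator sets during fit (all names illustrative):

from sklearn.exceptions import NotFittedError

class BaseEstimator(object):
    def _get_tags(self):
        return {'fit_attributes': []}

    def _check_is_fitted(self):
        # Generic fitted-check driven by the tag, instead of passing
        # per-estimator attribute names to check_is_fitted.
        missing = [attr for attr in self._get_tags()['fit_attributes']
                   if not hasattr(self, attr)]
        if missing:
            raise NotFittedError("This %s instance is not fitted yet."
                                 % type(self).__name__)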

@jnothman (Member) commented

Perhaps we should resolve to one-by-one remove estimators from sklearn.utils.testing.DONT_TEST!

@amueller (Member Author) commented

Yes, that variable needs to go.
I would like to add annotation that is more semantic than "don't test this".
Regression vs classification is already encoded in the class hierarchy, but maybe we want to encode it again?
I would definitely like to see something that tells us about sparse and dense support, possibly the existence of predict_proba and decision_function (which are tricky), and restrictions to positive data.

We should go through the tests and see what other special cases there are apart from positive data. The bicluster methods take n_samples x n_samples affinity matrices. Maybe we should actually just change the API for that (with a precomputed=False parameter)?

@amueller (Member Author) commented Jul 17, 2016

With categorical-variable support in trees coming soon, we might also want to indicate support for that.

@kmike (Contributor) commented Jul 22, 2016

It may be a long shot, but what about using PEP 484 type hints for that?

@amueller (Member Author) commented

@kmike how would that work? These properties are not types. We could create mixins that correspond to the properties and do inheritance, but I feel that would be abusing the type system for a simple annotation task.

@kmike (Contributor) commented Jul 27, 2016

@amueller maybe you're right; I haven't put much thought into how exactly it could be implemented. But the handles_... and input / output properties from #6599 (comment) look a lot like type annotations of the corresponding method arguments and return values.

@amueller (Member Author) commented

only that non-negative float is not a numpy type ;)

@jnothman (Member) commented Aug 2, 2016

And mypy generally does not support Numpy typing, as far as I can tell from comments like this one

@amueller (Member Author) commented

hm... having this for 0.18 might be a bit ambitious.

@jnothman (Member) commented

Quite!

@jnothman (Member) commented

I can't recall if it's been raised here, but one complication of estimator tags is that traits may depend on parameters (e.g. SGDClassifier can return probabilities conditional on the loss parameter). See also #7270 (comment)
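
For instance (behaviour of current scikit-learn releases; the probabilistic loss is spelled "log" in older versions and "log_loss" in recent ones):

from sklearn.linear_model import SGDClassifier

# Whether predict_proba is even exposed depends on the loss parameter,
# so a "has predict_proba" trait cannot be a fixed class-level property.
print(hasattr(SGDClassifier(loss="hinge"), "predict_proba"))     # False
print(hasattr(SGDClassifier(loss="log_loss"), "predict_proba"))  # True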

@jnothman (Member) commented

(And again, looking to type annotations as a saviour is not the answer. We're not dealing with types but domain semantics.)

@amueller amueller modified the milestones: 0.19, 0.18 Aug 30, 2016
@amueller (Member Author) commented Aug 30, 2016

Parameter-conditional behavior is certainly a tricky thing. We kind of tried to fake it with the delegate_has_attribute mechanism and duck typing. Will that allow us to go all the way? Probably not.

This somewhat ties into the discussion on default parameter grids, though they are very distinct use cases: I usually wouldn't search over different solvers when doing a grid search, but that's exactly what we want here.

half-baked proposal with two parts:

  • Allow each class to "register" its own dict of parameters to explore for testing (in a decentralized place, probably outside of the class definition?) that the common tests can pick up on.
  • Have traits be a class method that takes parameters? Do we need to know them before instantiation of an object? Instantiation is cheap, so we could always just instantiate and then ask for the traits in a property or method. [However, meta-estimators cannot easily be instantiated, so maybe we want a classmethod; see the sketch after this list.]
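
A half-baked sketch of the second bullet: a classmethod deriving traits from would-be constructor parameters, without needing an instance (all names are hypothetical):

class SGDClassifierLike(object):
    @classmethod
    def _get_traits(cls, **params):
        # Traits derived purely from constructor parameters.
        loss = params.get('loss', 'hinge')
        return {'has_predict_proba': loss in ('log', 'modified_huber')}

SGDClassifierLike._get_traits(loss='log')    # {'has_predict_proba': True}
SGDClassifierLike._get_traits(loss='hinge')  # {'has_predict_proba': False}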

@amueller (Member Author) commented

I think the goal should be that the common tests have decent test coverage alone btw.

@GaelVaroquaux (Member) commented

I think the goal should be that the common tests have decent test coverage alone btw.

+1: let's limit the goal, and hence go for an 80/20 (or rather a 95/5)
solution. I am afraid that if we go for more ambitious goals we will get
too much sophistication in here, and might still not reach a goal
such as making things plug together without user understanding (because,
intrinsically, machine learning doesn't work like that :D).

@jnothman (Member) commented

That is, yes, a "has predict_proba" trait does not depend on it having predict_proba before fit, but after.

On 18 November 2016 at 08:36, Joel Nothman wrote:

From a usage perspective the ability to predict_proba only pertains after fit.

On 18 November 2016 at 04:34, Andreas Mueller wrote:

Ok, but then we need to move away from tests on classes entirely and only test on instances.
Also, what is the promise then? Deliver all properties that can be derived before calling fit?
Do we change them after fit, for example if we learn that something gained a predict_proba during fit (which can happen in GridSearchCV)?



@amueller (Member Author) commented

There are examples where it looks like an estimator has predict_proba before fit but not after fit, and conversely where it doesn't have it before fit but does have it after.
What do we return in each case? Do we punt before fit and say "I don't know"? Or do we report the current best guess? Are we OK with tags changing after fit?

Imagine someone wants to fit all models that have predict_proba. Maybe more concretely, say someone wants to wrap models with CalibratedClassifierCV if they don't have predict_proba (in a "DefinitelyHasProbabilities" meta-estimator).
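
A sketch of that wrapping idea as a simple helper; the meta-estimator itself doesn't exist, this only illustrates the use case (and why predict_proba needs to be discoverable before fit):

from sklearn.calibration import CalibratedClassifierCV

def with_probabilities(estimator):
    # Wrap only those classifiers that cannot produce probabilities
    # themselves.
    if hasattr(estimator, "predict_proba"):
        return estimator
    return CalibratedClassifierCV(estimator)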

@amueller (Member Author) commented Dec 7, 2016

@jnothman @GaelVaroquaux if you have input on my questions above, that'll help me with creating a prototype.

@jnothman (Member) commented Dec 8, 2016

I think saying the predict_proba trait is only reliable after fit (or only making it available then) is the best we can do within reason.

I'm not sure what other questions I've not answered seeing as I want an instancemethod!

@amueller (Member Author) commented Dec 8, 2016

@jnothman Well, but then basically all tags are only "reliable" after fit. The question is what they are before fit. We definitely want to assess these properties before fitting. Should we indicate that they might change?

One of the main motivations for the tags right now is to decide which tests to apply. But we need to decide that before fitting, right? At least we need to know whether something is multi-output or not.

@amueller (Member Author) commented Dec 8, 2016

And for most estimators, we know for sure. Basically the SearchCV estimators can change anything, but most tags are known before fitting in nearly all estimators.

@amueller (Member Author) commented Dec 8, 2016

updated the issue description with a tentative list of things we might need to make our tests agnostic...
I'm not entirely sure how we should encode the supported input formats.
We could do something like
if "ndarray" in tags['valid_inputs']
maybe?
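
A slightly fuller sketch of that encoding, with a tentative key name and values:

import scipy.sparse as sp

tags = {'valid_inputs': ['ndarray', 'sparse', 'dataframe']}

def input_is_supported(X, tags):
    # Map the concrete input type onto the declared formats.
    if sp.issparse(X):
        return 'sparse' in tags['valid_inputs']
    if hasattr(X, 'iloc'):  # crude check for a pandas DataFrame
        return 'dataframe' in tags['valid_inputs']
    return 'ndarray' in tags['valid_inputs']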

@jnothman (Member) commented Dec 9, 2016 via email

@amueller (Member Author) commented Dec 9, 2016

@jnothman for grid-search we can search over steps in a pipeline, so whether the estimator accepts sparse X can change.

@amueller (Member Author) commented Dec 9, 2016

Playing devil's advocate, I just realized this is possible and does not result in an error:

from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, y = make_blobs()

# The grid swaps the single pipeline step between a classifier, a
# clusterer and a density estimator.
pipe = Pipeline([("thing", KMeans())])
grid = GridSearchCV(pipe, {"thing": [DecisionTreeClassifier(), KMeans(), KernelDensity()]})
grid.fit(X, y)

I'm not sure how I feel about that.

@GaelVaroquaux (Member) commented Dec 9, 2016 via email

@amueller (Member Author) commented Dec 9, 2016

@GaelVaroquaux for SVC, we actually know this before fit from the __init__ parameters. The biggest issue is really GridSearchCV.

I'm actually wondering where, outside of GridSearchCV, we don't know important properties before fit. If that is the only problematic case, then maybe we should treat it (and RandomizedSearchCV and other CV objects) differently.

@GaelVaroquaux (Member) commented Dec 9, 2016 via email

@amueller amueller mentioned this issue Dec 9, 2016
@amueller (Member Author) commented Dec 9, 2016

One more design question:
how do we feel about

class BaseEstimator(object):
    def _get_tags(self=None):
        return {}

this would allow _get_tags to be called with or without instantiation.
I have not really seen that pattern before, and I'm not sure it is reasonable. I think declaring mandatory constructor parameters would be a good tag, but I can't access that without instantiating ...
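
The two call styles this is meant to allow (in Python 3, where unbound methods are plain functions):

BaseEstimator._get_tags()      # called on the class, self defaults to None
BaseEstimator()._get_tags()    # called on an instance as usual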

@amueller (Member Author) commented Dec 9, 2016

Hm that messes with super though... ugh...

@amueller (Member Author) commented Dec 9, 2016

Ok, current issue: some models are really bad - like the Dummy ones and the NB ones - and they don't fulfill the minimum iris accuracy required in the tests. Does it make sense to try to solve that with tags? Should we not test for reasonable accuracy? Should we allow estimators to opt out of certain tests?
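
A sketch of what an opt-out via tags could look like inside a common test (the tag name and the threshold are purely illustrative):

def check_classifier_accuracy(estimator, X, y):
    # Estimators that declare themselves as weak baselines (e.g. dummy
    # or some NB models) skip the minimum-accuracy assertion.
    if estimator._get_tags().get('poor_score', False):
        return
    assert estimator.fit(X, y).score(X, y) > 0.83  # example threshold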

@jnothman (Member) commented Dec 11, 2016 via email

@amueller (Member Author) commented

Sparse support needs to be known prior to fit, regardless of whether the fitted model can change in its ability to do a sparse transform etc. Your argument about GridSearch says nothing with respect to the ability to be fit on sparse data.

I'm not sure I follow. If I grid-search and the grid involves a model that supports sparse data and one that doesn't, then it's unclear whether best_estimator_ will support fitting on sparse data.
I guess that is not really what we're interested in.

Arguably we are interested in whether fitting the grid-search on sparse data will succeed. That requires knowing whether all grid points support fitting on sparse data, which is tricky, but not impossible:
basically we would need to iterate over the grid, instantiate the models, get the sparse-support tag from all of them, and call all() on the results (see the sketch below).
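
A sketch of that iteration, assuming a hypothetical boolean 'sparse_support' tag on every estimator:

from sklearn.model_selection import ParameterGrid

def grid_supports_sparse_fit(search):
    # The search can be fit on sparse X only if every candidate estimator
    # appearing anywhere in the grid supports it.
    candidates = [value
                  for params in ParameterGrid(search.param_grid)
                  for value in params.values()
                  if hasattr(value, 'fit')]
    return all(est._get_tags().get('sparse_support', False)
               for est in candidates)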

I guess I was more thinking about "will calling predict succeed on a sparse matrix", which we can only say for sure after fit.

Yes, I think we'll need to let estimators opt out of certain tests.

I implemented that in #8022.

@jnothman (Member) commented Dec 12, 2016 via email

@jeremiedbb (Member) commented

@amueller Is this the right place to discuss the list of tags?

The new imputer classes in sklearn.impute accept both numeric and non-numeric inputs (e.g. strings). It might be good to add a tag for whether an estimator accepts non-numeric inputs or not.

@amueller (Member Author) commented

@jeremiedbb Probably not, I'll open a new issue?
