Implement estimator tags #6599
+1 Now that we have the contrib, estimator tags are becoming very important. For the specific case of …
This is what we're using in auto-sklearn for the multinomial naive Bayes (NB just as an example).

With this information, we are able to automatically construct all valid pipelines we can configure with auto-sklearn for a given dataset. What we don't have yet, but might be of interest, is a preferred data structure and dtype as input, as well as the output dtype, to prevent unnecessary memory copies. If this is what you're looking for, we (the developers of auto-sklearn) would be very interested in this and can help on or even work on this, depending on the amount of work and how this feature should finally look. cc @KEggensperger who coded a lot of the properties and pipeline creation in auto-sklearn
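For concreteness, the auto-sklearn approach has each pipeline component declare a dictionary of properties that the pipeline builder queries. A rough sketch of that shape — the class and key names here are illustrative, not auto-sklearn's exact API:

```python
class MultinomialNBComponent:
    """Hypothetical pipeline component declaring its own properties."""

    @staticmethod
    def get_properties():
        # queried before fit to decide whether this component
        # is valid for a given dataset / pipeline position
        return {
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": False,
            "handles_sparse": True,
            "requires_positive_input": True,  # multinomial NB needs X >= 0
            "preferred_input_dtype": "float32",  # to avoid memory copies
        }
```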
I think the auto-sklearn stuff is a good start, but we need to be careful to define the scope of what are useful and not useful properties to annotate. Obviously, things that affect test applicability are worth annotating somehow, but working out what exact vocabulary describes the (in)applicability of tests is non-trivial. It would, for instance, be much easier in some senses to have a way of saying "don't run test x for this estimator", but that needs to be updated as new tests are added, as well as not being useful for the meta-model inspection used by the likes of auto-sklearn. What we include that is not relevant to testing exclusions needs to be limited, IMO.
Just noting that if we can collect all the names of the fit attributes for an estimator in that 'tag', …

Perhaps we should resolve to one-by-one remove estimators from …
Yes, that variable needs to go. We should go through the tests and see what other special cases there are apart from positive data. The bicluster methods take n_samples x n_samples affinity matrices. Maybe we should actually just change the API for that (with a precomputed=False parameter)?

With the categorical variables in trees coming soon, we might also want to indicate support for that.
It may be a long shot, but what about using PEP 484 type hints for that?
@kmike how would that work? These properties are not types. We could create mixins that correspond to the properties and do inheritance, but I feel that would be abusing the type system for a simple annotation task.
@amueller maybe you're right; I haven't put much thought into how exactly it can be implemented. But …

Only that non-negative float is not a numpy type ;)

And mypy generally does not support NumPy typing, as far as I can tell from comments like this one.
Hm... having this for 0.18 might be a bit ambitious.

Quite!
I can't recall if it's been raised here, but one complication of estimator tags is that traits may be dependent on parameters (e.g. …).

(And again, looking to type annotations as a saviour is not the answer. We're not dealing with types but domain semantics.)
Parameter-conditional behavior is certainly a tricky thing. We kinda tried to fake it with the … This somewhat ties into the discussion on default parameter grids, though they are very distinct use cases. I usually wouldn't search over different solvers when doing a grid-search, but that's exactly what we want here. Half-baked proposal with two parts: …
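One way to express parameter-conditional tags is to compute them in an instance method instead of declaring them statically on the class. A minimal sketch, assuming a hypothetical `_get_tags()` method and `predict_proba` tag (neither existed in scikit-learn at this point):

```python
class SVCLike:
    """Hypothetical estimator whose capabilities depend on a parameter."""

    def __init__(self, probability=False):
        self.probability = probability

    def _get_tags(self):
        # the tag value depends on the current parameter setting,
        # so it is only meaningful for this particular instance
        return {"predict_proba": self.probability}

SVCLike(probability=True)._get_tags()   # {'predict_proba': True}
SVCLike(probability=False)._get_tags()  # {'predict_proba': False}
```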
I think the goal should be that the common tests alone give decent test coverage, btw.

+1: let's limit the goal, and hence go for an 80/20 (or rather a 95/5).
That is, yes, a "has predict_proba" trait does not depend on it having …
There are examples where it looks like an estimator does not have predict_proba before fit but has it after fit, and conversely. Imagine someone wants to fit all models that have …
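A toy illustration of this ambiguity, with a hypothetical estimator (not scikit-learn code) that only exposes `predict_proba` once `fit` has decided it is available:

```python
class CalibratableEstimator:
    """Hypothetical estimator whose predict_proba only appears after fit."""

    def fit(self, X, y):
        # attribute is created during fit, so hasattr() flips from
        # False to True once the model is fitted
        self.predict_proba = lambda X: ...
        return self

est = CalibratableEstimator()
hasattr(est, "predict_proba")   # False before fit
est.fit([[0.0]], [0])
hasattr(est, "predict_proba")   # True after fit
```

Selecting "all models that have predict_proba" via `hasattr` before fitting would therefore silently miss this estimator.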
@jnothman @GaelVaroquaux if you have input on my questions above, that'll help me with creating a prototype.
I think saying the predict_proba trait is only reliable after fit (or only making it available then) is the best we can do within reason. I'm not sure what other questions I've not answered, seeing as I want an instancemethod!
@jnothman Well, but then basically all tags are only "reliable" after fit. One of the main motivations for the tags right now is to decide which tests to apply. But we need to decide that before fitting, right? At least we need to know whether something is multi-output or not.

And for most estimators, we know for sure. Basically the SearchCV estimators can change anything, but most tags are known before fitting in nearly all estimators.
updated the issue description with a tentative list of things we might need to make our tests agnostic... I'm not entirely sure how we should encode the supported input formats. We could do something like `if "ndarray" in tags['valid_inputs']` maybe?

Not all tags are reliable only after fit, such as whether it takes a sparse X at fit. I suppose `get_tags()` is a bit frustrating if you have to do it before and after fit.
@jnothman for grid-search we can search over steps in a pipeline, so whether the estimator accepts sparse X can change.
Playing devil's advocate, I just realized this is possible and does not result in an error:

```python
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_blobs

X, y = make_blobs()
pipe = Pipeline([("thing", KMeans())])
grid = GridSearchCV(pipe, {"thing": [DecisionTreeClassifier(), KMeans(), KernelDensity()]})
grid.fit(X, y)
```

I'm not sure how I feel about that.
> There are examples where it looks like an estimator does not have predict_proba before fit but has it after fit, and conversely.

Actually, maybe it would be best to have two subclasses for these estimators. One example that I have in mind is the SVC, which will have predict_proba only if probability=True. This has been very confusing for users. It might be good to add a subclass that does the probabilities. If we can be systematic about removing those ambiguities, we have solved your problem, and I think that we have a better library. WDYT?
@GaelVaroquaux for SVC, we know this actually before fit. I'm actually wondering where outside of GridSearchCV we don't know important properties before fit. If that is the only problematic case, then maybe we should treat it (and RandomizedSearchCV and other CV objects) differently.
+1

If someone is using `param_grid={'svc__probability': [True, False]}`, I don't really see what the meaning of the estimator tags would be.
One more design question:

```python
class BaseEstimator(object):
    def _get_tags(self=None):
        return {}
```

this would allow …
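With the `self=None` default, the same method is callable both on the class and on an instance (in Python 3, where unbound methods are plain functions). A minimal sketch of what that design would allow:

```python
class BaseEstimator(object):
    def _get_tags(self=None):
        # self is None for a class-level call,
        # the estimator instance for an ordinary call
        return {}

BaseEstimator._get_tags()    # class-level call, no instance needed
BaseEstimator()._get_tags()  # ordinary instance call
```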
Hm, that messes with …
OK, current issue: some models are really bad — like the Dummy ones and the NB ones — and they don't fulfill the minimum iris accuracy required in the tests. Does it make sense to try to solve that with tags? Should we not test for reasonable accuracy? Should we allow estimators to opt out of certain tests?
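One possible shape for such an opt-out, sketched with a hypothetical `poor_score` tag that a common test could consult before enforcing an accuracy threshold:

```python
def check_classifier_iris_accuracy(estimator):
    # hypothetical common-test hook: intentionally weak baselines such as
    # the Dummy estimators would declare poor_score=True and skip the check
    if estimator._get_tags().get("poor_score", False):
        return
    ...  # otherwise fit on iris and assert a minimum accuracy
```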
Sparse support needs to be known prior to fit, regardless of whether the fitted model can change in its ability to do a sparse transform etc. Your argument about GridSearch says nothing with respect to the ability to be fit on sparse data.

Yes, I think we'll need to let estimators opt out of certain tests.
I'm not sure I follow. If I grid-search and the grid involves a model that supports sparse data and one that doesn't, then it's unclear whether best_estimator_ will support fitting on sparse data. I guess that is not really what we're interested in.

Arguably we are interested in whether fitting the grid-search on sparse data will succeed. That requires knowing whether all grid-points support fitting on sparse data, which is tricky, but not impossible. Basically we would need to iterate over the grid, instantiate the models, get the sparse support tag from all of them and call `all` on it. I guess I was more thinking about "will calling predict succeed on a sparse matrix", which we can only say for sure after fit.
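That iteration is mechanical enough to sketch, assuming each candidate exposes a hypothetical `_get_tags()` dict with a boolean `sparse` entry:

```python
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid

def grid_supports_sparse_fit(estimator, param_grid):
    """Return True only if every grid point claims to fit on sparse X."""
    for params in ParameterGrid(param_grid):
        # instantiate each candidate configuration and query its tag
        candidate = clone(estimator).set_params(**params)
        if not candidate._get_tags().get("sparse", False):
            return False
    return True
```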
I implemented that in #8022.
Well, yes, predict on sparse matrix can only be known after fit. I think all models which fit on sparse matrix will predict (if they predict) on sparse. Except we do have models that prefer a different sparse format at predict time, but that's more than we should tag, IMO.
@amueller Is this the right place to discuss the list of tags? The new imputer classes in sklearn.impute accept both numeric and non-numeric inputs (e.g. strings). It might be good to add a tag for whether an estimator accepts non-numeric inputs or not.

@jeremiedbb Probably not; I'll open a new issue.
Currently the common tests hard-code many things, like support for multi-output, or requiring positive input, or not allowing specific kinds of predictions. That's bad design, but also a big problem for 3rd-party packages that need to add to the conditions in the common tests (you need to add to the hard-coded list of estimators with a certain property). See scikit-learn-contrib/py-earth#96 for an example.

Currently we have a very poor mechanism for distinguishing classifiers and regressors for similar purposes (but also to use when deciding the default cross-validation strategy): the `_estimator_type` attribute. That allows only a single tag (classifier, transformer, regressor). I think we should deprecate `_estimator_type` and instead add a more flexible `estimator_properties` dictionary. This will allow us to programmatically encode assumptions of the algorithms (like vectorizers taking non-numeric data or NB taking non-negative data), as well as clean up our act with the tests. The people wanting to add to scikit-learn-contrib and auto-sklearn-like settings (tpot) will appreciate that ;)
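A minimal sketch of what such a dictionary could look like on an estimator — the tag names are placeholders, not a settled vocabulary:

```python
class MultinomialNBLike:
    """Hypothetical estimator exposing the proposed properties dict."""

    @property
    def estimator_properties(self):
        return {
            "estimator_type": "classifier",   # subsumes _estimator_type
            "requires_positive_X": True,      # NB takes non-negative data
            "multioutput": False,
            "input_types": ["ndarray", "sparse"],
        }
```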
List of tags that I am coming up with: …
cc @GaelVaroquaux @ogrisel @mblondel @jnothman @mfeurer @rhiever