API split "model" parameters from "config" params. #17566


Open
adrinjalali opened this issue Jun 11, 2020 · 16 comments · May be fixed by #17929

@adrinjalali
Member

We can think of our estimators' parameters as two different types:

  • model hyper-parameters which influence the end result, accuracy, complexity, etc., e.g. SVM's nu or the number of trees in a random forest.
  • config parameters which control other aspects of an estimator such as verbosity, copy_X, return_xyz, etc.

We could split these two types of parameters and have model hyperparams as __init__ args and the rest as set_params or a new configure method. This would also allow us to introduce new configuration parameters to all estimators w/o having to add another arg to __init__. An example would be the routing parameter required for the sample props implementation, or potentially controlling the log level at the estimator level.

My proposal would be to have a config_params estimator tag which includes the names of those parameters, and have clone, get_params and set_params understand them, and potentially introduce a new configure method to set those parameters.

This would also mean partially relaxing the "set only __init__ params in __init__" requirement, since __init__ would be allowed to set those extra params.
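
To make the proposal concrete, here is a minimal sketch; the config_params tag name is hypothetical and the estimator is made up for illustration:

from sklearn.base import BaseEstimator, ClassifierMixin


class MyClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, C=1.0, verbose=0, copy_X=True):
        self.C = C              # model hyper-parameter
        self.verbose = verbose  # config parameter
        self.copy_X = copy_X    # config parameter

    def _more_tags(self):
        # Hypothetical tag listing the parameters that only control behaviour
        # (verbosity, copying, routing, ...) rather than the fitted model;
        # clone, get_params and set_params would be taught to understand it.
        return {"config_params": ["verbose", "copy_X"]}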

cc: @scikit-learn/core-devs

also related to: #17441 (comment)

@thomasjpfan
Member

I agree, having an estimator tag, config_params, to mark config parameters is useful.

We could split these two types of parameters and have model hyperparams as init args and the rest as set_params or a new configure method.

Can you provide a code snippet of what the configure call will look like?

@adrinjalali
Member Author

It's really the same as set_params, returning self, except that it raises on anything which is not in config_params. So the user would construct and configure the estimator with:

clf = SVC(kernel='rbf').configure(verbose=2, cache_size=300)

We could potentially have a decorator which sets the configure signature to accept the actual parameters, but that's probably too much magic.
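
A rough sketch of what that could look like, building on the hypothetical config_params tag above (configure is not an existing scikit-learn method):

class ConfigureMixin:
    def configure(self, **params):
        # Only accept names declared in the (hypothetical) config_params tag.
        allowed = set(self._get_tags().get("config_params", []))
        unknown = set(params) - allowed
        if unknown:
            raise ValueError(
                f"Invalid config parameters for {type(self).__name__}: {sorted(unknown)}"
            )
        # Delegate to the regular set_params machinery and return self so that
        # construction and configuration can be chained.
        return self.set_params(**params)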

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 12, 2020 via email

@ogrisel
Member
ogrisel commented Jun 12, 2020

I have the feeling that the distinction between config and hyper params will have a grey zone.

For instance, for the cache_size parameter of SVM, one might still be interested in using it in GridSearchCV to see the impact of this parameter on the fit duration, crossed with "regular" hyperparams such as C and gamma.

@NicolasHug
Member

It seems to me that this adds user-facing complexity, while its original motivation (#17441 (comment)) is really about solving an internal problem.

That said, we might have additional motivations beyond #17441 (comment) for having model/config params. Distinguishing the parameters that may impact a model's predictions from those that can't can be valuable to users who don't really know where to start.

Also we might want to consider merging this discussion with that of having automatic parameter search spaces?

@jeremiedbb
Member

I agree that having a way to distinguish between hyperparams and config params is valuable. I think we can have that without removing them from __init__, keeping the current simplicity of the API.

We could have a dict as a class attribute of all estimators which stores, for each param, what type and values it can take and whether it's a hyper param or not. This way we can easily:
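
For illustration, such a class attribute could look roughly like the sketch below; the _param_info name and its layout are made up, not a proposal for an exact format:

class LogisticRegressionLike:
    # Hypothetical per-parameter metadata: type, allowed values or range, and
    # whether the parameter is a hyper-parameter or a config parameter.
    _param_info = {
        "C": {"type": float, "range": (0.0, None), "hyperparam": True},
        "penalty": {"type": str, "values": ["l1", "l2", "elasticnet", "none"],
                    "hyperparam": True},
        "verbose": {"type": int, "range": (0, None), "hyperparam": False},
        "n_jobs": {"type": (int, type(None)), "hyperparam": False},
    }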

@rth
Member
rth commented Jun 12, 2020

which stores, for each param, what type and values it can take

Some of that can probably be done with type annotations, as well.
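
As a sketch of that idea (the "tuning"/"control"/"resources" markers are made up, not an existing convention), typing.Annotated could carry a category alongside the type:

from typing import Annotated, Optional, get_type_hints


class SVCLike:
    def __init__(
        self,
        C: Annotated[float, "tuning"] = 1.0,
        verbose: Annotated[int, "control"] = 0,
        n_jobs: Annotated[Optional[int], "resources"] = None,
    ):
        self.C = C
        self.verbose = verbose
        self.n_jobs = n_jobs


# The annotations remain introspectable at runtime:
hints = get_type_hints(SVCLike.__init__, include_extras=True)
# hints["C"].__metadata__ == ("tuning",)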

@jnothman
Member

Distinguishing the parameters that may impact a model's predictions from those that can't can be valuable to users who don't really know where to start.

This is a documentation issue, not an API issue. Now that we are using kwargs, I think it would be good to change numpydoc so that a project can have custom "* Parameters" sections. The "Parameters" section made sense when everything was positional.

@TomDLT TomDLT closed this as completed Jun 14, 2020
@TomDLT TomDLT reopened this Jun 14, 2020
@amueller
Member

I think it will be quite hard to agree on a definition. @thomasjpfan and I discussed this at some length because of the hyper-parameter specification project he's working on.

I think having better documentation and/or machine-readable versions while still keeping them in __init__ would be valuable, but I'm also skeptical that we can find a clean definition. n_jobs can potentially change the results, right?

@adrinjalali
Member Author

Another way to look at it is common config vs advanced config. We're increasingly adding functionality which would be easier to implement if we didn't have to choose between adding an __init__ param to every single estimator or putting it in the global sklearn configuration.

It would also make it easier for users to get started with the library, since they wouldn't be flooded with information about __init__ params right away.

It would also benefit users in that we'd implement those features faster, so they would have access to them sooner; it's not just about us saving the effort of adding them to every single __init__.

It would also give third-party developers more flexibility, since this is an area where many external estimators don't comply with our API. Having a way for them to comply would make the library more extensible.

As a start, we don't have to remove anything from our __init__s; we can simply have an API to add new ones w/o adding them as constructor parameters.

@mitar
Contributor
mitar commented Jul 2, 2020

Thank you for opening this issue. I can share a bit of our experience in the D3M program where we are working on AutoML tools, including use of sklearn for AutoML. I think distinguishing model and config parameters is useful, but as already noted in other comments, it is hard to draw a line and some parameters are both. To address all that, we allow parameters to belong to more than one category. Moreover, we in fact have multiple categories of parameters:

  • Control parameters (what you call config here). For example, "drop column with index X". You do not want to tune that. So: things which influence the logic of the program.
  • Tuning parameters (ones which influence the score, but not the logic).
  • Resource use parameters (n_jobs, for example).

We have a few more, but they are specific to our needs (like a metafeature parameter, which controls which metafeatures we compute while running the program).

Each parameter then has a list of categories provided by the author of the model. And I think this is an important point/realization here: it is OK if things are hard to define. This just tells what the author of the model thinks is the main category or categories this parameter will be used as. It does not have to be perfect. The user can ignore it, or determine their own interpretation. But it is also useful to capture the author's intention/assumption.

Which leads to my last point: personally, I would keep all of them inside __init__, but use other means to mark them with this metadata. This way it is not too critical if the categorization is wrong; you do not have to change the API. It is just metadata, and if it is wrong, you can fix it in the next version. But moving those parameters to another method call makes this impossible.

So I'm all for having this extra metadata, even if imperfect. But make it optional. Maybe it can just be rendered in the documentation as tags next to the parameter name.
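
As an illustration of that kind of optional metadata (the parameter names and category assignments below are made up, not D3M's actual schema), it could be as simple as a mapping that allows several categories per parameter:

# Purely informational metadata that docs or tooling can pick up; a parameter
# may belong to several categories, and users are free to ignore it.
PARAM_CATEGORIES = {
    "C": {"tuning"},
    "kernel": {"tuning"},
    "cache_size": {"tuning", "resources"},  # grey zone: mostly a resource knob, sometimes worth searching
    "verbose": {"control"},
    "n_jobs": {"resources"},
}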

@jnothman
Member
jnothman commented Jul 2, 2020

Really helpful input. I would still like them to be grouped in numpydoc sections if possible... Or maybe we just need to be able to display them with bootstrap-style labels as we have done with our change log. Maybe it's easy to just add such tags to each parameter docstring and then be able to extract such annotations... Or make it machine readable first.

@mitar
Contributor
mitar commented Jul 2, 2020

Hm, grouping is hard if a parameter belongs to multiple categories. :-) And yes, let's make it machine readable first.

@jnothman
Member
jnothman commented Jul 2, 2020

If you had sphinx-friendly shorthand in docstrings, like |control|, |tuning|, |resources|, it would not be hard to make it machine readable, while keeping the annotations together with the parameter listing.
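
A sketch of how that could work, with made-up |tuning| / |control| markers in a numpydoc-style fragment and a simple extraction pass:

import re

docstring_fragment = """
    C : float, default=1.0 |tuning|
        Regularization parameter.
    verbose : int, default=0 |control|
        Controls verbosity when fitting and predicting.
"""

# Pull (parameter name, label) pairs out of the parameter lines.
labels = dict(
    re.findall(r"^\s*(\w+)\s*:.*\|(\w+)\|", docstring_fragment, flags=re.MULTILINE)
)
# labels == {"C": "tuning", "verbose": "control"}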

@mitar
Contributor
mitar commented Jul 2, 2020

@jnothman For that, I like #17799, which does the opposite: it moves the metadata out and then updates docstrings from the metadata instead.

@jnothman
Copy link
Member
jnothman commented Jul 3, 2020 via email
