[WIP] Use dataset factories for estimator checks #17544
…into dataset_factory # Conflicts: # sklearn/utils/estimator_checks.py
Thanks for working on this, @rtavenar! Yeah, I was thinking of something along these lines. Maybe:
def default_dataset_factory(n_samples=50, kind='random', **tags):
    ...
and then used as
dataset_factory(kind='random', **estimator.get_tags())
One thing is that we would need something like a kind parameter, because I suspect we would still need different types of datasets, e.g. even for dense data. WDYT?
BTW, there might be some intersection with generating datasets for benchmarks in #17026, cc @jeremiedbb. Edit: pinging potential reviewers just to get some general feedback before implementing this further.
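A minimal sketch of what such a factory could look like, using only the signature proposed above. Everything in the body is an assumption made for illustration: the "constant" kind, the fixed number of features, and the way the requires_positive_X and binary_only tags are interpreted are not taken from this PR. Plain lists are used instead of NumPy arrays only to keep the sketch dependency-free.

```python
import random


def default_dataset_factory(n_samples=50, kind="random", **tags):
    """Return a toy (X, y) pair; a hypothetical sketch, not scikit-learn API."""
    rng = random.Random(0)  # fixed seed so repeated checks see the same data
    n_features = 3  # arbitrary choice for the sketch

    if kind == "random":
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_samples)]
    elif kind == "constant":  # assumed extra kind, for illustration only
        X = [[1.0] * n_features for _ in range(n_samples)]
    else:
        raise ValueError(f"Unknown dataset kind: {kind!r}")

    # Assumed handling: estimators tagged as needing non-negative input
    # get the absolute value of every feature.
    if tags.get("requires_positive_X", False):
        X = [[abs(value) for value in row] for row in X]

    # Assumed handling: binary-only estimators get two classes, others three.
    n_classes = 2 if tags.get("binary_only", False) else 3
    y = [i % n_classes for i in range(n_samples)]
    return X, y


# Tags are forwarded as keyword arguments; unknown tags are simply ignored.
tags = {"requires_positive_X": True, "binary_only": True, "allow_nan": False}
X, y = default_dataset_factory(n_samples=10, kind="random", **tags)
```

Accepting the tags as **tags means the factory can be called with the full tag dictionary of an estimator and only react to the entries it knows about.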
Yes, you're right that tags already contain a lot of useful information. I will use those when possible.
Reference Issues/PRs
This is a follow-up of #14057
What does this implement/fix? Explain your changes.
The idea here is to provide a dataset factory to the estimator checks. This would allow downstream packages to define their own dataset factory to check their estimators on the specific kind of data they expect as input.
Any other comments?
This first commit is a small step towards implementing @rth's suggestion, just to make sure that I did not misunderstand it.
The idea would be to use the dataset factory for all check_* functions (or at least for as many functions as possible).
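To illustrate how check_* functions might consume a factory, here is a rough sketch under stated assumptions. None of the names below exist in scikit-learn: MajorityClassifier, series_dataset_factory, the two check functions and run_checks are hypothetical, and the time-series-shaped data only stands in for "the specific kind of data" a downstream package might expect.

```python
import random


class MajorityClassifier:
    """Toy stand-in for a downstream estimator (hypothetical)."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        # Remember the most frequent label seen during fit.
        self.majority_ = max(self.classes_, key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def series_dataset_factory(n_samples=20, kind="random", **tags):
    """Hypothetical downstream factory: each sample is a short time series."""
    rng = random.Random(0)
    # Nested lists shaped (n_samples, 8 timestamps, 1 dimension).
    X = [[[rng.random()] for _ in range(8)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def check_fit_returns_self(estimator, dataset_factory):
    X, y = dataset_factory(kind="random")
    assert estimator.fit(X, y) is estimator


def check_predict_length(estimator, dataset_factory):
    X, y = dataset_factory(kind="random")
    y_pred = estimator.fit(X, y).predict(X)
    assert len(y_pred) == len(X)


def run_checks(estimator, dataset_factory):
    """Run every check with the same factory and report which ones passed."""
    passed = []
    for check in (check_fit_returns_self, check_predict_length):
        check(estimator, dataset_factory)
        passed.append(check.__name__)
    return passed


passed = run_checks(MajorityClassifier(), series_dataset_factory)
```

Because each check only asks the factory for data, swapping in a different factory changes the input format without touching the checks themselves.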