[WIP] Use dataset factories for estimator checks by rtavenar · Pull Request #17544 · scikit-learn/scikit-learn · GitHub

[WIP] Use dataset factories for estimator checks #17544


Open: rtavenar wants to merge 5 commits into base: main


@rtavenar (Contributor) commented Jun 9, 2020

Reference Issues/PRs

This is a follow-up of #14057

What does this implement/fix? Explain your changes.

The idea here is to provide a dataset factory to the estimator checks. This would allow downstream packages to define their own dataset factory to check their estimators on the specific kind of data they expect as input.

Any other comments?

This first commit is a small step towards implementing @rth's suggestion; I am opening it early to make sure I have understood the suggestion correctly.
The idea would be to use the dataset factory for all check_* functions (or at least for as many functions as possible).
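
The idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `dense_dataset_factory`, `time_series_dataset_factory`, `MajorityClassifier` and `check_predict_shape` are hypothetical names, and the 3D time-series layout is just one example of a "specific kind of data" a downstream package might expect.

```python
import numpy as np


def dense_dataset_factory(n_samples=50, random_state=0):
    """Hypothetical default factory: dense 2D features, binary labels."""
    rng = np.random.default_rng(random_state)
    X = rng.normal(size=(n_samples, 4))
    y = rng.integers(0, 2, size=n_samples)
    return X, y


def time_series_dataset_factory(n_samples=50, random_state=0):
    """Hypothetical downstream factory: X is (n_samples, n_timestamps, n_dims)."""
    rng = np.random.default_rng(random_state)
    X = rng.normal(size=(n_samples, 10, 2))
    y = rng.integers(0, 2, size=n_samples)
    return X, y


class MajorityClassifier:
    """Toy estimator that ignores X, so it accepts input of any dimensionality."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


def check_predict_shape(estimator, dataset_factory=dense_dataset_factory):
    """Illustrative check: the data comes from the factory, not from the check."""
    X, y = dataset_factory()
    y_pred = estimator.fit(X, y).predict(X)
    assert y_pred.shape == (X.shape[0],)


# The same check runs on the default data and on downstream-specific data.
check_predict_shape(MajorityClassifier())
check_predict_shape(MajorityClassifier(), dataset_factory=time_series_dataset_factory)
```

The point of the design is that the check body never builds its own data, so a downstream package would only need to swap the factory to reuse the check.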

…into dataset_factory

# Conflicts:
#	sklearn/utils/estimator_checks.py
@rth (Member) left a comment


Thanks for working on this @rtavenar! Yeah, I was thinking of something along these lines. Maybe:

def default_dataset_factory(n_samples=50, kind='random', **tags):
    ...

and then used as

dataset_factory(kind='random', **estimator.get_tags())

One thing: we would need something like a kind parameter, because I suspect we would still need different types of datasets, e.g. even for dense data. WDYT?
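
For illustration only, a factory with this signature might look like the sketch below. The tag names `requires_positive_X` and `binary_only` mirror existing scikit-learn estimator tags, but how a factory would consume them is an assumption here, as are the `random_state` and `n_features` parameters and the `"blobs"` kind; only `kind='random'` appears in the suggestion above. Tags are passed as a plain dict so the example does not depend on any estimator.

```python
import numpy as np


def default_dataset_factory(n_samples=50, kind="random", random_state=0,
                            n_features=5, **tags):
    """Return (X, y); `kind` selects the dataset, `tags` adapt it to the estimator."""
    rng = np.random.default_rng(random_state)

    # Assumed use of the `binary_only` tag: restrict the number of classes.
    n_classes = 2 if tags.get("binary_only", False) else 3
    y = np.arange(n_samples) % n_classes  # cycle labels so every class appears

    if kind == "random":
        X = rng.normal(size=(n_samples, n_features))
    elif kind == "blobs":
        # Hypothetical second kind: one shifted Gaussian cluster per class.
        X = rng.normal(size=(n_samples, n_features)) + 5.0 * y[:, np.newaxis]
    else:
        raise ValueError(f"Unknown dataset kind: {kind!r}")

    # Assumed use of the `requires_positive_X` tag: shift X so its minimum is 1.
    if tags.get("requires_positive_X", False):
        X = X - X.min() + 1.0

    return X, y


# Usage with a hand-written tag dict standing in for an estimator's tags.
tags = {"requires_positive_X": True, "binary_only": True}
X, y = default_dataset_factory(kind="random", **tags)
```

Tags the factory does not know about are simply ignored through `**tags`, which would let the full tag dict of an estimator be forwarded unchanged.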

@rth (Member) commented Jun 9, 2020

BTW, there might be some intersection with generating datasets for benchmarks in #17026 cc @jeremiedbb

Edit: pinging potential reviewers to get some general feedback before implementing this further.

@rth rth requested review from NicolasHug and glemaitre June 9, 2020 12:43
@rtavenar (Contributor, Author) commented Jun 9, 2020

Yes, you're right that tags already contain a lot of useful information, I will use those when possible.

Base automatically changed from master to main January 22, 2021 10:52
@glemaitre glemaitre removed their request for review December 16, 2021 21:45
2 participants