Redesign of datasets API from functional to OO #13120

daniel-cortez-stevenson · 2019-02-08T15:19:53Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This replaces the functional load_{datset name} functions in the datasets base.py module. It improves the current implementation by standardizing the attributes of Bunch objects output by the API and adding more tests. My hope is to integrate (eventually) all datasets into a single API so integration with other sklearn modules and other packages can be enhanced by having a object like Dataset with helpful methods. This is inspired by the Dataset APIs found in common frameworks for deep learning (like PyTorch, etc.).

Any other comments?

I really look forward to comments and code review. If you doubt the helpfulness of the current PR, you might be correct - but please look at as an incremental improvement which can be built upon by the community to make interacting with data in sklearn predictable and addition of more data (we love data!) easy.

If you would like to collaborate on this PR, or (hopefully) future improvements on this PR, please let me know!

Why? 1. Conform more with the estimator OO API 2. Standardize how datasets are returned 3. Make adding new datasets easy! 4. Add functionality to datasets easier - See issues #10972 and #11818 5. Integrate with other modules/packages well - create a proper `Dataset` class with methods Note: I do not think this implementation is complete or perfect, but I feel it's time to reach out to other developers about this project Work Done: 1. Introduced 3 abstract classes - `DatasetLoader` - single entrypoint method: `load` - abstract method: `_raw_data_to_bunch` - `LocalDatasetLoader` - abstract methods: `read_data` and `process` - `SimpleCSVDatasetLoader` 2. Deprecated `load_*` functions for datasets - replaced by `load` method of `DatasetLoader` - new API is executed in their docstrings 3. Made `load_data` more general and refactored - Now a method of `SimpleDatasetLoader` 4. Added tests for new API 5. API relies on passing `Bunch` between methods 6. Bunch objects have standardized attributes - there was significant variation before 7. Typing for a sklearn Dataset - `Union[Bunch, Tuple[Any, Any]]` TODO: 1. Extend `DatasetLoader` for Remote classes - `RemoteDatasetLoader`, for example 2. Add `as_frame` param to `DatsetLoader.load` - to address #10972 and #11818 3. Improve OO design (this is a first go) 4. Even more tests

To ensure upstream changes did not break contributed changes

rth · 2019-02-08T16:15:57Z

Thanks for your contribution. For such global changes, it it probably better to open an issue first for a high level discussion before starting to edit code.

Here in particular, I'm not convinced that,

    >>> from sklearn.datasets import Wine
    >>> data = Wine().load()

is better than,

    >>> from sklearn.datasets import load_wine
    >>> data = load_wine()

This is inspired by the Dataset APIs found in common frameworks for deep learning (like PyTorch, etc.).

The situation is a bit different though. In deep learning, you have to load your data in some ways so you can load it in batches etc. In scikit-learn the input data is directly a numpy array or pandas dataframe, there is no need for wrappers around those. The dataset API in scikit-learn is really only used for examples.

For more general way to represent datasets you might be interested in contributing to OpenML.

adrinjalali · 2019-02-08T16:24:21Z

@rth another potential benefit from having a dataset API might be an easier way to handle feature/sample attributes we've been having a hard time to handle.

daniel-cortez-stevenson · 2019-02-08T19:42:38Z

@rth At the API level I agree the functional programming style is friendlier. The benefit really would come when the sklearn datasets API is extended, allowing for easier maintenance and ensuring similar behavior. The proposed pull request would not make sense if the datasets API is 'closed'.

@rth I also agree that a the Datasets API in deep learning libraries is more obviously necessary. But, on a leap of faith here, I'm hoping that - through a thoughtful redesign - we may identify useful extensions of a Dataset class that are relevant to supporting reproducibility and performance of scikit-learn's more notable functionality (Estimators/Transformers).

For example, could extending datasets to act as generators increase memory or time performance of sklearn meaningfully?

I'll take a look at OpenML, and open up a fresh issue to start a high level conversation. I hope this PR serves to get some brain juices flowing.

daniel-cortez-stevenson · 2019-02-08T19:54:07Z

@adrinjalali I was pretty surprised to see the variation in the Bunch attributes being returned by the different base module functions. In the end, the Bunch class probably needs some attention to make it more targeted for moving datasets and their metadata, but I didn't want to get too crazy with this PR 😄

rth · 2019-02-08T20:28:13Z

Sure, I agree there is room for improvement. Let's continue this discussion in #13123 for now..

daniel-cortez-stevenson · 2019-02-16T14:39:12Z

I think that the conversation in #13123 is out of scope for this PR. What this PR aims to do is refactor the base.py datasets API load_### functions so that: 1) future extensions / changes (like those that may occur as a result of #13123) will be easier and safer; and 2) unnecessary heterogeneity and duplication is removed, enhancing maintainability. This is not an enhancement for sklearn users, but rather an enhancement for the maintainers.

I'd like to go ahead refactoring the fetch_### functions to RemoteDatasetLoader classes to demonstrate the advantage of this approach. Would this help the core devs? Or, do y'all have some feedback on this code as-is?

daniel-cortez-stevenson · 2019-02-16T14:47:27Z

Ah, and as a sidenote, I'm interested in refactoring in general after reading Robert Martin's 'Clean Code'. I'm hoping to flex my new refactoring muscles and contribute to a meaningful project at the same time.

If you all could point to another part of sklearn where refactoring could help, please let me know!

thomasjpfan · 2019-02-19T19:10:21Z

We can keep the dataset API private for now to enable us to refactor the underlying dataset logic and keep exposing the load_* api to users (without deprecation). This PR will strictly be to improve the code quality of sklearn.dataset.

daniel-cortez-stevenson · 2019-02-21T19:57:56Z

@thomasjpfan I'm for this. I'll revert the depreciation decorators and tests.

What about the 'load_data' function at diff 207/211? It's not publicly used (not imported in init.py) but I felt the need to deprecate it because it did not have the '_' prefix and it has a docstring. With the next commits, I would propose removing this function and other module-level 'helper' functions, without deprecation. Their functionality would be taken over by class methods of LocalDatasetLoader or `RemoteDatasetLoader. Is this likely to be OK for merge?

I will also expand this pull request to cover at least the California Housing dataset and Covtype dataset so that RemoteDatasetLoader is implemented and the level of abstraction for DatasetLoader can be more appropriately gauged. Does this seem reasonable?

Are there Labels/Milestones/Projects related to code quality, which can be assigned here?

Getting to work on this over the next few days!

jnothman · 2019-02-23T12:53:59Z

Frankly our labels are not very helpful for this kind of labelling. Milestones aren't relevant, and projects are mostly unused

thomasjpfan · 2019-03-01T08:55:34Z

Since this is for developers, having some code examples in the original post would help others understand why adding a new Datasets OO interface is a good idea.

daniel-cortez-stevenson added 2 commits February 8, 2019 16:08

Merge branch 'master' into contrib

d8d4ed5

To ensure upstream changes did not break contributed changes

daniel-cortez-stevenson mentioned this pull request Feb 8, 2019

Redesign of the sklearn.datasets API #13122

Closed

adrinjalali mentioned this pull request Feb 8, 2019

RFC Dataset API #13123

Closed

github-actions bot added the module:datasets label Mar 2, 2020

adrinjalali closed this Jan 22, 2021

adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Redesign of datasets API from functional to OO #13120

Redesign of datasets API from functional to OO #13120

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Redesign of datasets API from functional to OO #13120

Redesign of datasets API from functional to OO #13120

Uh oh!

Conversation

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!