8000 Redesign of datasets API from functional to OO by daniel-cortez-stevenson · Pull Request #13120 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Redesign of datasets API from functional to OO #13120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Redesign of datasets API from functional to OO #13120

wants to merge 2 commits into from

Conversation

daniel-cortez-stevenson
Copy link

Reference Issues/PRs

See also #10972 #11818 #10733

What does this implement/fix? Explain your changes.

This replaces the functional load_{datset name} functions in the datasets base.py module. It improves the current implementation by standardizing the attributes of Bunch objects output by the API and adding more tests. My hope is to integrate (eventually) all datasets into a single API so integration with other sklearn modules and other packages can be enhanced by having a object like Dataset with helpful methods. This is inspired by the Dataset APIs found in common frameworks for deep learning (like PyTorch, etc.).

Any other comments?

I really look forward to comments and code review. If you doubt the helpfulness of the current PR, you might be correct - but please look at as an incremental improvement which can be built upon by the community to make interacting with data in sklearn predictable and addition of more data (we love data!) easy.

If you would like to collaborate on this PR, or (hopefully) future improvements on this PR, please let me know!

Why?
1. Conform more with the estimator OO API
2. Standardize how datasets are returned
3. Make adding new datasets easy!
4. Add functionality to datasets easier
  - See issues #10972 and #11818
5. Integrate with other modules/packages well
  - create a proper `Dataset` class with methods

Note: I do not think this implementation is
complete or perfect, but I feel it's time
to reach out to other developers about this project

Work Done:
1. Introduced 3 abstract classes
  - `DatasetLoader`
    - single entrypoint method: `load`
    - abstract method: `_raw_data_to_bunch`
  - `LocalDatasetLoader`
    - abstract methods: `read_data` and `process`
  - `SimpleCSVDatasetLoader`
2. Deprecated `load_*` functions for datasets
  - replaced by `load` method of `DatasetLoader`
  - new API is executed in their docstrings
3. Made `load_data` more general and refactored
  - Now a method of `SimpleDatasetLoader`
4. Added tests for new API
5. API relies on passing `Bunch` between methods
6. Bunch objects have standardized attributes
  - there was significant variation before
7. Typing for a sklearn Dataset
  - `Union[Bunch, Tuple[Any, Any]]`

TODO:
1. Extend `DatasetLoader` for Remote classes
  - `RemoteDatasetLoader`, for example
2. Add `as_frame` param to `DatsetLoader.load`
  - to address #10972 and #11818
3. Improve OO design (this is a first go)
4. Even more tests
To ensure upstream changes did not break contributed changes
@rth
Copy link
Member
rth commented Feb 8, 2019

Thanks for your contribution. For such global changes, it it probably better to open an issue first for a high level discussion before starting to edit code.

Here in particular, I'm not convinced that,

    >>> from sklearn.datasets import Wine
    >>> data = Wine().load()

is better than,

    >>> from sklearn.datasets import load_wine
    >>> data = load_wine()

This is inspired by the Dataset APIs found in common frameworks for deep learning (like PyTorch, etc.).

The situation is a bit different though. In deep learning, you have to load your data in some ways so you can load it in batches etc. In scikit-learn the input data is directly a numpy array or pandas dataframe, there is no need for wrappers around those. The dataset API in scikit-learn is really only used for examples.

For more general way to represent datasets you might be interested in contributing to OpenML.

@adrinjalali
Copy link
Member

@rth another potential benefit from having a dataset API might be an easier way to handle feature/sample attributes we've been having a hard time to handle.

@daniel-cortez-stevenson
Copy link
Author

@rth At the API level I agree the functional programming style is friendlier. The benefit really would come when the sklearn datasets API is extended, allowing for easier maintenance and ensuring similar behavior. The proposed pull request would not make sense if the datasets API is 'closed'.

@rth I also agree that a the Datasets API in deep learning libraries is more obviously necessary. But, on a leap of faith here, I'm hoping that - through a thoughtful redesign - we may identify useful extensions of a Dataset class that are relevant to supporting reproducibility and performance of scikit-learn's more notable functionality (Estimators/Transformers).

For example, could extending datasets to act as generators increase memory or time performance of sklearn meaningfully?

I'll take a look at OpenML, and open up a fresh issue to start a high level conversation. I hope this PR serves to get some brain juices flowing.

@daniel-cortez-stevenson
Copy link
Author

@adrinjalali I was pretty surprised to see the variation in the Bunch attributes being returned by the different base module functions. In the end, the Bunch class probably needs some attention to make it more targeted for moving datasets and their metadata, but I didn't want to get too crazy with this PR 😄

@adrinjalali adrinjalali mentioned this pull request Feb 8, 2019
@rth
Copy link
Member
rth commented Feb 8, 2019

Sure, I agree there is room for improvement. Let's continue this discussion in #13123 for now..

@daniel-cortez-stevenson
Copy link
Author

I think that the conversation in #13123 is out of scope for this PR. What this PR aims to do is refactor the base.py datasets API load_### functions so that: 1) future extensions / changes (like those that may occur as a result of #13123) will be easier and safer; and 2) unnecessary heterogeneity and duplication is removed, enhancing maintainability. This is not an enhancement for sklearn users, but rather an enhancement for the maintainers.

I'd like to go ahead refactoring the fetch_### functions to RemoteDatasetLoader classes to demonstrate the advantage of this approach. Would this help the core devs? Or, do y'all have some feedback on this code as-is?

@daniel-cortez-stevenson
Copy link
Author

Ah, and as a sidenote, I'm interested in refactoring in general after reading Robert Martin's 'Clean Code'. I'm hoping to flex my new refactoring muscles and contribute to a meaningful project at the same time.

If you all could point to another part of sklearn where refactoring could help, please let me know!

@thomasjpfan
Copy link
Member

We can keep the dataset API private for now to enable us to refactor the underlying dataset logic and keep exposing the load_* api to users (without deprecation). This PR will strictly be to improve the code quality of sklearn.dataset.

@daniel-cortez-stevenson
Copy link
Author

@thomasjpfan I'm for this. I'll revert the depreciation decorators and tests.

What about the 'load_data' function at diff 207/211? It's not publicly used (not imported in init.py) but I felt the need to deprecate it because it did not have the '_' prefix and it has a docstring. With the next commits, I would propose removing this function and other module-level 'helper' functions, without deprecation. Their functionality would be taken over by class methods of LocalDatasetLoader or `RemoteDatasetLoader. Is this likely to be OK for merge?

I will also expand this pull request to cover at least the California Housing dataset and Covtype dataset so that RemoteDatasetLoader is implemented and the level of abstraction for DatasetLoader can be more appropriately gauged. Does this seem reasonable?

Are there Labels/Milestones/Projects related to code quality, which can be assigned here?

Getting to work on this over the next few days!

@jnothman
Copy link
Member

Frankly our labels are not very helpful for this kind of labelling. Milestones aren't relevant, and projects are mostly unused

@thomasjpfan
Copy link
Member

Since this is for developers, having some code examples in the original post would help others understand why adding a new Datasets OO interface is a good idea.

@adrinjalali adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0