Redesign of datasets API from functional to OO #13120
Conversation
Why?
1. Conform more with the estimator OO API.
2. Standardize how datasets are returned.
3. Make adding new datasets easy!
4. Make adding functionality to datasets easier - see issues #10972 and #11818.
5. Integrate well with other modules/packages - create a proper `Dataset` class with methods.

Note: I do not think this implementation is complete or perfect, but I feel it's time to reach out to other developers about this project.

Work Done:
1. Introduced 3 abstract classes (sketched below):
   - `DatasetLoader` - single entrypoint method `load`; abstract method `_raw_data_to_bunch`
   - `LocalDatasetLoader` - abstract methods `read_data` and `process`
   - `SimpleCSVDatasetLoader`
2. Deprecated the `load_*` functions for datasets - replaced by the `load` method of `DatasetLoader`; the new API is exercised in their docstrings.
3. Made `load_data` more general and refactored it - now a method of `SimpleDatasetLoader`.
4. Added tests for the new API.
5. The API relies on passing a `Bunch` between methods.
6. `Bunch` objects now have standardized attributes - there was significant variation before.
7. Added typing for an sklearn Dataset - `Union[Bunch, Tuple[Any, Any]]`.

TODO:
1. Extend `DatasetLoader` for remote datasets - a `RemoteDatasetLoader`, for example.
2. Add an `as_frame` param to `DatasetLoader.load` - to address #10972 and #11818.
3. Improve the OO design (this is a first go).
4. Even more tests.
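A minimal sketch of how the three classes could fit together, using the names listed above; the method bodies and the `return_X_y` handling are my own illustration, not necessarily the PR's exact code:

```python
from abc import ABC, abstractmethod


class DatasetLoader(ABC):
    """Single public entrypoint: load()."""

    def load(self, return_X_y=False):
        bunch = self._raw_data_to_bunch()
        if return_X_y:
            return bunch.data, bunch.target
        return bunch

    @abstractmethod
    def _raw_data_to_bunch(self):
        """Convert raw data into a Bunch with standardized attributes."""


class LocalDatasetLoader(DatasetLoader):
    """Base class for datasets shipped with scikit-learn."""

    def _raw_data_to_bunch(self):
        return self.process(self.read_data())

    @abstractmethod
    def read_data(self):
        """Read the raw data from disk."""

    @abstractmethod
    def process(self, raw):
        """Turn raw data into a Bunch with standardized attributes."""


class SimpleCSVDatasetLoader(LocalDatasetLoader):
    """Concrete loader for the small CSV-backed datasets (wine, iris, ...)."""
    # read_data/process implementations omitted in this sketch
```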
To ensure upstream changes did not break contributed changes
Thanks for your contribution. For such global changes, it is probably better to open an issue first for a high-level discussion before starting to edit code. Here in particular, I'm not convinced that

```python
>>> from sklearn.datasets import Wine
>>> data = Wine().load()
```

is better than

```python
>>> from sklearn.datasets import load_wine
>>> data = load_wine()
```
The situation is a bit different though. In deep learning, you have to load your data in some way so that you can load it in batches, etc. In scikit-learn, the input data is directly a numpy array or pandas dataframe; there is no need for wrappers around those. The dataset API in scikit-learn is really only used for examples. For a more general way to represent datasets, you might be interested in contributing to OpenML.
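For context, here is the existing functional API this comment describes: the data goes straight into an estimator with no intermediate object.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

# The loader hands back plain numpy arrays that feed directly into an
# estimator; no wrapper object is involved.
X, y = load_wine(return_X_y=True)
clf = LogisticRegression(max_iter=10_000).fit(X, y)
```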
@rth Another potential benefit of having a dataset API might be an easier way to handle feature/sample attributes, which we've been having a hard time handling.
@rth At the API level, I agree the functional programming style is friendlier. The benefit would really come when the sklearn datasets API is extended, allowing for easier maintenance and ensuring similar behavior. The proposed pull request would not make sense if the datasets API is 'closed'.

@rth I also agree that the need for a Dataset API is more obvious in deep learning libraries. But, on a leap of faith here, I'm hoping that - through a thoughtful redesign - we may identify useful extensions of a dataset API for scikit-learn. For example, could extending datasets to act as generators meaningfully improve the memory or time performance of sklearn?

I'll take a look at OpenML, and open up a fresh issue to start a high-level conversation. I hope this PR serves to get some brain juices flowing.
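As a purely hypothetical illustration of the generator idea (neither `iter_batches` nor this usage is part of the PR), a loader could yield fixed-size chunks that feed an out-of-core estimator:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier


def iter_batches(X, y, batch_size=64):
    """Hypothetical helper: yield (X, y) chunks instead of the full arrays."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


X, y = load_wine(return_X_y=True)
clf = SGDClassifier()
for X_batch, y_batch in iter_batches(X, y):
    # partial_fit consumes one chunk at a time, so the full dataset never
    # needs to be held by the estimator at once.
    clf.partial_fit(X_batch, y_batch, classes=np.unique(y))
```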
@adrinjalali I was pretty surprised to see the variation in the Bunch attributes returned by the different base module functions. In the end, the Bunch attributes output by the API are now standardized.
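To make that variation concrete (the exact keys depend on the scikit-learn version, so treat these as illustrative):

```python
from sklearn.datasets import load_wine, load_diabetes

print(sorted(load_wine().keys()))
# e.g. ['DESCR', 'data', 'feature_names', 'target', 'target_names']
print(sorted(load_diabetes().keys()))
# e.g. ['DESCR', 'data', 'feature_names', 'target']  <- no 'target_names'
```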
Sure, I agree there is room for improvement. Let's continue this discussion in #13123 for now.
I think that the conversation in #13123 is out of scope for this PR. What this PR aims to do is refactor the `datasets` module. I'd like to go ahead with that refactoring.
Ah, and as a side note, I'm interested in refactoring in general after reading Robert Martin's 'Clean Code'. I'm hoping to flex my new refactoring muscles and contribute to a meaningful project at the same time. If you all could point me to another part of sklearn where refactoring could help, please let me know!
We can keep the dataset API private for now, to enable us to refactor the underlying dataset logic while keeping the `load_*` functions exposed as the public API.
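That could look roughly like the sketch below; `_WineLoader` is a hypothetical name, and `SimpleCSVDatasetLoader` refers to the hierarchy sketched earlier in this thread, not finalized code.

```python
class _WineLoader(SimpleCSVDatasetLoader):
    # Private: the OO machinery can be refactored freely between releases.
    ...


def load_wine(return_X_y=False):
    """Public API: signature and behavior stay exactly as documented today."""
    return _WineLoader().load(return_X_y=return_X_y)
```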
@thomasjpfan I'm for this. I'll revert the deprecation decorators and tests.

What about the `load_data` function at diff 207/211? It's not publicly used (it is not imported in `__init__.py`), but I felt the need to deprecate it because it did not have the '_' prefix and it has a docstring. With the next commits, I would propose removing this function and other module-level 'helper' functions, without deprecation. Their functionality would be taken over by class methods of the `DatasetLoader` hierarchy.

I will also expand this pull request to cover at least the California Housing dataset and the Covtype dataset, so that the remote-loading path is exercised as well.

Are there Labels/Milestones/Projects related to code quality which can be assigned here? Getting to work on this over the next few days!
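For reference, this is roughly what the reverted decorator usage looks like, using scikit-learn's own `deprecated` utility (the message text here is mine, not the PR's):

```python
from sklearn.utils import deprecated


@deprecated("load_data is being replaced by DatasetLoader methods and may "
            "be removed in a future release.")
def load_data(module_path, data_file_name):
    ...
```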
Frankly, our labels are not very helpful for this kind of labelling. Milestones aren't relevant, and projects are mostly unused.
Since this is for developers, having some code examples in the original post would help others understand why adding a new Datasets OO interface is a good idea.
Reference Issues/PRs
See also #10972, #11818, #10733.
What does this implement/fix? Explain your changes.
This replaces the functional `load_{dataset name}` functions in the datasets `base.py` module. It improves the current implementation by standardizing the attributes of the Bunch objects output by the API and adding more tests. My hope is to eventually integrate all datasets into a single API, so that integration with other sklearn modules and other packages can be enhanced by having an object like `Dataset` with helpful methods. This is inspired by the Dataset APIs found in common deep learning frameworks (like PyTorch).

Any other comments?
I really look forward to comments and code review. If you doubt the helpfulness of the current PR, you might be correct - but please look at it as an incremental improvement which can be built upon by the community to make interacting with data in sklearn predictable, and the addition of more data (we love data!) easy.
If you would like to collaborate on this PR, or (hopefully) future improvements on this PR, please let me know!