RFC Dataset API · Issue #13123 · scikit-learn/scikit-learn · GitHub

RFC Dataset API #13123

Closed

adrinjalali opened this issue Feb 8, 2019 · 7 comments

@adrinjalali
Member

In a few issues now we've been talking about supporting pandas.DataFrames either as input or output. I think it's worth taking a step back and thinking about why we'd like to do that.

First and foremost, it's about the metadata. In #10733 we have:

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age

The way I see it, it makes sense to have some [meta]data attached to the columns [and rows], and in general about the dataset itself. Now it's true that we return a Bunch object with some description and possibly other info about the data in some of the load_... functions, but we don't really make use of that information in the rest of the library.

On the other hand, whatever we use other than a numpy.array would be chosen because it gives us some extra functionality. For instance we see in #11818:

we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.

That, however, raises a few questions. For instance, would we then use that index in any way? As a user, I guess if I see some methods in sklearn giving me the data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy and so on.

#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and uses, as a supporting argument, the fact that some other packages such as PyTorch have a dataset object. However, as @rth mentioned in #13120 (comment), one role of a dataset object is to load samples in batches when appropriate, and that's arguably the main reason for having a dataset in a library such as PyTorch.

On the other hand, there is also the issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues that it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus with which we're happy yet.

Another related issue/quirk is that we only support transformation of the input X, and hence there's a special TransformedTargetRegressor class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both X and y.
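For reference, the current special case looks roughly like this (standard TransformedTargetRegressor usage; the data here is random and only for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# today, transforming y is handled by a dedicated wrapper around the regressor,
# rather than by letting a Pipeline step transform both X and y
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log1p, inverse_func=np.expm1)
reg.fit(np.random.rand(20, 3), np.random.rand(20))
```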

Now assume we have a Dataset which would include the metadata of the features and samples, with some additional info that potentially would be used by some models. It would:

  • include feature metadata (including names)
  • internally keep the data as a pandas.DataFrame if necessary
  • include sample info (such as sample_weights)
  • be the input and output to/from transformers, hence allowing the transformation of the output y, and sample_weights if that's what the transformer is doing
  • allow transformers to attach extra information to the dataset

And clearly it would support input/output conversion from/to numpy arrays and probably pandas DataFrames.
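To make the idea a bit more concrete, here is a minimal sketch of what such a container could look like; everything below (the Dataset name, feature_meta, sample_props, extras) is made up purely for illustration and is not an actual API proposal:

```python
import numpy as np
import pandas as pd


class Dataset:
    """Hypothetical container bundling X, y and their metadata (illustration only)."""

    def __init__(self, X, y=None, *, feature_meta=None, sample_props=None):
        # keep the feature matrix as a DataFrame internally so column metadata survives
        self.X = X if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        self.y = None if y is None else np.asarray(y)
        self.feature_meta = feature_meta or {}   # e.g. {"age": {"unit": "years"}}
        self.sample_props = sample_props or {}   # e.g. {"sample_weight": np.ones(n)}
        self.extras = {}                         # transformers could attach info here

    def to_numpy(self):
        # plain ndarray output keeps the current user experience as the default
        return self.X.to_numpy()


# conversion from a plain array (or DataFrame) would be transparent
ds = Dataset(np.random.rand(5, 2), y=[0, 1, 0, 1, 1],
             sample_props={"sample_weight": np.ones(5)})
print(ds.to_numpy().shape)  # (5, 2)
```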

It would then be easy to handle some use cases such as:

  • a transformer/model in the pipeline can put a pointer to itself in the dataset for a model down the line to potentially use (we had an issue about this which I cannot find now)
  • modifying the sample_weights in the pipeline
  • manipulating y in the pipeline

What I'm trying to argue here is that a DataFrame does solve some of these issues, but it would probably involve quite a lot of changes in the codebase, at which point we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.

@rth
Member
rth commented Feb 8, 2019

Thanks for a detailed summary!

I agree these are things that are worthwhile to consider. A partial comment (mostly paraphrasing some of Gael's slides) is that using standard data containers (ndarrays, DataFrames) is what allows different packages in the ecosystem to interact.

Currently, we are adding more pandas support, but there is still no full consensus as to how far this should go (pandas has quite a lot of advanced features). The only reason pandas.DataFrames were considered is that they are now a standard format for columnar data in Python. Even if we could write some custom object that could solve some of the above-mentioned issues, the fact that it would not be standard in the community is a very strong limitation.

Personally, as a user, I tend to somewhat resist any library that tries to push me to use its dataset wrappers (DL libraries excluded). For instance, xgboost.DMatrix probably addresses some issues -- as a user I have to admit I don't care too much, am too lazy to learn a new API, and will use ndarrays or DataFrames if I can. Maybe I am missing something in that particular case.

As I mentioned in #13120, I think OpenML would also be a good community with which to think about dataset representation.

as [Sparse]DataFrames with named columns

Side comment: I would avoid the term SparseDataFrames. These are not sparse in the sense usually used in scikit-learn (i.e. CSR / CSC arrays); from what I understood they are mostly just compressed dataframes (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), and they were not usable for applications where we use sparse arrays (e.g. text processing) last time I tried. Xarray probably has a better shot at getting sparse labeled arrays one day, but it's not there yet: pydata/xarray#1375
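For illustration, the distinction looks roughly like this (assuming pandas >= 0.25 with the .sparse accessor; older versions used the since-deprecated SparseDataFrame class):

```python
import pandas as pd
from scipy import sparse

# what scikit-learn means by "sparse": a scipy CSR/CSC matrix
X_csr = sparse.random(1000, 50, density=0.01, format="csr")

# pandas "sparse" columns: per-column compressed storage with named columns,
# not a CSR/CSC matrix that e.g. text vectorizers or linear solvers consume directly
df = pd.DataFrame.sparse.from_spmatrix(X_csr)
print(df.dtypes.iloc[0])   # a SparseDtype, stored column-wise

# going back to a scipy matrix requires an explicit conversion
X_back = df.sparse.to_coo().tocsr()
```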

@daniel-cortez-stevenson

For instance, xgboost.DMatrix probably addresses some issues -- as a user I have to admit I don't care too much, am too lazy to learn a new API and will use ndarray or DataFrames if I can.

Oh, yea absolutely. The xgboost.DMatrix example really hits home for me. Familiarity is essential if a Dataset API is going to be more than an "advanced" usage kind of thing.

A naive proposal:
Subclass np.ndarray as sklearn.skarray ("scikit-learn array") and extend it with the metadata attributes and constructors suggested by @adrinjalali -- and maybe do something fancy to avoid potential namespace collisions from modification by successive estimators/transformers or from changes to numpy.

On the surface it would just act as an np.ndarray, limiting the impact on usability and minimizing the impact on sklearn as a whole. Transformers/Estimators would do "something" given an skarray as the X param (but would error if y is also given). The skarray would know where X ends and y starts and return either via a simple skarray.X or skarray.y.
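Numpy does document a subclassing pattern that could support something like this. A rough sketch, where all attribute names are made up for illustration and the many pitfalls of ndarray subclasses are glossed over:

```python
import numpy as np


class SKArray(np.ndarray):
    """Rough sketch of the proposed 'skarray': an ndarray that carries metadata."""

    def __new__(cls, data, feature_names=None, n_targets=0):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        obj.n_targets = n_targets        # how many trailing columns are y
        return obj

    def __array_finalize__(self, obj):
        # called on views/slices so the metadata survives numpy operations
        if obj is None:
            return
        self.feature_names = getattr(obj, "feature_names", None)
        self.n_targets = getattr(obj, "n_targets", 0)

    @property
    def X(self):
        return np.asarray(self[:, : self.shape[1] - self.n_targets])

    @property
    def y(self):
        return np.asarray(self[:, self.shape[1] - self.n_targets :]).ravel()


arr = SKArray(np.c_[np.random.rand(5, 2), [0, 1, 0, 1, 1]],
              feature_names=["a", "b"], n_targets=1)
print(arr.X.shape, arr.y)   # (5, 2) [0. 1. 0. 1. 1.]
```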

We could test if this is a terrible idea or not by working through the use case in #9566

Sidenote 1: I have a (maybe personal) problem of keeping track of class names after transforming a dataset with LabelEncoder then OneHotEncoder. Like 'good', 'bad', 'worse' would become 1, 2, 3 and then be found in the output array at columns ?, ??, ???
A Datasets API could assist here, I think.
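For what it's worth, the fitted encoder already exposes the mapping via its categories_ attribute; the pain point is that the output array itself carries nothing downstream:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

quality = np.array([["good"], ["bad"], ["worse"], ["good"]])

enc = OneHotEncoder()
X = enc.fit_transform(quality).toarray()

# the column order of X is recoverable from the fitted encoder...
print(enc.categories_)   # [array(['bad', 'good', 'worse'], dtype='<U5')]
print(X[0])              # [0. 1. 0.] -> the 'good' column
# ...but the output array has no column names attached, which is what a
# named-column container / Dataset would address
```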

Sidenote 2: Datasets were discussed a bit in #4497.

@adrinjalali
Member Author

Oops, I wasn't quite clear with the proposal. I wasn't at all suggesting that we should "move" from the (X, y)-style API to a strictly dataset-based one. I was thinking more along the lines of what @daniel-cortez-stevenson also suggests, i.e.:

  • all API changes are backward compatible. We would either accept X as an array or a dataset, or accept X or a dataset as input (as two different parameters). Regardless of which one we go for, it would not change the experience of the user for any of the current usages.
  • internally, we'd convert the input(s) to a dataset and pass that along the pipeline. Still, the user would get the usual array as the output, unless they specify return_dataset=True or something, which is what the Pipeline would do (see the sketch after this list).
  • we could see if there's an acceptable common denominator among the dataset APIs out there and whether it could be a potential starting point for a library of its own. I'd say that would really benefit the community, but it's also kinda outside the scope of sklearn (maybe).
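Concretely, the opt-in could look something like this (purely hypothetical names, echoing the Dataset sketch above):

```python
# hypothetical, backward-compatible signature; Dataset, return_dataset and
# _transform_dataset are made-up names used only to illustrate the opt-in idea
def transform(self, X, return_dataset=False):
    ds = X if isinstance(X, Dataset) else Dataset(X)  # wrap plain arrays internally
    ds_out = self._transform_dataset(ds)              # the actual work happens on the Dataset
    # existing users keep getting a plain ndarray back by default
    return ds_out if return_dataset else ds_out.to_numpy()
```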

So it would indeed be more of an advanced usage if the user decides to use it. I completely agree that the learning curve for the users is an important factor in any design choice we make; what I'm suggesting is that the learning curve of this change may be gentler than that of whatever we may end up with without a dataset to handle sample property propagation, for instance.

@jnothman
Member
jnothman commented Feb 11, 2019 via email

@daniel-cortez-stevenson

How about directing users to use a numpy.recarray if they want to explicitly associate metadata with an array, and converting pandas.DataFrame inputs to numpy.recarray internally? Do y'all think this is a viable approach?
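For illustration, this is what the standard numpy/pandas machinery for that already looks like, along with the main catch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32], "height": [1.80, 1.65]})

# DataFrame -> record array with named fields
rec = df.to_records(index=False)
print(rec.dtype.names)   # ('age', 'height')
print(rec["height"])     # [1.8  1.65]

# the catch: a recarray is 1-D with a structured dtype, so estimators that
# expect a homogeneous 2-D float array need an explicit conversion first
X = np.column_stack([rec[name] for name in rec.dtype.names]).astype(float)
print(X.shape)           # (2, 2)
```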

@jnothman
Member
jnothman commented Feb 14, 2019 via email

@adrinjalali
Member Author

Closing as I don't think we're gonna do this.
