Description
In a few issues now we've been talking about supporting `pandas.DataFrame`s either as an input or an output. I think it's worth taking a step back and thinking about why we'd like to do that.
First and foremost, it's about the metadata. In #10733 we have:
> Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age
The way I see it, it makes sense to have some [meta]data attached to the columns [and rows], and in general to the dataset itself. Now it's true that we return a `Bunch` object with a description and possibly other info about the data in some of the `load_...` functions, but we don't really embrace that information in the rest of the library.
On the other hand, whatever we use other than a `numpy.array`, it would be because it gives us some extra functionality. For instance, we see in #11818:
> we would also set the DataFrame's index corresponding to the `is_row_identifier` attribute in OpenML.
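In pandas terms, that would amount to something like the following (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical OpenML-style data where "row_id" is flagged as
# is_row_identifier; it would become the DataFrame's index.
df = pd.DataFrame({"row_id": [101, 102, 103],
                   "sepal_length": [5.1, 4.9, 4.7]})
df = df.set_index("row_id")
```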
That, however, raises a few questions. For instance, would we then use that index in any way? As a user, if I see some methods in sklearn giving me the data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy, and so on.
#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and uses the fact that some other packages such as PyTorch have a dataset object as a supporting argument. However, as @rth mentioned in #13120 (comment), one role of a dataset object is to load samples in batches when it's appropriate and that's arguably the main reason for having a dataset in a library such as PyTorch.
On the other hand, we have the separate issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues that it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus with which we're happy yet.
Another related issue/trick is that we only support transformation of the input `X`, and hence there's a special `TransformedTargetRegressor` class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both `X` and `y`.
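To make the current workaround concrete: transforming `y` today means wrapping the regressor rather than adding a step to the pipeline. A minimal sketch using the existing API:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(X.ravel() / 10)  # target is easier to model on a log scale

# The target transformation lives on the wrapper, not in the pipeline:
# y is log-transformed before fitting, and predictions are exp-transformed back.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)
pred = model.predict(X)
```

Because the transformation is attached to the regressor, no other pipeline step can see or modify the transformed target.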
Now assume we have a Dataset which would include the metadata of the features and samples, with some additional info that potentially would be used by some models. It would:
- include feature metadata (including names)
- internally keep the data as a `pandas.DataFrame` if necessary
- include sample info (such as `sample_weights`)
- be the input and output to/from transformers, hence allowing the transformation of the output `y` and `sample_weights` if that's what the transformer is doing
- let transformers attach extra information to the dataset
And clearly it would support input/output conversion to/from numpy arrays and probably pandas DataFrames.
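As a purely illustrative sketch (every name here is made up, not a proposed API), such a container could look something like:

```python
import numpy as np

class Dataset:
    """Hypothetical container bundling X, y, sample properties and metadata.

    Illustrative only -- none of these names exist in scikit-learn.
    """

    def __init__(self, X, y=None, feature_names=None, sample_props=None, meta=None):
        self.X = np.asarray(X)
        self.y = None if y is None else np.asarray(y)
        self.feature_names = (feature_names
                              or [f"x{i}" for i in range(self.X.shape[1])])
        self.sample_props = dict(sample_props or {})  # e.g. {"sample_weight": ...}
        self.meta = dict(meta or {})  # free-form info attached by transformers

    def replace(self, **changes):
        """Return a copy with some fields replaced, so a transformer can
        output a new Dataset with a transformed X, y or sample_props."""
        fields = dict(X=self.X, y=self.y, feature_names=self.feature_names,
                      sample_props=self.sample_props, meta=self.meta)
        fields.update(changes)
        return Dataset(**fields)

ds = Dataset(np.zeros((3, 2)), y=np.ones(3),
             sample_props={"sample_weight": np.full(3, 0.5)})
transformed = ds.replace(y=ds.y * 2, meta={"target_scale": "doubled"})
```

A transformer returning a new `Dataset` (rather than mutating arrays in place) is what would let transformations of `y` and `sample_weights` flow through a pipeline.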
It would then be easy to handle some use cases such as:
- a transformer/model in the pipeline can put a pointer to its `self` in the dataset for a model down the line to potentially use it (we had an issue about this which I cannot find now)
- modifying the `sample_weights` in the pipeline
- manipulating `y` in the pipeline
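For instance, manipulating `y` and `sample_weights` inside a pipeline step is exactly what's impossible today, since `Pipeline` only routes `X` through `transform`. A dict-based toy sketch of what such a step could do (purely hypothetical, not an existing API):

```python
import numpy as np

def drop_outliers(dataset):
    """Hypothetical pipeline step: resamples X, y and sample_weight together,
    which a regular scikit-learn transformer cannot do today."""
    X, y = dataset["X"], dataset["y"]
    w = dataset.get("sample_weight", np.ones(len(y)))
    # Keep only samples whose target is within two standard deviations.
    mask = np.abs(y - y.mean()) <= 2 * y.std()
    return {"X": X[mask], "y": y[mask], "sample_weight": w[mask]}

rng = np.random.RandomState(0)
y = rng.normal(size=100)
y[0] = 100.0  # an obvious outlier
ds = {"X": rng.normal(size=(100, 2)), "y": y}
ds = drop_outliers(ds)
```

The point is that `X`, `y`, and the sample properties are filtered consistently in one place, which only works because they travel together.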
What I'm trying to argue here is that a DataFrame does solve some of the issues we have, but supporting it would probably involve quite a lot of changes in the codebase; at that point, we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.