RFC Dataset API · Issue #13123 · scikit-learn/scikit-learn · GitHub

RFC Dataset API #13123

Closed

adrinjalali opened this issue Feb 8, 2019 · 7 comments

@adrinjalali
Member

In a few issues now we've been talking about supporting pandas.DataFrames either as input or output. I think it's worth taking a step back and thinking about why we'd like to do that.

First and foremost, it's about the metadata. In #10733 we have:

Users should be able to get datasets returned as [Sparse]DataFrames with named columns in this day and age

The way I see it, it makes sense to have some [meta]data attached to the columns [and rows], and in general about the dataset itself. Now it's true that we return a Bunch object with some description and possibly other info about the data in some of the load_... functions, but we don't really make use of that information in the rest of the library.

On the other hand, whatever we use other than a numpy.array would be chosen because it gives us some extra functionality. For instance we see in #11818:

we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.

That, however, raises a few questions. For instance, would we then use that index in any way? As a user, I guess if I see some methods in sklearn giving me the data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy and so on.

#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and uses, as a supporting argument, the fact that some other packages such as PyTorch have a dataset object. However, as @rth mentioned in #13120 (comment), one role of a dataset object is to load samples in batches when appropriate, and that's arguably the main reason for having a dataset in a library such as PyTorch.

On the other hand, there is also the issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues that it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus with which we're happy yet.

Another related issue/quirk is that we only support transformation of the input X, and hence there's a special TransformedTargetRegressor class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both X and y.
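For reference, the current special case looks roughly like this (standard TransformedTargetRegressor usage; the data here is random and only for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# today, transforming y is handled by a dedicated wrapper around the regressor,
# rather than by letting a Pipeline step transform both X and y
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log1p, inverse_func=np.expm1)
reg.fit(np.random.rand(20, 3), np.random.rand(20))
```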

Now assume we have a Dataset which would include the metadata of the features and samples, with some additional info that potentially would be used by some models. It would:

  • include feature metadata (including names)
  • internally keep the data as a pandas.DataFrame if necessary
  • include sample info (such as sample_weights)
  • be the input and output to/from transformers, hence allowing the transformation of the output y, and sample_weights if that's what the transformer is doing
  • allow transformers to attach extra information to the dataset

And clearly it would support input/output conversion from/to numpy arrays and probably pandas DataFrames.
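To make the idea a bit more concrete, here is a minimal sketch of what such a container could look like; everything below (the Dataset name, feature_meta, sample_props, extras) is made up purely for illustration and is not an actual API proposal:

```python
import numpy as np
import pandas as pd


class Dataset:
    """Hypothetical container bundling X, y and their metadata (illustration only)."""

    def __init__(self, X, y=None, *, feature_meta=None, sample_props=None):
        # keep the feature matrix as a DataFrame internally so column metadata survives
        self.X = X if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        self.y = None if y is None else np.asarray(y)
        self.feature_meta = feature_meta or {}   # e.g. {"age": {"unit": "years"}}
        self.sample_props = sample_props or {}   # e.g. {"sample_weight": np.ones(n)}
        self.extras = {}                         # transformers could attach info here

    def to_numpy(self):
        # plain ndarray output keeps the current user experience as the default
        return self.X.to_numpy()


# conversion from a plain array (or DataFrame) would be transparent
ds = Dataset(np.random.rand(5, 2), y=[0, 1, 0, 1, 1],
             sample_props={"sample_weight": np.ones(5)})
print(ds.to_numpy().shape)  # (5, 2)
```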

It would then be easy to handle some use cases such as:

  • a transformer/model in the pipeline can put a pointer to itself in the dataset for a model down the line to potentially use (we had an issue about this which I cannot find now)
  • modifying the sample_weights in the pipeline
  • manipulating y in the pipeline

What I'm trying to argue here is that a DataFrame does solve some of these issues, but it would probably involve quite a lot of changes in the codebase, at which point we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.

@rth
Member
rth commented Feb 8, 2019

Thanks for a detailed summary!

I agree these are things that are worthwhile to consider. A partial comment (mostly paraphrasing some of Gael's slides) is that using standard data containers (ndarrays, DataFrames) is what allows different packages in the ecosystem to interact.

Currently, we are adding more pandas support, but there is still no full consensus as to how far this should go (pandas has quite a lot of advanced features). The only reason pandas.DataFrames were considered is that they are now a standard format for columnar data in Python. Even if we could write some custom object that could solve some of the above-mentioned issues, the fact that it would not be standard in the community is a very strong limitation.

Personally, as a user, I tend to somewhat resist any library that tries to push me to use its dataset wrappers (DL libraries excluded). For instance, xgboost.DMatrix probably addresses some issues -- as a user I have to admit I don't care too much, am too lazy to learn a new API, and will use ndarrays or DataFrames if I can. Maybe I am missing something in that particular case.

As I mentioned in #13120, I think OpenML would also be a good community with which to think about dataset representation.

as [Sparse]DataFrames with named columns

Side comment: I would avoid the term SparseDataFrames. These are not sparse in the sense usually used in scikit-learn (i.e. CSR / CSC arrays); from what I understood they are mostly just compressed dataframes (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), and they were not usable for applications where we use sparse arrays (e.g. text processing) last time I tried. Xarray probably has a better shot at getting sparse labeled arrays one day, but it's not there yet: pydata/xarray#1375
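For illustration, the distinction looks roughly like this (assuming pandas >= 0.25 with the .sparse accessor; older versions used the since-deprecated SparseDataFrame class):

```python
import pandas as pd
from scipy import sparse

# what scikit-learn means by "sparse": a scipy CSR/CSC matrix
X_csr = sparse.random(1000, 50, density=0.01, format="csr")

# pandas "sparse" columns: per-column compressed storage with named columns,
# not a CSR/CSC matrix that e.g. text vectorizers or linear solvers consume directly
df = pd.DataFrame.sparse.from_spmatrix(X_csr)
print(df.dtypes.iloc[0])   # a SparseDtype, stored column-wise

# going back to a scipy matrix requires an explicit conversion
X_back = df.sparse.to_coo().tocsr()
```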

@daniel-cortez-stevenson

For instance, xgboost.DMatrix probably addresses some issues -- as a user I have to admit I don't care too much, am too lazy to learn a new API and will use ndarray or DataFrames if I can.

Oh, yea absolutely. The xgboost.DMatrix example really hits home for me. Familiarity is essential if a Dataset API is going to be more than an "advanced" usage kind of thing.

A naive proposal:
Subclass np.ndarray as sklearn.skarray ("scikit-learn array") and extend it with the metadata attributes and constructors suggested by @adrinjalali -- and maybe do something fancy to avoid potential namespace collisions from modification by successive estimators/transformers or from changes to numpy.

On the surface it would just act as an np.ndarray, limiting the impact on usability and minimizing the impact on sklearn as a whole. Transformers/Estimators would do "something" given an skarray as the X param (but would error if y is also given). The skarray would know where X ends and y starts and return either via a simple skarray.X or skarray.y.
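Numpy does document a subclassing pattern that could support something like this. A rough sketch, where all attribute names are made up for illustration and the many pitfalls of ndarray subclasses are glossed over:

```python
import numpy as np


class SKArray(np.ndarray):
    """Rough sketch of the proposed 'skarray': an ndarray that carries metadata."""

    def __new__(cls, data, feature_names=None, n_targets=0):
        obj = np.asarray(data).view(cls)
        obj.feature_names = feature_names
        obj.n_targets = n_targets        # how many trailing columns are y
        return obj

    def __array_finalize__(self, obj):
        # called on views/slices so the metadata survives numpy operations
        if obj is None:
            return
        self.feature_names = getattr(obj, "feature_names", None)
        self.n_targets = getattr(obj, "n_targets", 0)

    @property
    def X(self):
        return np.asarray(self[:, : self.shape[1] - self.n_targets])

    @property
    def y(self):
        return np.asarray(self[:, self.shape[1] - self.n_targets :]).ravel()


arr = SKArray(np.c_[np.random.rand(5, 2), [0, 1, 0, 1, 1]],
              feature_names=["a", "b"], n_targets=1)
print(arr.X.shape, arr.y)   # (5, 2) [0. 1. 0. 1. 1.]
```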

We could test if this is a terrible idea or not by working through the use case in #9566

Sidenote 1: I have a (maybe personal) problem of keeping track of class names after transforming a dataset with LabelEncoder then OneHotEncoder. Like 'good', 'bad', 'worse' would become 1, 2, 3 and then be found in the output array at columns ?, ??, ???
A Datasets API could assist here, I think.
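For what it's worth, the fitted encoder already exposes the mapping via its categories_ attribute; the pain point is that the output array itself carries nothing downstream:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

quality = np.array([["good"], ["bad"], ["worse"], ["good"]])

enc = OneHotEncoder()
X = enc.fit_transform(quality).toarray()

# the column order of X is recoverable from the fitted encoder...
print(enc.categories_)   # [array(['bad', 'good', 'worse'], dtype='<U5')]
print(X[0])              # [0. 1. 0.] -> the 'good' column
# ...but the output array has no column names attached, which is what a
# named-column container / Dataset would address
```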

Sidenote 2: Datasets were discussed a bit in #4497.

@adrinjalali
Member Author

Oops, I wasn't quite clear with the proposal. I wasn't at all suggesting that we should "move" from the (X, y)-style API to a strictly dataset-based one. I was thinking more along the lines of what @daniel-cortez-stevenson also suggests, i.e.:

  • all API changes are backward compatible. We would either accept X as an array or a dataset, or accept X or a dataset as input (as two different parameters). Regardless of which one we go for, it would not change the experience of the user for any of the current usages.
  • internally, we'd convert the input(s) to a dataset and pass that along the pipeline. Still, the user would get the usual array as the output, unless they specify return_dataset=True or something, which is what the Pipeline would do (see the sketch after this list).
  • we could see if there's an acceptable common denominator among the dataset APIs out there and whether it could be a potential starting point for a library of its own. I'd say that would really benefit the community, but it's also kinda outside the scope of sklearn (maybe).
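Concretely, the opt-in could look something like this (purely hypothetical names, echoing the Dataset sketch above):

```python
# hypothetical, backward-compatible signature; Dataset, return_dataset and
# _transform_dataset are made-up names used only to illustrate the opt-in idea
def transform(self, X, return_dataset=False):
    ds = X if isinstance(X, Dataset) else Dataset(X)  # wrap plain arrays internally
    ds_out = self._transform_dataset(ds)              # the actual work happens on the Dataset
    # existing users keep getting a plain ndarray back by default
    return ds_out if return_dataset else ds_out.to_numpy()
```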

So it would indeed be more of an advanced usage if the user decides to use it. I completely agree that the learning curve for the users is an important factor in any design choice we make; what I'm suggesting is that the learning curve of this change may be gentler than that of whatever we may end up with without a dataset to handle sample property propagation, for instance.

@jnothman
Member
jnothman commented Feb 11, 2019 via email

@daniel-cortez-stevenson

How about directing users to use a numpy.recarray if they want to explicitly associate metadata with an array, and converting pandas.DataFrame inputs to numpy.recarray internally? Do y'all think this is a viable approach?
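For illustration, this is what the standard numpy/pandas machinery for that already looks like, along with the main catch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32], "height": [1.80, 1.65]})

# DataFrame -> record array with named fields
rec = df.to_records(index=False)
print(rec.dtype.names)   # ('age', 'height')
print(rec["height"])     # [1.8  1.65]

# the catch: a recarray is 1-D with a structured dtype, so estimators that
# expect a homogeneous 2-D float array need an explicit conversion first
X = np.column_stack([rec[name] for name in rec.dtype.names]).astype(float)
print(X.shape)           # (2, 2)
```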

@jnothman
Member
jnothman commented Feb 14, 2019 via email

@adrinjalali
Member Author

Closing as I don't think we're gonna do this.
