RFC Dataset API #13123
Comments
Thanks for a detailed summary! I agree these are things that are worthwhile to consider. A partial comment (mostly paraphrasing some of Gael's slides) is that using standard data containers (ndarrays, DataFrames) is what allows different packages in the ecosystem to interact. Currently, we are adding more pandas support, but there is still no full consensus as to how far this should go (pandas has quite a lot of advanced features). The only reason `pandas.DataFrame`s were considered is that it is now a standard format for columnar data in Python. Even if we could write some custom object that solved some of the above-mentioned issues, the fact that it would not be a standard in the community is a very strong limitation. Personally, as a user, I tend to somewhat resist any library that tries to push me to use its own dataset wrappers (DL libraries excluded). For instance, as I mentioned in #13120, I think OpenML would also be a good community to think about dataset representation with.
Side comment: I would avoid the term SparseDataFrames. These are not sparse in the sense usually used in scikit-learn (i.e. CSR/CSC arrays); they are mostly just compressed DataFrames from what I understood (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html), and they were not usable for applications where we use sparse arrays (e.g. text processing) last time I tried. Xarray probably has a better shot at getting sparse labeled arrays one day, but it's not there yet: pydata/xarray#1375
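For readers unfamiliar with the distinction, a minimal sketch of the two notions of "sparse" (assuming a recent pandas with the `SparseDtype` extension type):

```python
# Contrast between scikit-learn's notion of sparse (scipy CSR/CSC) and
# pandas' sparse extension arrays, which are a per-column compressed format.
import numpy as np
import pandas as pd
from scipy import sparse

dense = np.array([[0.0, 1.0], [2.0, 0.0]])
X_csr = sparse.csr_matrix(dense)  # what scikit-learn estimators accept
df = pd.DataFrame(dense).astype(pd.SparseDtype(float, fill_value=0.0))
# df is not a scipy matrix; most sklearn code paths would densify it.
```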
Oh, yeah, absolutely.

A naive proposal: on the surface it would just act as a … We could test if this is a terrible idea or not by working through the use case in #9566.

Sidenote 1: I have a (maybe personal) problem of keeping track of class names after transforming a dataset.

Sidenote 2: Datasets were discussed a bit in #4497.
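If that sidenote refers to the usual problem of names being dropped once data passes through a transformer, a minimal illustration (real scikit-learn API):

```python
# Transformers return bare ndarrays, so feature and class names vanish and
# must be tracked by hand (or recovered from the fitted encoder).
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0], "height": [180.0, 170.0]})
y = ["cat", "dog"]

Xt = StandardScaler().fit_transform(X)  # plain ndarray: column names gone
le = LabelEncoder()
yt = le.fit_transform(y)                # plain ndarray: class names gone...
print(le.classes_)                      # ...recoverable only via the encoder
```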
Oops, I wasn't quite clear with the proposal. I wasn't at all suggesting that we move from the (X, y) style API to a strictly dataset-based one. I was thinking more along the lines of what @daniel-cortez-stevenson also suggests, i.e.:

So it would indeed be more of an advanced usage, if the user decides to opt in. I completely agree that the learning curve for users is an important factor in any design choice we make; what I'm suggesting is that the learning curve of this change may be gentler than that of whatever we may end up with without a dataset to handle sample property propagation, for instance.
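To make the opt-in idea concrete, a purely hypothetical sketch; the `Dataset` wrapper in the comments below is invented for illustration and nothing like it exists in scikit-learn:

```python
# Hypothetical opt-in usage: the classic (X, y) API keeps working, and a
# Dataset wrapper would be an additional, advanced entry point.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(10, 2)
y = np.random.randint(0, 2, 10)

est = LogisticRegression()
est.fit(X, y)  # status quo, unchanged

# ds = Dataset(X, y, sample_weights=w)  # hypothetical advanced usage
# est.fit(ds)                           # estimator unpacks what it needs
```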
The distinction we would need to make here is in transforming Dataset into Dataset, where currently all Transformers output arrays or sparse matrices. This would allow us to transmit feature names/provenance. (I'm not entirely sure this use matches the conventional meaning of "dataset", btw.)

Also note that allowing DataFrame output from fetch_openml also aims to return heterogeneous dtypes, not merely to support metadata.
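For reference, the heterogeneous-dtype case looks like this once DataFrame output is available (the `as_frame` parameter shown here landed later, in scikit-learn 0.22; the call needs network access):

```python
# fetch_openml returning a DataFrame with mixed dtypes.
from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data
print(X.dtypes.value_counts())  # numeric and categorical columns side by side
```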
How about directing users to use a recarray?
Recarrays are difficult. They have a different shape.
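A quick demonstration of what "a different shape" means in practice (plain numpy, runnable as-is):

```python
# A structured/record array with n rows and k fields has shape (n,), not
# (n, k), so code expecting a 2-D feature matrix breaks on it.
import numpy as np

ra = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=[("a", "f8"), ("b", "f8")])
print(ra.shape)  # (2,) -- one dimension; fields are not columns
print(ra["a"])   # columns are accessed by field name instead
```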
Closing as I don't think we're gonna do this.
In a few issues now we've been talking about supporting `pandas.DataFrame`s either as an input or an output. I think it's worth taking a step back and thinking about why we'd like to do that.

First and foremost, it's about the metadata. In #10733 we have:
> The way I see it is that it makes sense to have some [meta]data attached to the columns [and rows], and in general about the dataset itself.

Now, it's true that we return a `Bunch` object with some description and possibly other info about the data in some of the `load_...` functions, but we don't really embrace that information in the rest of the library.
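Concretely, this is the current behaviour being described (real scikit-learn API):

```python
# The loaders already return a Bunch carrying metadata, but estimators and
# transformers ignore it entirely.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
print(data.feature_names)                       # metadata lives in the Bunch...
Xt = StandardScaler().fit_transform(data.data)  # ...and is gone after this
```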
On the other hand, whatever we use other than a `numpy.array`, it would be because it gives us some extra functionality. For instance, we see in #11818:

That, however, raises a few questions. For instance, would we then use that index in any way? As a user, I guess if I saw some methods in sklearn giving me the data in the form of a DataFrame, I'd expect the rest of sklearn to also understand DataFrames to a good extent. That may give the user the impression that we'd also understand a DataFrame with an index hierarchy, and so on.
#13120 by @daniel-cortez-stevenson is kind of an effort towards having a Dataset object, and uses the fact that some other packages, such as PyTorch, have a dataset object as a supporting argument. However, as @rth mentioned in #13120 (comment), one role of a dataset object is to load samples in batches when appropriate, and that's arguably the main reason for having a dataset in a library such as PyTorch.
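For contrast, a sketch of that batch-loading role in PyTorch (assumes `torch` is installed; this is PyTorch API, not anything proposed for scikit-learn):

```python
# A PyTorch Dataset exists chiefly so a DataLoader can serve batches.
import numpy as np
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, i):
        return self.X[i], self.y[i]

loader = DataLoader(ArrayDataset(np.random.rand(100, 2),
                                 np.random.randint(0, 2, 100)),
                    batch_size=32)
for xb, yb in loader:  # samples arrive in batches of 32
    pass
```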
On the other hand, we have the other issue of routing sample properties throughout the pipeline, and #9566 by @jnothman (which itself lists over 10 issues it would potentially solve) is an effort to solve it, but it's pretty challenging and we don't seem to have a solution/consensus we're happy with yet.
Another related issue/trick is that we only support transformation of the input `X`, and hence there's a special `TransformedTargetRegressor` class which handles the transformation of the target before applying the regressor, instead of allowing the transformation of both `X` and `y`.
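For concreteness, this is the existing special case (real scikit-learn API):

```python
# TransformedTargetRegressor transforms y before fitting and inverts the
# transformation at predict time; X would need a separate transformer/pipeline.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = np.exp(X.ravel() / 10.0)

reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
print(reg.predict(X[:2]))  # predictions are back on the original y scale
```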
Now assume we have a Dataset which would include the metadata of the features and samples, with some additional info that potentially would be used by some models. It would:

- hold the features, as a `pandas.DataFrame` if necessary
- carry a transformed `y`, and `sample_weights` if that's what the transformer is doing

And clearly it would support input/output conversion from numpy arrays and probably pandas DataFrames.
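A purely hypothetical sketch of such an object, just to make the idea concrete; nothing like this exists in scikit-learn, and all field names are invented for illustration:

```python
# Strawman Dataset container carrying data plus sample/feature metadata.
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class Dataset:
    X: np.ndarray
    y: Optional[np.ndarray] = None
    sample_weights: Optional[np.ndarray] = None
    feature_names: Optional[list] = None
    meta: dict = field(default_factory=dict)  # dataset-level metadata

    def to_frame(self):
        # conversion to pandas, as the bullet list above suggests
        import pandas as pd
        return pd.DataFrame(self.X, columns=self.feature_names)
```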
It would then be easy to handle some use cases such as:

- a transformer storing `self` in the dataset for a model down the line to potentially use it (we had an issue about this which I cannot find now)
- passing `sample_weights` in the pipeline
- transforming `y` in the pipeline

What I'm trying to argue here is that the DataFrame does solve some of the issues we have here, but it would probably involve quite a lot of changes in the codebase, at which point we could contemplate the idea of having a Dataset, which seems to have the capacity to solve quite a few more of the issues we're having.
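For reference, the status quo behind the last two bullets (real scikit-learn API): `Pipeline` can only forward `sample_weight` to an explicitly named step via prefixed fit params, and no step may alter `y`.

```python
# Weight routing today: only the "clf" step receives the weights, and there
# is no mechanism for a pipeline step to transform y.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(20, 2)
y = np.random.randint(0, 2, 20)
w = np.random.rand(20)

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y, clf__sample_weight=w)  # weights reach only the named step
```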