FYI/RFC: MLR Pipeline infrastructure and fit.transform vs fit_transform · Issue #15553 · scikit-learn/scikit-learn
Open
amueller opened this issue Nov 6, 2019 · 5 comments
@amueller (Member) commented Nov 6, 2019

I recently talked with the authors of the MLR package for R. They just changed their pipeline infrastructure to something quite similar to ours, though they developed it independently.
See this for details: https://mlr3pipelines.mlr-org.com

I wanted to point out one particular aspect of their design.
We have been having issues with fit().transform() vs fit_transform, for example in stacking and resampling.
They completely avoid the issue by having fit produce a representation of the dataset. You could say they don't have fit, they only have fit_transform, though it's basically just called fit.
That makes it very obvious that there are two separate transformations implemented by each model: the one on the training set and the one on the test set, and no confusion can arise.
The method names are very different, so there is no expectation that the two would ever produce the same result.
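To make the two paths concrete, here is a sketch of a stacking-style transformer (hypothetical class name, not a scikit-learn API) where the train-time output (fit_transform) is cross-fitted while the test-time output (transform) uses the model fitted on all the training data, so the two methods legitimately return different things for the same X:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


class StackingFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer with two distinct transformation paths:
    fit_transform returns out-of-fold predictions (train-time path),
    transform returns predictions of the fully fitted model (test-time path)."""

    def __init__(self, estimator=None, cv=5):
        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y):
        est = self.estimator if self.estimator is not None else LogisticRegression()
        self.estimator_ = clone(est).fit(X, y)
        return self

    def transform(self, X):
        # Test-time path: use the model fitted on all training data.
        return self.estimator_.predict_proba(X)

    def fit_transform(self, X, y):
        # Train-time path: each row is transformed by a model that
        # never saw it during fitting (cross-fitted predictions).
        self.fit(X, y)
        est = self.estimator if self.estimator is not None else LogisticRegression()
        return cross_val_predict(clone(est), X, y, cv=self.cv,
                                 method="predict_proba")
```

On training data, `fit(X, y).transform(X)` and `fit_transform(X, y)` return different arrays by design, which is exactly the behavior that the shared `fit_transform` name in the current API makes confusing.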

I'm not sure this is a route we want to consider, but it would remove a lot of pain points we're currently seeing, and it seems like a cleaner design to me.

If we wanted to do something like that in sklearn, one option would be to stop returning self from fit, which might be too much of a break in the API, even for a 1.0 release.
The other option would be to come up with another verb and have fit_verb, which we kind of have for fit_resample. Though in principle it could be fit_transmogrify and do arbitrary things (like stacking).
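The fit_resample pattern mentioned above (used by imbalanced-learn) already follows the "fit returns a dataset" design: there is no separate transform step to confuse with it. A minimal sketch of such a verb (hypothetical class, illustrative only):

```python
import numpy as np


class RandomUnderSampler:
    """Hypothetical resampler following the fit_resample pattern:
    the single verb returns a modified (X, y) instead of self, so the
    train-time and test-time behaviors cannot be conflated."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        # Undersample every class down to the size of the rarest class.
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.where(y == c)[0], size=n_min, replace=False)
            for c in classes
        ])
        keep.sort()
        return X[keep], y[keep]
```

Because resampling only makes sense at train time, there is no transform counterpart at all, which is what makes the fit_verb family a natural home for train-only operations like stacking.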

I found it quite fascinating that months of our discussions would just resolve if we hadn't decided to have fit return self...

@amueller amueller added the API label Nov 6, 2019
@amueller amueller added this to the 1.0 milestone Nov 6, 2019
@glemaitre (Member) commented
Is there a bottleneck or use case for which transform would be wasteful and one would only want a fitted transformer? (Most probably not in a pipeline.)

@amueller (Member, Author) commented Nov 8, 2019

For example KNNImputer might be expensive.

@jnothman (Member) commented Nov 10, 2019 via email

@GaelVaroquaux (Member) commented
Actually, the KNN is a good example of something that you wouldn't want to apply on the train data: in a stacking context, KNN with k=1 would just memorize y, and hence not be useful.

More generally, the theoretically appealing properties of machine-learning estimates only hold on left-out data, hence one could argue that it would be better in general not to apply transform to the same data used for fit (using "cross-fit" procedures instead).
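The k=1 point can be checked directly: 1-NN applied to its own training data just returns the memorized labels, whereas cross-fitted predictions (here via scikit-learn's cross_val_predict) come from models that never saw the row being predicted:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# fit().predict() on the training data: 1-NN's nearest neighbor of each
# point is the point itself, so it reproduces y exactly (memorization).
train_pred = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(X)
assert (train_pred == y).all()

# Cross-fit: each row is predicted by a model fitted without that row,
# so the labels are no longer perfectly reproduced.
oof_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)
assert (oof_pred == y).mean() < 1.0
```

This is the same train/test asymmetry as in stacking: feeding the memorized train-time predictions into a downstream estimator would leak y.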

@adrinjalali (Member) commented
Moving to 2.0.

@adrinjalali adrinjalali modified the milestones: 1.0, 2.0 Aug 22, 2021
5 participants