WIP: transformers that modify y #2

GaelVaroquaux · 2015-10-22T13:50:29Z

This is almost ready for early discussion:

Best way to look at the proposal is:
https://github.com/GaelVaroquaux/enhancement_proposals/blob/transform_y/slep001/discussion.rst

Option A: meta estimators RST formatting Advance the discussion Restructured text layout RST formatting iter

amueller · 2015-10-22T19:39:19Z

I think you should state more clearly that this is for within a chain or processing sequence of estimators. Doing many of these things is possible "by hand". the point is that you don't want to write custom connecting logic.

amueller · 2015-10-22T19:41:31Z

slep001/discussion.rst

+   an almost case-by-case basis, and for the advanced user, that needs to
+   maintain a set of case-specific code
+
+#. The "estimator heap" problem. 


maybe "chain" instead of heap and stack?

"chain" is linear. But point taken, I am rephrasing a bit this paragraph.

The meta-estimators I had in mind are always linear, and the new objects are always linear. FeatureUnion is the only thing that can make it a DAG, right?

The difference between fit time and predict time in the case of things like the previously discussed undersampler effectively create a conditional DAG, depending on whether you are at fit or predict time

amueller · 2015-10-22T19:46:07Z

slep001/discussion.rst

+   difficulties are mostly in user code, so we don't see them too much in
+   scikit-learn. Here are concrete examples
+
+   #. Trying to retrieve coefficients from a models estimated in a


I think the point would be made stronger by a working example. You know there is a lasso in your stack and you want to get it's coef_ [in whatever space that resides?].

pipeline.named_steps['lasso'].coef_

is possible.
With a chain of meta estimators this is tricky.

I think the point would be made stronger by a working example. You know there is a lasso in your stack and you want to get it's coef_ [in whatever space that resides?].

Addressed.

amueller · 2015-10-22T20:22:31Z

Just a note: misses the part that @agramfort wanted to write, user story code.
@agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

It would probably be easy to come up with some outlier or resampling examples, but more "real live" ones would be cool.

amueller · 2015-10-22T20:24:39Z

overall looks great and I agree with the summary mostly. the "should this be a new class" discussion is touched upon only briefly, and the "should the number of return arguments vary" discussion, too.

amueller · 2015-10-22T20:25:18Z

ping @jnothman [no rush] @glouppe

GaelVaroquaux · 2015-10-22T22:26:06Z

I think you should state more clearly that this is for within a chain or processing sequence of estimators. Doing many of these things is possible "by hand". the point is that you don't want to write custom connecting logic.

Thanks. Addressed.

GaelVaroquaux · 2015-10-22T23:26:32Z

Just a note: misses the part that @agramfort wanted to write, user story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

ping @agramfort

GaelVaroquaux · 2015-10-22T23:40:27Z

I just realized that this proposal is adding a conceptual complexity (and thus burden for the user):

We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):

fit_pipe
transform_pipe
transform

The question is: at fit time, in a pipeline, what has the priority on what, if not only one of these methods is present. In my head it is pretty clear (it's in the order listed above). But it might be confusing for the user.

We need fit_pipe and transform_pipe to exist both (and be different) to tackle things such as outlier detection. However, in most cases, we probably do not need transform_pipe

code-of-kpp · 2015-10-23T01:44:20Z

Hello!

I used to deal with such things too.

My current approach (which should work with stable sklearn) is to add some ABCs and monkey patch Pipeline to use them.

ABC is abstract base class (it has hooks for isinstance calls)

>>> lab_enc = LabelEncoder()
...
>>> isinstance(lab_enc, ABC_Y_Transformer)
True

Pipeline is patched so that if estimator step passes the isinstance check with ABC_Y_Transformer, then fit and transform methods of step will be called with Y instead of X.

My current solution for such ABC: hardcode all sklearn (and my own) classes in ABC_Y_Transformer..__instancecheck__ so that isinstance(obj, ABC_Y_Transformer) call becames obj.__class__ in predefined_set_of_classes.

This way I can put label transformers and X transformers in a single pipeline.

The same logic can be applied to other cases, where X and Y should be modified together.

Instead of this hacky ABC, sklern can have two additional mixins - LabelTransformerMixin and XYTransformerMixin. And pipeline could be patched the same way as with ABC.

This way, modifications to sklearn code will be small, new version of sklearn will be (almost) compatible with previous ones and overall API changes are small.

Modifications needed: add mixins, add parent to current label transformers, make pipeline respect new mixins.

mblondel · 2015-10-23T02:54:11Z

Could you elaborate the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it's never used in a pipeline.

datnamer · 2015-10-25T18:08:56Z

As an end user my dream would be to just wrap a Blaze data source (like a sqlite db) and have it work in some sort of self- updating reactive out of core pipeline.

Odo and blaze have some chunking/streaming functionality that might work for this.

I've seen this sort of thing mentioned before by other users (can't recall the exact post, whether in a PR or the google group)... so it atleast should have a +1 :)

code-of-kpp · 2015-12-01T16:46:35Z

Is it dead?

amueller · 2015-12-02T03:25:04Z

no, it's unfinished ;)

kingjr · 2016-07-20T16:39:20Z

We're facing some difficulties in MNE-Python related to the present proposal and @GaelVaroquaux suggested that it may be useful to describe our usecases so as to motivate your decision. I hope this won't be too irrelevant.

1. Initial steps needs all samples, subsequent need a subset of samples

We often need to apply an estimator on a sliding window over X of shape (n_samples, n_features) where the samples are consecutive "time samples" and are locally correlated with one another. We're thus constructing a feature space by using adjacent samples. Consequently we're trying to fit a single coef vector of shape (n_time_samples * n_features) which predict a y vector (this is analogous to auto-regressive models). This is a non-conventional design in itself, but the relevant part wrt to the present issue comes next.

Specifically, to construct such rolling time window, we thus needs all samples. However, in the following steps, many samples happen to be detrimental to the fitting (e.g. typically, many samples have all their features = 0). We thus need to exclude these samples from the subsequent steps in the pipeline.

2. Semi supervision

It is not entirely clear to me how/whether the current proposal suggests to inform each step of the pipeline about whether each values of y was known, or has been modified by a previous step.

Specifically, one may need to do:

pipe = make_pipeline_xy(
      EstimatorFittedOnKnownYOnly(transform_values=np.nan),  # modified y==np.nan by prediction
      EstimatorFittedOnModifiedY(weight_y_known=.80),  # gives higher weights for known than modified y
)

3. Scoring

I'm unclear about what will be the 'scoring' consequences of your each alternative of the proposal.

So far, I've systematically written meta-estimators to handle these cases, but it's true that they tend to become a bit nested, and difficult to handle.

glemaitre · 2016-10-22T21:31:40Z

During the development of the imbalanced-learn, we actually confront the issue since we need to modify both X and y.

After discussing briefly with @GaelVaroquaux, he redirected us to the right discussion thread.
Therefore, we thought that it could be a good idea to present succinctly the choice of API that we did.
Subsequently, our aim is threefold: (i) we would like to be sure that our API will follow the choice of the scikit-learn community when such transformer will get in the toolbox; (ii) we would like to transfer all implemented algorithms which meet the inclusion criteria of scitkit-learn; and (iii) we would like to continue implementing "cutting-edge" methods which would be complementary to scikit-learn.

1. Base class

1.1 `SamplerMixin`

We decided to create a new Mixin estimator called SamplerMixin since that all the method aimed at resampling X and y. The estimator is made of 3 methods:

fit: we use this method for checking and computing statistics which are needed at sample time. In fact, all these processes could be moved in sample for the current algorithm, which come back to the discussion at 2.2.2.1.
sample: we use this method to perform the resampling.
fit_sample: fit followed by sample.

1.2 `BaseBinarySampler` vs. `BaseMulticlassSampler`

Since that some of the algorithms are not designed to work with multiple classes, we created 2 base classes inheriting from SamplerMixin to check for the target type and potentially raise warnings. This is permissive for the moment to be compatible with the scikit-learn base estimator.

2. Pipeline object

We modified the scikit-learn Pipeline to take into account any sampler in a pipeline.
The sampling is only performed at fit time and not at predict time. It appears that this is what one would actually want to do while correcting balancing problem; change the balancing at training but test on an imbalanced dataset.
Since TransformerMixin and SamplerMixin are distincts, we didn't got into trouble with the point 2.2.2.2-2 but at the cost of a new base class.

I hope that I don't forget anything. @chkoar @dvro correct me if this is the case.

dvro · 2016-10-22T21:57:11Z

@glemaitre I'd also add that imbalanced-learn implementation enables us to work with different approaches. Given a set of samples S = (X, y):

UndersamplingPrototype Selection: returns a new set of samples S', S' in S.
Undersampling/Prototype Generation: returns a new set of samples S', |S'| < |S|.
Oversampling/Selection: returns a new set of samples S', |S'| > |S|.
Oversampling/Generation: returns a new set of samples S', S in S'.
Relabeling: returns the same a relabeled set of samples S'.

As you can see, some of these transformations could be implemented by returning a list of index, so that, X, y = X[idx,:], y[idx]. However, other transformations actually require returning a new X and y (Prototype Generation and Relabeling). Therefore, (the way I see it) a fit_filter is would not work in this second case.

That being said, I really like @GaelVaroquaux fit_modify implementation, that also works in these 5 cases. Let us know how we can help get this released.

jnothman · 2017-02-20T06:34:22Z

I think we have a few distinct cases that should be handled separately (if at all).

Data reduction

Resampling and outlier removal at fit time can be done with CV splitters (or by transforming sample_weight, which is another question...). They're not a particularly interesting case. Other data reduction, e.g. via BIRCH, is interesting, and yet it makes sense for all these cases to share an interface and be supported by Pipeline.

I think for all of these cases imbalanced-learn has got it more-or-less right. Sampler objects should not do anything at predict/transform time, only at fit time. From the perspective of Pipeline, only fit_sample matters. @glemaitre, do you have good use-cases for having a separate sample method? I can imagine its use in out-of-core learning where we want to maintain outlier statistics for a first batch, but again pipeline does not support this case.

Until we find a compelling use-case, I think fit_sample and [fit_]transform on the same object should be disallowed by Pipeline. I am inclined to implement the imbalanced-learn API.

[Edit 2 days later: I discovered I proposed much the same at https://gist.github.com/jnothman/274710f945e311697466]

Target transformation

Transforming targets, as in "scaling y and scaling back after prediction" mentioned above, seems a better fit for a meta-estimator, as it:

bookends a particular process (applies before and after)
is just as likely to be applied around a single transformer or predictor as around a pipeline subsequence
does not seem appropriate to append the reverse transformation to a pipeline as this changes the special semantics of the last step of a pipeline.

I think we should have a ?TransformedTargetPredictor:

class TransformedTargetPredictor(BaseEstimator):
    def __init__(self, estimator, func, inverse_func=None):
        ...
    def fit(self, X, y=None, **kwargs):
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, func(y))
        return self
    def predict(self, X):
        y = self.estimator_.predict(X)
        if self.inverse_func is None:
            return y
        return self.inverse_func(y)

This would at least establish a best practice for such things, allowing people to build similar transformations (for what use case, I don't know) dependent on X, y.

Data conversion

This is the challenging case, as meta-estimators do not feel appropriate. However, it seems that we're breaking the intuition of our pipeline/scoring design if we allow this kind of transformation within a Pipeline. The contract that the output of a Pipeline's predict should look like the y that is passed into its fit seems an essential one to me. I'm therefore not convinced that it deserves sharing the same API as resampling. Hmm.

glemaitre · 2017-02-22T16:41:19Z

@glemaitre, do you have good use-cases for having a separate sample method?

I would say no. A fit_sample seems more appropriate as you suggested.

jnothman · 2017-02-22T23:55:12Z

Btw, an implementation and example of using a meta-CV splitter for resampling is at https://gist.github.com/jnothman/6bddbbcca71bdf9fd37e8495d70b42e8. Again, the main problem is its not supporting other training modification such as data reduction or insertion of label noise.

jnothman · 2017-02-23T09:20:50Z

FWIW, https://github.com/dvro/scikit-protopy/blob/master/protopy has an equivalent of "fit_[re]sample" called "reduce_data"

chkoar · 2017-02-23T09:36:38Z

@jnothman yes. @dvro is the author of the scikit-protopy and he has imnplemented some of the imbalanced-learn algorithms according to the fit_sample interface.

jnothman · 2017-02-28T01:15:00Z

One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model over a large collection of texts which maximises text classification performance with a small amount of labelled data (scikit-learn/scikit-learn#8468). In this case, the transformers need the unlabelled data at fit time, but the classifier does not, nor do they in transform/predict time.

A potential, if perhaps unparadigmatic, solution may combine the following:

class DiscardUnlabeled(BaseEstimator):
    def fit_resample(self, X, y):
        return X[y != -1], y[y != -1]

class SemiSupervisedSplit:
    def __init__(self, base_cv):
        self.base_cv = base_cv
    def split(self, X, y):
        has_label = y != -1
        has_label_idx = np.flatnonzero(has_label)
        no_label_idx = np.flatnonzero(~has_label)
        for train_idx, test_idx in self.base_cv.split(X[has_label_idx], y[has_label_idx]):
            yield np.concatenate(no_label_idx, has_label_idx[train_idx]), has_label_idx[test_idx]

amueller · 2017-03-31T14:20:02Z

I feel more and more that the most pragmatic solution might be by now to not introduce a new method, but instead allow X to be DataFrame that also contains the labels. That doesn't solve everything but it makes for a much more natural workflow for most people. Sent from phone. Please excuse spelling and brevity. On Feb 27, 2017 8:15 PM, "Joel Nothman" <notifications@github.com> wrote: One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model over a large collection of texts which maximises text classification performance with a small amount of labelled data. In this case, the transformers need the unlabelled data at fit time, but the classifier does not, nor does predict time. A potential, if perhaps unparadigmatic, solution may combine the following: class DiscardUnlabelled(BaseEstimator): def fit_resample(self, X, y): return X[y != -1], y[y != -1] class SemiSupervisedSplit: def __init__(self, base_cv): self.base_cv = base_cv def split(self, X, y): has_label = y != -1 has_label_idx = np.flatnonzero(has_label) no_label_idx = np.flatnonzero(~has_label) for train_idx, test_idx in self.base_cv.split(X[has_label_idx], y[has_label_idx]): yield np.concatenate(no_label_idx, has_label_idx[train_idx]), has_label_idx[test_idx] — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAbcFt-74PuX85iEZ0_nmEAQkhlZOdaqks5rg3UVgaJpZM4GTxW0> .

GaelVaroquaux · 2017-03-31T14:23:42Z

I feel more and more that the most pragmatic solution might be by now to not introduce a new method, but instead allow X to be DataFrame that also contains the labels. That doesn't solve everything but it makes for a much more natural workflow for most people.

I see a few limitations: * What if X is sparse (or anything others than columns, eg free text)? * What if y is more complex than a vector? (multi label, for instance) * I worry a lot about data leakage in cross-validation. Scikit-learn is quite good at preventing that (and people keep complaining that they cannot do their weird cross-validation thing that has implicit leakage :) ).

amueller · 2017-03-31T14:33:29Z

Sparse X and text can be handled by DataFrames, tough I haven't recently checked on how good the sparse support is. Multilabel y could be handled by a hierarchical index. I agree that separating X and y less than our current API might have downsides. More flexibility allows people to screw up more. Can you give an example of the cross-validation? The .split method had access to y so you should be able to do anything. Sent from phone. Please excuse spelling and brevity.

…

On Mar 31, 2017 10:24 AM, "Gael Varoquaux" ***@***.***> wrote: > I feel more and more that the most pragmatic solution might be by now to > not introduce a new method, but instead allow X to be DataFrame that also > contains the labels. That doesn't solve everything but it makes for a much > more natural workflow for most people. I see a few limitations: * What if X is sparse (or anything others than columns, eg free text)? * What if y is more complex than a vector? (multi label, for instance) * I worry a lot about data leakage in cross-validation. Scikit-learn is quite good at preventing that (and people keep complaining that they cannot do their weird cross-validation thing that has implicit leakage :) ). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAbcFg1ljv_np9GiajRuYhBj9gHofCzBks5rrQxwgaJpZM4GTxW0> .

GaelVaroquaux · 2017-03-31T14:37:29Z

Sparse X and text can be handled by DataFrames, tough I haven't recently checked on how good the sparse support is.

Not great, AFAIK, but I'd love to be proven wrong.

Multilabel y could be handled by a hierarchical index.

It makes things much harder and much more clumsy. Pandas is powerful, but hard to use.

Can you give an example of the cross-validation?

Mixing X and y in a transformer.

amueller · 2017-03-31T14:57:58Z

Text was never a problem iirc. Sparse data requires a copy and looks a bit half-baked: http://pandas.pydata.org/pandas-docs/stable/sparse.html

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

GaelVaroquaux · 2017-03-31T15:00:27Z

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

My students keep making errors that are very very hard to debug (I do too). They are handy, but lead to woodoo-style code. I don't think that it's a good basics for an API for scikit-learn.

jnothman · 2017-04-01T09:45:12Z

I don't see how this fixes things, for many cases. One major feature of pipeline is that it trains the final predictor to output predictions comparable to the evaluation ground truth. How would this be ensured if a dataframe is used?

…

On 1 Apr 2017 2:01 am, "Gael Varoquaux" ***@***.***> wrote: > I don't think hierarchical indices are clumsy, though they require some understanding of pandas. My students keep making errors that are very very hard to debug (I do too). They are handy, but lead to woodoo-style code. I don't think that it's a good basics for an API for scikit-learn. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6xSijyJmtE2rntORUtk4F8sRLTacks5rrRUMgaJpZM4GTxW0> .

ENH: add usecases for transform_y

f6afffb

Option A: meta estimators RST formatting Advance the discussion Restructured text layout RST formatting iter

GaelVaroquaux force-pushed the transform_y branch from 8cc8e20 to f6afffb Compare October 22, 2015 16:57

amueller reviewed Oct 22, 2015
View reviewed changes

mend

8097313

amueller reviewed Oct 22, 2015
View reviewed changes

Address all of @amueller's comments.

092e54a

GaelVaroquaux closed this Oct 23, 2015

datnamer mentioned this pull request Mar 2, 2016

consider removing python streaming backend altogether blaze/blaze#1300

Open

datnamer mentioned this pull request Mar 18, 2016

Blaze as a statistical and ML modeling front end DSL blaze/blaze#1450

Open

GaelVaroquaux mentioned this pull request May 19, 2016

Return NaN for rows with missing values scikit-learn/scikit-learn#6800

Closed

GaelVaroquaux mentioned this pull request Jul 20, 2016

WIP: sklearn-style encoding / modularizing encoding pipelines mne-tools/mne-python#3310

Closed

jnothman mentioned this pull request Mar 28, 2017

[RFC] Allow transformers on y scikit-learn/scikit-learn#4552

Closed

jnothman mentioned this pull request Aug 27, 2017

Feature Request: Pipelining Outlier Removal scikit-learn/scikit-learn#9630

Open

amueller pushed a commit to amueller/enhancement_proposals that referenced this pull request Dec 9, 2018

Resolves scikit-learn#2 - improve Disqus documentation

a780a9f

amueller merged commit fdfe421 into scikit-learn:master Dec 27, 2018

jnothman added a commit to jnothman/enhancement_proposals that referenced this pull request Aug 26, 2019

Some text about resampling pipelines and their issues (scikit-learn#2)

b989562

This was referenced Feb 17, 2020

SLEP needed: fit_transform does something other than fit(X).transform(X) #12

Open

SLEP005: Resampler API #15

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: transformers that modify y #2

WIP: transformers that modify y #2

WIP: transformers that modify y #2

WIP: transformers that modify y #2

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

1. Initial steps needs all samples, subsequent need a subset of samples

2. Semi supervision

3. Scoring

1. Base class

1.1 SamplerMixin

1.2 BaseBinarySampler vs. BaseMulticlassSampler

2. Pipeline object

Data reduction

Target transformation

Data conversion

1.1 `SamplerMixin`

1.2 `BaseBinarySampler` vs. `BaseMulticlassSampler`