WIP: transformers that modify y by GaelVaroquaux · Pull Request #2 · scikit-learn/enhancement_proposals · GitHub

WIP: transformers that modify y #2


Merged: 8 commits into scikit-learn:master on Dec 27, 2018

Conversation

@GaelVaroquaux (Member)

This is almost ready for early discussion:

The best way to look at the proposal is:
https://github.com/GaelVaroquaux/enhancement_proposals/blob/transform_y/slep001/discussion.rst

Commits:

  • Option A: meta estimators
  • RST formatting
  • Advance the discussion
  • Restructured text layout
  • RST formatting
  • iter
@amueller (Member)

I think you should state more clearly that this is for use within a chain or sequence of processing estimators. Many of these things are possible "by hand"; the point is that you don't want to write custom connecting logic.

an almost case-by-case basis, and for the advanced user, who needs to maintain a set of case-specific code

#. The "estimator heap" problem.
Member: maybe "chain" instead of heap and stack?

Member Author: "chain" is linear. But point taken; I am rephrasing this paragraph a bit.

Member: The meta-estimators I had in mind are always linear, and the new objects are always linear. FeatureUnion is the only thing that can make it a DAG, right?

Member: The difference between fit time and predict time, in the case of things like the previously discussed undersampler, effectively creates a conditional DAG, depending on whether you are at fit or predict time.

difficulties are mostly in user code, so we don't see them too much in
scikit-learn. Here are concrete examples

#. Trying to retrieve coefficients from a model estimated in a
Member: I think the point would be made stronger by a working example. You know there is a lasso in your stack and you want to get its coef_ [in whatever space that resides?].

pipeline.named_steps['lasso'].coef_

is possible. With a chain of meta-estimators this is tricky.


@amueller (Member)

Just a note: this misses the part that @agramfort wanted to write, the user-story code. @agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

It would probably be easy to come up with some outlier or resampling examples, but more "real-life" ones would be cool.

@amueller (Member)

Overall this looks great, and I mostly agree with the summary. The "should this be a new class" discussion is touched upon only briefly, and so is the "should the number of return arguments vary" discussion.

@amueller (Member)

ping @jnothman [no rush] @glouppe

@GaelVaroquaux (Member Author) commented Oct 22, 2015 via email

@GaelVaroquaux (Member Author) commented Oct 22, 2015 via email

@GaelVaroquaux (Member Author)

I just realized that this proposal adds conceptual complexity (and thus a burden for the user):

We will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):

  • fit_pipe
  • transform_pipe
  • transform

The question is: at fit time, in a pipeline, which of these methods takes priority over which when more than one is present? In my head it is pretty clear (it is the order listed above), but it might be confusing for the user.

We need both fit_pipe and transform_pipe to exist (and be different) to tackle things such as outlier detection. However, in most cases we probably do not need transform_pipe.
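
To make the rule concrete, a minimal sketch (the fit_step dispatcher and its exact signatures are hypothetical):

def fit_step(step, X, y):
    # At fit time, prefer the richest method: fit_pipe, then
    # transform_pipe, then plain transform.
    if hasattr(step, 'fit_pipe'):
        X, y = step.fit_pipe(X, y)          # may modify X and y at fit time
    elif hasattr(step, 'transform_pipe'):
        step.fit(X, y)
        X, y = step.transform_pipe(X, y)    # may also modify X and y
    else:
        X = step.fit(X, y).transform(X)     # plain transformer: X only
    return X, y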

@code-of-kpp

Hello!

I used to deal with such things too.

My current approach (which should work with stable sklearn) is to add some ABCs and monkey-patch Pipeline to use them.

An ABC is an abstract base class (it has hooks for isinstance calls):

>>> lab_enc = LabelEncoder()
...
>>> isinstance(lab_enc, ABC_Y_Transformer)
True

Pipeline is patched so that if an estimator step passes the isinstance check with ABC_Y_Transformer, then the step's fit and transform methods are called with y instead of X.

My current solution for such an ABC: hardcode all sklearn (and my own) classes in ABC_Y_Transformer.__instancecheck__ so that the isinstance(obj, ABC_Y_Transformer) call becomes obj.__class__ in predefined_set_of_classes.

This way I can put label transformers and X transformers in a single pipeline.

The same logic can be applied to other cases where X and y should be modified together.

Instead of this hacky ABC, sklearn could have two additional mixins, LabelTransformerMixin and XYTransformerMixin, and Pipeline could be patched the same way as with the ABC.

This way, the modifications to the sklearn code are small, new versions of sklearn will be (almost) compatible with previous ones, and the overall API changes are small.

Modifications needed: add the mixins, add the new parent class to the current label transformers, and make Pipeline respect the new mixins.
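
For concreteness, a minimal sketch of the __instancecheck__ trick described above (the helper names and the fit-time dispatcher are hypothetical, not actual sklearn code):

from abc import ABCMeta

from sklearn.preprocessing import LabelEncoder

class _YTransformerMeta(ABCMeta):
    # Hardcoded set of classes whose fit/transform should receive y, not X.
    _y_classes = {LabelEncoder}

    def __instancecheck__(cls, obj):
        # isinstance(obj, ABC_Y_Transformer) becomes a set-membership test.
        return obj.__class__ in cls._y_classes

class ABC_Y_Transformer(metaclass=_YTransformerMeta):
    pass

# In the patched Pipeline, each step would be dispatched roughly as:
def fit_transform_step(step, X, y):
    if isinstance(step, ABC_Y_Transformer):
        y = step.fit(y).transform(y)       # label transformers consume y
    else:
        X = step.fit(X, y).transform(X)    # ordinary transformers consume X
    return X, y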

@mblondel (Member)

Could you elaborate on the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it is never used in a pipeline.

@datnamer

As an end user, my dream would be to just wrap a Blaze data source (like an SQLite DB) and have it work in some sort of self-updating, reactive, out-of-core pipeline.

Odo and Blaze have some chunking/streaming functionality that might work for this.

I've seen this sort of thing mentioned before by other users (I can't recall the exact post, whether in a PR or the Google group)... so it at least should have a +1 :)

@code-of-kpp

Is it dead?

@amueller (Member) commented Dec 2, 2015

no, it's unfinished ;)

@kingjr commented Jul 20, 2016

We're facing some difficulties in MNE-Python related to the present proposal, and @GaelVaroquaux suggested that it may be useful to describe our use cases so as to motivate your decision. I hope this won't be too irrelevant.

1. Initial steps need all samples, subsequent ones need a subset

We often need to apply an estimator on a sliding window over X of shape (n_samples, n_features), where the samples are consecutive "time samples" and are locally correlated with one another. We're thus constructing a feature space from adjacent samples. Consequently, we're trying to fit a single coef vector of shape (n_time_samples * n_features) which predicts a y vector (this is analogous to auto-regressive models). This is a non-conventional design in itself, but the part relevant to the present issue comes next.

Specifically, to construct such a rolling time window we need all samples. However, in the following steps, many samples turn out to be detrimental to the fitting (typically, samples whose features are all 0). We thus need to exclude these samples from the subsequent steps in the pipeline.
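
To make the first point concrete, a minimal sketch with a hypothetical helper (this is not MNE-Python code):

import numpy as np

def rolling_window_xy(X, y, width):
    # Stack `width` consecutive time samples into each feature row.
    n_samples, n_features = X.shape
    n_windows = n_samples - width + 1
    Xw = np.stack([X[i:i + width].ravel() for i in range(n_windows)])
    yw = y[width - 1:]                 # align y with the end of each window
    # The windowing needs all samples; only afterwards can we drop the
    # degenerate ones, e.g. windows whose features are all zero.
    keep = ~np.all(Xw == 0, axis=1)
    return Xw[keep], yw[keep]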

2. Semi-supervision

It is not entirely clear to me how (or whether) the current proposal suggests informing each step of the pipeline about whether each value of y was known or has been modified by a previous step.

Specifically, one may need to do:

pipe = make_pipeline_xy(
    EstimatorFittedOnKnownYOnly(transform_values=np.nan),  # replaces y == np.nan by its predictions
    EstimatorFittedOnModifiedY(weight_y_known=.80),  # gives higher weight to known than to modified y
)

3. Scoring

I'm unclear about what the 'scoring' consequences of each alternative of the proposal will be.

So far, I've systematically written meta-estimators to handle these cases, but it's true that they tend to become a bit nested and difficult to handle.

@glemaitre (Member) commented Oct 22, 2016

During the development of imbalanced-learn, we actually confronted this issue, since we need to modify both X and y.

After discussing briefly with @GaelVaroquaux, he redirected us to the right discussion thread. We therefore thought that it could be a good idea to present succinctly the API choices that we made. Our aim is threefold: (i) we would like to be sure that our API will follow the choice of the scikit-learn community when such transformers get into the toolbox; (ii) we would like to transfer all implemented algorithms which meet the inclusion criteria of scikit-learn; and (iii) we would like to continue implementing "cutting-edge" methods that would be complementary to scikit-learn.

1. Base class

1.1 SamplerMixin

We decided to create a new mixin called SamplerMixin, since all its methods aim at resampling X and y. The estimator is made of 3 methods (a sketch follows the list):

  • fit: we use this method for checking inputs and computing the statistics needed at sample time. In fact, all of this processing could be moved into sample for the current algorithms, which comes back to the discussion at 2.2.2.1.
  • sample: we use this method to perform the resampling.
  • fit_sample: fit followed by sample.
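
A minimal sketch of this contract (simplified; the bodies are placeholders, not the actual imbalanced-learn implementation):

class SamplerMixin:
    def fit(self, X, y):
        # Validate inputs and compute the statistics needed at sample time.
        ...
        return self

    def sample(self, X, y):
        # Return a resampled (X_res, y_res) pair.
        raise NotImplementedError

    def fit_sample(self, X, y):
        # Convenience method: fit, then resample the same data.
        return self.fit(X, y).sample(X, y)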

1.2 BaseBinarySampler vs. BaseMulticlassSampler

Since some of the algorithms are not designed to work with multiple classes, we created 2 base classes inheriting from SamplerMixin that check the target type and potentially raise warnings. This is permissive for the moment, to stay compatible with the scikit-learn base estimator.

2. Pipeline object

We modified the scikit-learn Pipeline to accept any sampler in a pipeline. The sampling is only performed at fit time and not at predict time. This appears to be what one actually wants when correcting a balancing problem: change the balance during training, but test on the imbalanced dataset. Since TransformerMixin and SamplerMixin are distinct, we did not run into trouble with point 2.2.2.2-2, but at the cost of a new base class.

I hope I haven't forgotten anything. @chkoar @dvro, correct me if that is the case.

@dvro commented Oct 22, 2016

@glemaitre I'd also add that the imbalanced-learn implementation enables us to work with different approaches. Given a set of samples S = (X, y):

  1. Undersampling/Prototype Selection: returns a new set of samples S', with S' ⊂ S.
  2. Undersampling/Prototype Generation: returns a new set of samples S', with |S'| < |S|.
  3. Oversampling/Selection: returns a new set of samples S', with |S'| > |S|.
  4. Oversampling/Generation: returns a new set of samples S', with S ⊂ S'.
  5. Relabeling: returns the same set of samples, relabeled, as S'.

As you can see, some of these transformations could be implemented by returning a list of indices, so that X, y = X[idx, :], y[idx]. However, other transformations actually require returning a new X and y (Prototype Generation and Relabeling). Therefore, the way I see it, a fit_filter would not work in this second case; the sketch below illustrates the contrast.
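
A minimal sketch of that contrast (the class names and the selection/generation rules here are made up for illustration):

import numpy as np

class RandomUndersampler:
    # Prototype selection: expressible as a subset of indices.
    def fit_sample(self, X, y):
        rng = np.random.default_rng(0)
        idx = rng.choice(len(y), size=len(y) // 2, replace=False)
        return X[idx, :], y[idx]

class MeanPrototypeGenerator:
    # Prototype generation: returns newly synthesized X and y, which an
    # index-based (fit_filter-style) API cannot express.
    def fit_sample(self, X, y):
        classes = np.unique(y)
        X_new = np.stack([X[y == c].mean(axis=0) for c in classes])
        return X_new, classes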

That being said, I really like @GaelVaroquaux's fit_modify implementation, which also works in these 5 cases. Let us know how we can help get this released.

@jnothman (Member) commented Feb 20, 2017

I think we have a few distinct cases that should be handled separately (if at all).

Data reduction

Resampling and outlier removal at fit time can be done with CV splitters (or by transforming sample_weight, which is another question...). They're not a particularly interesting case. Other data reduction, e.g. via BIRCH, is interesting, and yet it makes sense for all these cases to share an interface and be supported by Pipeline.

I think for all of these cases imbalanced-learn has got it more-or-less right. Sampler objects should not do anything at predict/transform time, only at fit time. From the perspective of Pipeline, only fit_sample matters. @glemaitre, do you have good use-cases for having a separate sample method? I can imagine its use in out-of-core learning where we want to maintain outlier statistics for a first batch, but again pipeline does not support this case.

Until we find a compelling use-case, I think fit_sample and [fit_]transform on the same object should be disallowed by Pipeline. I am inclined to implement the imbalanced-learn API.

[Edit 2 days later: I discovered I proposed much the same at https://gist.github.com/jnothman/274710f945e311697466]
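
A rough sketch of this behaviour (not actual scikit-learn or imbalanced-learn code; samplers act at fit time only and pass data through untouched afterwards):

def _fit_step(step, X, y):
    if hasattr(step, 'fit_sample'):
        return step.fit_sample(X, y)       # resample the training data
    return step.fit_transform(X, y), y     # ordinary transformer

def _transform_step(step, X):
    if hasattr(step, 'fit_sample'):
        return X                           # samplers are a no-op here
    return step.transform(X)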

Target transformation

Transforming targets, as in "scaling y and scaling back after prediction" mentioned above, seems a better fit for a meta-estimator, as it:

  • bookends a particular process (applies before and after)
  • is just as likely to be applied around a single transformer or predictor as around a pipeline subsequence
  • does not lend itself to appending the reverse transformation to a pipeline, as this would change the special semantics of the last step of a pipeline.

I think we should have a ?TransformedTargetPredictor:

from sklearn.base import BaseEstimator, clone

class TransformedTargetPredictor(BaseEstimator):
    def __init__(self, estimator, func, inverse_func=None):
        self.estimator = estimator
        self.func = func
        self.inverse_func = inverse_func

    def fit(self, X, y=None, **kwargs):
        self.estimator_ = clone(self.estimator)
        # Fit the wrapped estimator on the transformed targets.
        self.estimator_.fit(X, self.func(y), **kwargs)
        return self

    def predict(self, X):
        y = self.estimator_.predict(X)
        if self.inverse_func is None:
            return y
        # Map predictions back to the original target space.
        return self.inverse_func(y)
This would at least establish a best practice for such things, allowing people to build similar transformations (for what use case, I don't know) dependent on X, y.
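
For instance, a hypothetical usage, fitting on log-transformed targets and predicting back in the original space:

import numpy as np
from sklearn.linear_model import Ridge

# Fits Ridge on np.log1p(y); predict() returns np.expm1 of the wrapped
# model's predictions, i.e. values on the original scale.
reg = TransformedTargetPredictor(Ridge(), func=np.log1p, inverse_func=np.expm1)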

Data conversion

This is the challenging case, as meta-estimators do not feel appropriate. However, it seems that we're breaking the intuition of our pipeline/scoring design if we allow this kind of transformation within a Pipeline. The contract that the output of a Pipeline's predict should look like the y that is passed into its fit seems an essential one to me. I'm therefore not convinced that it deserves sharing the same API as resampling. Hmm.

@glemaitre (Member)

@glemaitre, do you have good use-cases for having a separate sample method?

I would say no. fit_sample alone seems more appropriate, as you suggested.

@jnothman (Member)

Btw, an implementation and example of using a meta-CV splitter for resampling is at https://gist.github.com/jnothman/6bddbbcca71bdf9fd37e8495d70b42e8. Again, the main problem is that it does not support other training modifications such as data reduction or insertion of label noise.

@jnothman (Member)

FWIW, https://github.com/dvro/scikit-protopy/blob/master/protopy has an equivalent of "fit_[re]sample" called "reduce_data"

@chkoar commented Feb 23, 2017

@jnothman yes. @dvro is the author of scikit-protopy, and he has implemented some of the imbalanced-learn algorithms according to the fit_sample interface.

@jnothman (Member) commented Feb 28, 2017

One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model, over a large collection of texts, which maximises text classification performance with a small amount of labelled data (scikit-learn/scikit-learn#8468). In this case, the transformers need the unlabelled data at fit time, but the classifier does not, and neither needs it at transform/predict time.

A potential, if perhaps unparadigmatic, solution may combine the following:

import numpy as np
from sklearn.base import BaseEstimator

class DiscardUnlabeled(BaseEstimator):
    def fit_resample(self, X, y):
        # Keep only the labelled samples (unlabelled ones are marked -1).
        return X[y != -1], y[y != -1]

class SemiSupervisedSplit:
    def __init__(self, base_cv):
        self.base_cv = base_cv

    def split(self, X, y):
        has_label = y != -1
        has_label_idx = np.flatnonzero(has_label)
        no_label_idx = np.flatnonzero(~has_label)
        for train_idx, test_idx in self.base_cv.split(X[has_label_idx], y[has_label_idx]):
            # Train on all unlabelled samples plus the labelled training
            # fold; evaluate only on held-out labelled samples.
            yield (np.concatenate([no_label_idx, has_label_idx[train_idx]]),
                   has_label_idx[test_idx])
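
A hypothetical usage, wrapping a standard splitter (note the sketch implements only split, not the full CV-splitter interface):

from sklearn.model_selection import StratifiedKFold

# Each training fold contains every unlabelled sample plus 4/5 of the
# labelled ones; each test fold contains only labelled samples.
cv = SemiSupervisedSplit(StratifiedKFold(n_splits=5))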

@amueller (Member) commented Mar 31, 2017 via email

@GaelVaroquaux (Member Author) commented Mar 31, 2017 via email

@amueller (Member) commented Mar 31, 2017 via email

@GaelVaroquaux (Member Author) commented Mar 31, 2017 via email

@amueller (Member)

Text was never a problem, IIRC. Sparse data requires a copy and looks a bit half-baked: http://pandas.pydata.org/pandas-docs/stable/sparse.html

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

@GaelVaroquaux (Member Author) commented Mar 31, 2017 via email

@jnothman (Member) commented Apr 1, 2017 via email

amueller pushed a commit to amueller/enhancement_proposals that referenced this pull request Dec 9, 2018
amueller merged commit fdfe421 into scikit-learn:master on Dec 27, 2018
jnothman added a commit to jnothman/enhancement_proposals that referenced this pull request Aug 26, 2019