WIP: transformers that modify y #2
Conversation
Commits (force-pushed from 8cc8e20 to f6afffb): Option A: meta estimators; RST formatting; Advance the discussion; Restructured text layout; RST formatting iter.
---

I think you should state more clearly that this is for within a chain or processing sequence of estimators. Doing many of these things is possible "by hand"; the point is that you don't want to write custom connecting logic.
---

> an almost case-by-case basis, and for the advanced user, that needs to
> maintain a set of case-specific code
>
> #. The "estimator heap" problem.

Maybe "chain" instead of heap and stack?

"chain" is linear. But point taken; I am rephrasing this paragraph a bit.

The meta-estimators I had in mind are always linear, and the new objects are always linear. FeatureUnion is the only thing that can make it a DAG, right?

The difference between fit time and predict time, in the case of things like the previously discussed undersampler, effectively creates a conditional DAG, depending on whether you are at fit or predict time.
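To illustrate the "conditional DAG" point, here is a minimal sketch. The helper functions are hypothetical; fit_resample follows imbalanced-learn's convention, which comes up later in this thread:

```python
# With an undersampler in the chain, the fit-time path and the
# predict-time path differ: the sampler node exists only at fit time.
def chain_fit(sampler, estimator, X, y):
    X_res, y_res = sampler.fit_resample(X, y)  # node active only at fit time
    return estimator.fit(X_res, y_res)

def chain_predict(estimator, X):
    return estimator.predict(X)  # the sampler node is skipped entirely
```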
---

> difficulties are mostly in user code, so we don't see them too much in
> scikit-learn. Here are concrete examples
>
> #. Trying to retrieve coefficients from a model estimated in a

I think the point would be made stronger by a working example. You know there is a lasso in your stack and you want to get its coef_ (in whatever space that resides). pipeline.named_steps['lasso'].coef_ is possible. With a chain of meta-estimators this is tricky.
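A runnable sketch of the flat case the comment describes (step names and toy data are invented):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X, y = rng.rand(20, 5), rng.rand(20)

pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.1))])
pipe.fit(X, y)
print(pipe.named_steps["lasso"].coef_)  # coefficients live in the scaled space

# With nested meta-estimators, the same attribute sits one level deeper per
# wrapper, e.g. something like meta.estimator_.named_steps["lasso"].coef_,
# and the wrapper's attribute name varies from one meta-estimator to another.
```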
---

Just a note: this misses the part that @agramfort wanted to write, user-story code. It would probably be easy to come up with some outlier or resampling examples, but more "real life" ones would be cool.
---

Overall this looks great, and I mostly agree with the summary. The "should this be a new class" discussion is touched upon only briefly, and so is the "should the number of return arguments vary" discussion.
---

> I think you should state more clearly that this is for within a chain or
> processing sequence of estimators. Doing many of these things is possible "by
> hand"; the point is that you don't want to write custom connecting logic.

Thanks. Addressed.

> Just a note: this misses the part that @agramfort wanted to write, user-story code.

@agramfort said he already has ugly solutions that would be nicer with the new interface. It would be good to see them.

ping @agramfort
---

I just realized that this proposal adds conceptual complexity (and thus a burden for the user): we will now have 3 methods that can modify X and are legitimate in a pipeline (interim names for now):

The question is: at fit time, in a pipeline, which of these methods has priority over which, if more than one of them is present? In my head it is pretty clear (it's in the order listed above), but it might be confusing for the user. We need fit_pipe and transform_pipe to both exist (and be different) to tackle things such as outlier detection. However, in most cases we probably do not need transform_pipe.
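A sketch of what such a fit-time priority rule could look like. This assumes, purely for illustration, that the three methods are fit_pipe, transform_pipe, and plain transform (only the first two are named explicitly above), that fit_pipe wins over transform_pipe, which wins over transform, and that the pipe methods return a modified (X, y) pair:

```python
# Hypothetical dispatch for one pipeline step at fit time.
def fit_step(step, X, y):
    if hasattr(step, "fit_pipe"):
        # fit-time-only hook: may modify X and y together
        return step.fit_pipe(X, y)
    if hasattr(step, "transform_pipe"):
        # fit first, then apply the pipeline-specific transform
        step.fit(X, y)
        return step.transform_pipe(X, y)
    # plain transformer: y passes through unchanged
    step.fit(X, y)
    return step.transform(X), y
```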
---

Hello! I used to deal with such things too. My current approach (which should work with stable sklearn) is to add some ABCs and monkey-patch Pipeline. An ABC is an abstract base class (it has hooks for isinstance checks):

```python
>>> lab_enc = LabelEncoder()
...
>>> isinstance(lab_enc, ABC_Y_Transformer)
True
```

Pipeline is patched so that estimators matching the ABC are handled as y-transformers. My current solution for such an ABC is to hardcode all sklearn (and my own) classes in it. This way I can put label transformers and ordinary transformers in the same pipeline. The same logic can be applied to other cases. Instead of this hacky ABC, sklearn can have two additional mixins. This way, modifications to sklearn code will be small, the new version of sklearn will be (almost) compatible with previous ones, and the overall API changes are small. Modifications needed: add mixins, add a parent to the current label transformers, make Pipeline respect the new mixins.
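A minimal sketch of the marker-ABC trick described above. The class name ABC_Y_Transformer comes from the comment's doctest; the registration mechanism shown here uses abc's standard virtual-subclass support and is my guess at the mechanics:

```python
import abc
from sklearn.preprocessing import LabelEncoder

class ABC_Y_Transformer(abc.ABC):
    """Marker ABC for estimators that transform y rather than X."""

# "Hardcode" existing classes by registering them as virtual subclasses:
ABC_Y_Transformer.register(LabelEncoder)

lab_enc = LabelEncoder()
assert isinstance(lab_enc, ABC_Y_Transformer)  # True, without inheritance

# A patched Pipeline can then branch on isinstance(step, ABC_Y_Transformer)
# to route y through such steps instead of X.
```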
---

Could you elaborate on the issues with the current design and why we can't transform y at the moment? We do have LabelBinarizer, although it's never used in a pipeline.
---

As an end user, my dream would be to just wrap a Blaze data source (like a SQLite db) and have it work in some sort of self-updating, reactive, out-of-core pipeline. Odo and Blaze have some chunking/streaming functionality that might work for this. I've seen this sort of thing mentioned before by other users (can't recall the exact post, whether in a PR or the Google group)... so it at least should have a +1 :)
---

Is it dead?

---

No, it's unfinished ;)
---

We're facing some difficulties in MNE-Python related to the present proposal, and @GaelVaroquaux suggested that it may be useful to describe our use cases so as to motivate your decision. I hope this won't be too irrelevant.

**1. Initial steps need all samples, subsequent steps need a subset of samples**

We often need to apply an estimator on a sliding window over the data. To construct such a rolling time window, we thus need all samples. However, in the following steps, many samples happen to be detrimental to the fitting (e.g., typically, many samples have all their features = 0). We thus need to exclude these samples from the subsequent steps in the pipeline.

**2. Semi-supervision**

It is not entirely clear to me how/whether the current proposal suggests informing each step of the pipeline about whether each value of y is known or not. Specifically, one may need to do:

**3. Scoring**

I'm unclear about what the 'scoring' consequences of each alternative of the proposal will be.

So far, I've systematically written meta-estimators to handle these cases, but it's true that they tend to become a bit nested and difficult to handle.
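To make use case 1 concrete, here is a minimal sketch following the fit_resample convention discussed elsewhere in this thread; the class name and the all-zero criterion are illustrative, not from the comment:

```python
import numpy as np

class DropEmptySamples:
    """Hypothetical resampling step: drop samples whose features are all zero.

    Would only act at fit time and be a no-op at predict time.
    """
    def fit_resample(self, X, y):
        keep = ~np.all(X == 0, axis=1)
        return X[keep], y[keep]
```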
---

During the development of imbalanced-learn, we actually confronted the issue, since we need to modify both X and y. After discussing briefly with @GaelVaroquaux, he redirected us to the right discussion thread.

**1. Base class**

1.1
---

@glemaitre I'd also add that the imbalanced-learn implementation enables us to work with different approaches. Given a set of samples S = (X, y):

As you can see, some of these transformations could be implemented by returning a list of indices. That being said, I really like @GaelVaroquaux's fit_modify implementation, which also works in these 5 cases. Let us know how we can help get this released.
---

I think we have a few distinct cases that should be handled separately (if at all).

**Data reduction**

Resampling and outlier removal at fit time can be done with CV splitters (or by transforming the data upfront). I think for all of these cases imbalanced-learn has got it more-or-less right: sampler objects should not do anything at predict/transform time, only at fit time. Until we find a compelling use-case, I think

[Edit 2 days later: I discovered I proposed much the same at https://gist.github.com/jnothman/274710f945e311697466]

**Target transformation**

Transforming targets, as in "scaling y and scaling back after prediction" mentioned above, seems a better fit for a meta-estimator, as it:

I think we should have a ?

This would at least establish a best practice for such things, allowing people to build similar transformations (for what use case, I don't know) dependent on

**Data conversion**

This is the challenging case, as meta-estimators do not feel appropriate. However, it seems that we're breaking the intuition of our pipeline/scoring design if we allow this kind of transformation within a Pipeline.
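As a concrete illustration of the meta-estimator route for target transformation, here is a minimal sketch; the class name and details are invented (scikit-learn later added a similar TransformedTargetRegressor, but this is not its actual implementation):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

class TransformedTargetMeta(BaseEstimator, RegressorMixin):
    """Hypothetical meta-estimator: transform y at fit time and invert
    the transformation after prediction, leaving X untouched."""

    def __init__(self, regressor, transformer):
        self.regressor = regressor
        self.transformer = transformer

    def fit(self, X, y):
        self.transformer_ = clone(self.transformer)
        y_trans = self.transformer_.fit_transform(
            np.asarray(y).reshape(-1, 1)).ravel()
        self.regressor_ = clone(self.regressor).fit(X, y_trans)
        return self

    def predict(self, X):
        y_trans = self.regressor_.predict(X).reshape(-1, 1)
        return self.transformer_.inverse_transform(y_trans).ravel()

rng = np.random.RandomState(0)
X, y = rng.rand(30, 4), 100 + 50 * rng.rand(30)
model = TransformedTargetMeta(Ridge(), StandardScaler()).fit(X, y)
print(model.predict(X[:3]))  # predictions come back in the original y scale
```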
---

I would say no. A
---

Btw, an implementation and example of using a meta-CV splitter for resampling is at https://gist.github.com/jnothman/6bddbbcca71bdf9fd37e8495d70b42e8. Again, the main problem is that it does not support other training modifications such as data reduction or insertion of label noise.
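For readers skimming past the gist, a minimal version of the idea might look like this; this is my own sketch of a resampling CV wrapper, not the gist's code:

```python
import numpy as np

class UndersamplingCV:
    """Sketch: wrap a CV splitter and undersample the majority class in
    each training fold, leaving test folds untouched."""

    def __init__(self, base_cv, random_state=0):
        self.base_cv = base_cv
        self.rng = np.random.RandomState(random_state)

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train, test in self.base_cv.split(X, y, groups):
            classes, counts = np.unique(y[train], return_counts=True)
            n_min = counts.min()
            # keep n_min samples of each class in the training fold
            kept = np.concatenate([
                self.rng.choice(train[y[train] == c], n_min, replace=False)
                for c in classes
            ])
            yield np.sort(kept), test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.base_cv.get_n_splits(X, y, groups)
```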
---

FWIW, https://github.com/dvro/scikit-protopy/blob/master/protopy has an equivalent of "fit_[re]sample" called "reduce_data".
---

One arguable use case not considered here is (parameter search in) semi-supervised learning. Assume we want to select the LDA model over a large collection of texts which maximises text classification performance with a small amount of labelled data (scikit-learn/scikit-learn#8468). In this case, the transformers need the unlabelled data at fit time, but the classifier does not, nor do they at transform/predict time. A potential, if perhaps unparadigmatic, solution may combine the following:

```python
import numpy as np
from sklearn.base import BaseEstimator

class DiscardUnlabeled(BaseEstimator):
    """Resampler that drops unlabelled samples (marked y == -1) at fit time."""

    def fit_resample(self, X, y):
        return X[y != -1], y[y != -1]

class SemiSupervisedSplit:
    """CV splitter that adds every unlabelled sample to each training split."""

    def __init__(self, base_cv):
        self.base_cv = base_cv

    def split(self, X, y):
        has_label = y != -1
        has_label_idx = np.flatnonzero(has_label)
        no_label_idx = np.flatnonzero(~has_label)
        for train_idx, test_idx in self.base_cv.split(
                X[has_label_idx], y[has_label_idx]):
            # np.concatenate expects a sequence of arrays
            yield (np.concatenate([no_label_idx, has_label_idx[train_idx]]),
                   has_label_idx[test_idx])
```
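For concreteness, here is how the two classes above might be wired together by hand; the toy data and the manual CV loop are my own additions, not part of the comment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.r_[rng.randint(0, 2, 20), -np.ones(20, dtype=int)]  # -1 = unlabelled

resampler = DiscardUnlabeled()
for train_idx, test_idx in SemiSupervisedSplit(KFold(5)).split(X, y):
    # unsupervised steps could use all of X[train_idx] here;
    # the classifier only ever sees the labelled subset
    X_fit, y_fit = resampler.fit_resample(X[train_idx], y[train_idx])
    clf = LogisticRegression().fit(X_fit, y_fit)
    print(clf.score(X[test_idx], y[test_idx]))
```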
---

I feel more and more that the most pragmatic solution might by now be to not introduce a new method, but instead allow X to be a DataFrame that also contains the labels. That doesn't solve everything, but it makes for a much more natural workflow for most people.
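Purely illustrative of what "X carries its own labels" might look like; this is not an existing sklearn API, and the column names are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0.1, 0.4, 0.3, 0.8],
    "feature_2": [1.2, 0.7, 0.9, 0.2],
    "label":     [0, 1, 0, 1],
})

# A resampling step could drop rows and keep features and labels in sync:
subset = df[df["feature_2"] > 0.5]
X, y = subset.drop(columns="label"), subset["label"]
```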
---

> I feel more and more that the most pragmatic solution might by now be to not
> introduce a new method, but instead allow X to be a DataFrame that also
> contains the labels. That doesn't solve everything, but it makes for a much
> more natural workflow for most people.

I see a few limitations:

* What if X is sparse (or anything other than columns, e.g. free text)?
* What if y is more complex than a vector (multi-label, for instance)?
* I worry a lot about data leakage in cross-validation. Scikit-learn is quite good at preventing that (and people keep complaining that they cannot do their weird cross-validation thing that has implicit leakage :)).
---

Sparse X and text can be handled by DataFrames, though I haven't recently checked on how good the sparse support is. Multilabel y could be handled by a hierarchical index.

I agree that separating X and y less than our current API does might have downsides. More flexibility allows people to screw up more.

Can you give an example of the cross-validation issue? The .split method has access to y, so you should be able to do anything.
---

> Sparse X and text can be handled by DataFrames, though I haven't recently checked on how good the sparse support is.

Not great, AFAIK, but I'd love to be proven wrong.

> Multilabel y could be handled by a hierarchical index.

It makes things much harder and much more clumsy. Pandas is powerful, but hard to use.

> Can you give an example of the cross-validation issue?

Mixing X and y in a transformer.
---

Text was never a problem, IIRC. Sparse data requires a copy and looks a bit half-baked: http://pandas.pydata.org/pandas-docs/stable/sparse.html

I don't think hierarchical indices are clumsy, though they require some understanding of pandas.
---

> I don't think hierarchical indices are clumsy, though they require some understanding of pandas.

My students keep making errors that are very, very hard to debug (I do too). They are handy, but lead to voodoo-style code. I don't think it's a good basis for an API for scikit-learn.
---

I don't see how this fixes things for many cases. One major feature of Pipeline is that it trains the final predictor to output predictions comparable to the evaluation ground truth. How would this be ensured if a DataFrame is used?
---

This is almost ready for early discussion. The best way to look at the proposal is:
https://github.com/GaelVaroquaux/enhancement_proposals/blob/transform_y/slep001/discussion.rst