
add indicator features in imputer #6556


Closed · amueller opened this issue Mar 16, 2016 · 55 comments

Labels: Moderate (Anything that requires some knowledge of conventions and best practices)

@amueller (Member)

The imputer should have an option to add a binary indicator feature (for each possibly missing feature) that indicates whether a feature was imputed or not (0 or 1).
In applications, missingness is often informative.

@amueller added the New Feature, Easy (Well-defined and straightforward way to resolve), and Need Contributor labels on Mar 16, 2016
@GaelVaroquaux (Member) commented Mar 16, 2016 via email

@giorgiop (Contributor)

+1. I end up implementing it myself all the time.

@maniteja123 (Contributor)

Hi, if I may, I would like to work on this. If I understand it right, only in the case of axis=0, the binary indicator of size n_features is needed to indicate whether a feature was imputed. Please correct me if something else is expected here. Thanks.

@maniteja123 (Contributor)

@amueller Sorry to disturb you, but since @rvraghav93 has also mentioned this, it would be helpful if you could give some insight into what kind of approach is expected here. Please do reply whenever possible. Thanks!

@raghavrv (Member)

If I understand it right, only in the case of axis=0, the binary indicator of size n_features is needed to indicate whether a feature was imputed.

No. It is applicable for both axis=0 and axis=1. The axis specification is only used to derive the imputed value (i.e. to specify whether we want the mean over all values of that particular feature or over all feature values of that particular sample).

What we want here is to generate a simple is_imputed indicator matrix of shape (n_samples x n_features). is_imputed[i, j] will be 1 if X[i, j] was missing_values before imputation and 0 otherwise.

Whether we need this matrix to be returned or to be concatenated to the original matrix should be discussed. @amueller @GaelVaroquaux

But you can start your work assuming it has to be concatenated to the original matrix without waiting for the discussion, as changing it to the other approach will not require much work.

EDIT: Basically the is_imputed mask is the return value of this existing helper function.
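
For concreteness, a minimal sketch of how such a mask could be computed (the is_imputed_mask helper below is illustrative, not an existing scikit-learn function):

```python
import numpy as np

def is_imputed_mask(X, missing_values=np.nan):
    # 1 where X held missing_values before imputation, 0 otherwise.
    if np.isnan(missing_values):
        return np.isnan(X).astype(int)
    return (X == missing_values).astype(int)

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])
print(is_imputed_mask(X))
# [[0 1]
#  [1 0]]
```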

@maniteja123 (Contributor)

Thanks for the clarification! I had commented before looking more into imputation methods. I understand now that axis just specifies whether to impute along columns (features) or rows (samples). I have one doubt though: in the transform method the returned matrix will be of shape (n_samples x n_features_new), but the indicator matrix will be of shape (n_samples x n_features), right? So how does the concatenation approach work here? Sorry if I misunderstood what you meant.

@raghavrv (Member)

but the indicator matrix will be of shape (n_samples x n_features), right? So how does the concatenation approach work here? Sorry if I misunderstood what you meant.

No, you understood it correctly! So the returned X will have a shape of (n_samples x (2*n_features)). Whether or not this is acceptable needs to be discussed.

I.e. the new X is np.hstack((X, is_imputed)).

One more thing to discuss is whether we would want to interleave X and is_imputed instead of hstack-ing them.
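
A rough illustration of the two layouts under discussion (all names are illustrative):

```python
import numpy as np

X_imp = np.array([[1., 2.],
                  [3., 4.]])        # data after imputation
mask = np.array([[0., 1.],
                 [1., 0.]])         # is_imputed indicator

stacked = np.hstack((X_imp, mask))  # column order: f0, f1, m0, m1

interleaved = np.empty((2, 4))
interleaved[:, 0::2] = X_imp        # column order: f0, m0, f1, m1
interleaved[:, 1::2] = mask
```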

+1. I end up implementing it myself all the time.

@giorgiop Could you elaborate on your implementation, please?

@jnothman (Member)

I think what's missing here for @maniteja123 is motivation (it was missing for me too until @rvraghav93 described the expected output more clearly): we want the estimator that is learning from the imputed data to be able to exploit the knowledge that some values come with a higher degree of error than others.

It would be nice to see this used effectively in an example. But I'm not sure what sort of data would show its utility.

Btw, is there any value in only adding indicator columns containing at least one 1, rather than redundant empty features?

@raghavrv (Member)

For the example, the census income dataset (the original one with the missing values) is a good one. You can see how people conveniently choose not to answer some questions ;)

The other is a bankruptcy dataset quoted in a paper by Ding and Simonoff on handling missing values in decision trees. I haven't found a link to that dataset. Their claim is that in the dataset the probability of a bank going bankrupt is higher when the bank chooses not to reveal data for certain questions (features).

For the second question: assuming I understood correctly that you are asking about columns with no missing data, yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

@giorgiop (Contributor)

The motivation is the following: imputation implicitly discards the information that certain examples (rows) originally had missing values. Therefore, before imputation, we add a column for each feature that has at least one missing value. The additional column is 0 everywhere, except that it is 1 when the corresponding value is missing.

There is no need to double the size of the input in general. Columns should only be added for features that have missing values. This avoids redundancies.
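
A minimal sketch of that pruned variant, assuming NaN marks missing values (the helper name is illustrative):

```python
import numpy as np

def add_missing_indicators(X):
    mask = np.isnan(X)
    # Indicator columns only for features with at least one missing value.
    with_missing = np.flatnonzero(mask.any(axis=0))
    return np.hstack([X, mask[:, with_missing].astype(float)]), with_missing

X = np.array([[1., np.nan, 3.],
              [4., 5., 6.]])
X_new, cols = add_missing_indicators(X)
# X_new has 4 columns: the 3 originals plus one indicator for column 1;
# cols == array([1]) records which features got indicators.
```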

@jnothman (Member)

yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

I don't think imputed_features_ = [1, 2, 4, 7] is cumbersome to create, or to use! Also, get_feature_names can help.

@giorgiop (Contributor)

For the second question: assuming I understood correctly that you are asking about columns with no missing data, yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

True. That's a case where feature names (#6425) would be very useful.

@jnothman (Member)

That's a case where feature names (#6425) would be very useful.

Snap!

@raghavrv (Member)

That's an interesting issue I totally missed ;)

@giorgiop (Contributor)

Snap!

Must be a kind of Australian lunch-time deja-vu effect.

@jnothman (Member)

Must be a kind of Australian lunch-time deja-vu effect.

;)

@maniteja123 (Contributor)

Thanks all for the motivation! I don't have enough exposure to the practical aspects to take an initial stance. That's why I thought it would be prudent to continue only after taking your advice; that's how I learn from most of the discussions here. My apologies if I am slow in my understanding.

So, the idea suggested here is to append, for each feature that had at least one missing value, a column which is 1 where the value was imputed. This would mean the output will be of shape (n_samples x (n_features_new + n_features_with_missing)). If we have get_feature_names it would be easy to map to the original features in the transformed output.

If I understood this correctly, then what about the features that were removed during transform because they had all missing data and axis=0? Please do let me know if I got it right so that I can proceed. Thanks!

@raghavrv (Member)

Joel... : I think what's missing here for @maniteja123 is motivation

By "motivation", Joel refers to the motive of adding this feature. ;)

Maniteja123: (n_samples x (n_features_new + n_features_with_missing)).

Why the n_features_new? Are any of the feature columns dropped at any point?

@jnothman (Member)

By "motivation", Joel refers to the motive of adding this feature. ;)

Yes: motivation in an argumentative sense, not an emotional one!

@maniteja123 (Contributor)

Thanks @rvraghav93, I just wanted to mention that this is where I get to learn the practical implications and the effort taken to satisfy various use cases, and that is the reason I ask for your valuable input :)

And regarding the shape, the documentation here says

When axis=0, columns which only contained missing values at fit are discarded upon transform.

That does mean features will get dropped when axis=0 and a feature has all missing values, right?

@raghavrv (Member)

That does mean features will get dropped when axis=0 and a feature has all missing values, right?

Ah, I didn't know that. Well, in that case yes, you are right: it will indeed be (n_samples x (n_features_with_no_all_missing + n_features_with_partial_missing))! I think you can go ahead with this design and send a PR!

@maniteja123 (Contributor)

Thanks for the clarification. Will try implementing and get back with doubts!

And sorry, Joel, if I seemed emotional. I understand what you implied :).

@maniteja123 (Contributor)

Just a small doubt: at the end of the transform method, the filling is done as follows:

if self.axis == 0:
    # Transpose the mask and reverse the (row, col) index tuple so the
    # coordinates come out in column-major order (see discussion below).
    coordinates = np.where(mask.transpose())[::-1]
else:
    coordinates = mask
X[coordinates] = values

Wouldn't something like X[mask] = values suffice here? It is not relevant to what needs to be done, just out of curiosity.

Also, regarding the new shape of X: the first problem is that it would break existing user code if we change the shape of the output, right? And if it is indeed fine to change, what should be the approach for adding the indicator columns?

  • Right next to the imputed column? (This would help in mapping to the corresponding feature.)

  • Stacked after the transformed matrix? (This would need some additional property to know which features had partial NaNs (when axis=0) or NaNs (when axis=1).)

    Thanks.

@raghavrv (Member)

Wouldn't something like X[mask] = values suffice here?

Try working with this toy problem to understand why that won't work: Imputer(axis=0, missing_values=0).fit_transform([[0, 1, 1, 99], [1, 5, 0, 1], [5, 8, 8, 1], [8, 0, 99, 1]])
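
A rough sketch of what the toy problem reveals, reconstructed from the old Imputer's transform (variable names are illustrative): with axis=0 the fill values are built column by column, while boolean indexing with X[mask] assigns row by row, so the two orderings disagree.

```python
import numpy as np

X = np.array([[0., 1., 1., 99.],
              [1., 5., 0., 1.],
              [5., 8., 8., 1.],
              [8., 0., 99., 1.]])
mask = (X == 0)  # missing_values=0

# Column means over non-missing entries, repeated once per missing
# entry in that column -- i.e. in column-major order.
col_means = np.array([X[~mask[:, j], j].mean() for j in range(X.shape[1])])
values = np.repeat(col_means, mask.sum(axis=0))

# X[mask] = values would walk the mask in row-major (C) order and
# assign column 1's fill value to the missing entry in column 2 (and
# vice versa). Transposing the mask and reversing the index tuple
# yields column-major coordinates that line up with `values`:
coordinates = np.where(mask.T)[::-1]
X[coordinates] = values
```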

I would prefer stacking and adding an attribute as Joel suggested to indicate which columns had missing values originally.

@maniteja123 (Contributor)

Oh thanks, I understand now: there can be rows with no missing values, but that can't be the case for columns.
EDIT: I didn't understand it then, but got it now by toying with the example. Idiomatic numpy still gets the better of me :)

@maniteja123 (Contributor)

I have created a PR with minimal changes in #6607. I am not sure it is the best approach, but I did spend time verifying that it works as expected. Please have a look at your convenience and give your suggestions. Thanks.

@amueller (Member, Author)

Where are the bugs reported (sorry, I'm kinda lost these days)? So there are features newly missing in transform, for which I saw the discussion in #6767. That's also the design bug, I guess, and there's a separate bug in the sparse case?

@jnothman (Member)

I'd been trying to fix up #6782, but I'm going to step back and rewrite the imputer test suite.

@jnothman (Member)

The main bug is #6767. I've just noted the sparse edge case here. I think it mostly needs the code to be simplified. But my point is that this feature is not releasable, and we have to choose whether we wait for it or revert it.

@amueller (Member, Author)

Depends a bit on the timeline for the release. Actually, I'd love your input on what should be in the release. You have a much better idea of what's going on and what's critical than I do at this point.

@jnothman (Member)

Actually, I'd love your input on what should be in the release. You have a much better idea of what's going on and what's critical than I do at this point.

I'm not sure about that. I've tried (somewhat unsuccessfully) to be particular about which threads I'm involved in. I've generally found that the algorithmic stuff progresses and gets merged without much input from me :)

This is already an enormous release. We just need to be careful to take advantage of model_selection for things like Label -> Group. We might be best checking with @raghavrv whether he thinks there were other forward-incompatible model_selection API changes worth adopting now. Residual things like finishing deprecation (#7134), better naming of new TimeSeriesCV, and either fixing or reverting Imputer.add_indicator_features are important. We should look for any other issues/PRs representing regression and broken-feature bugs and mark them as blockers. It would be nice to get max_n_classes in clustering deprecated because it's an inappropriate solution to a strange user problem, but frankly it can wait.

(The big API things we should work on for the next release are removing hard-coded exclusions from check_estimator... and multiple metrics support in grid search, which is the issue I began contributing to scikit-learn with :P)

@amueller (Member, Author)

I'd love to have the multiple metrics in. Is there a current PR? @raghavrv? I want to fix as many bugs as possible, and we should get the model selection stuff sorted out to a decent degree.

Do you want to remove the hard-coded exclusions in check_estimators for 0.18 or 0.19? That means implementing estimator tags, right?
I think we NEED to implement neg_mse (i.e. deprecate the mse scorer because it's confusing), and I think I'll work on making utils public. It would be nice to have the GMM stuff worked out because it's been broken for so long :-/

@jnothman (Member)

I know this isn't really the right place to discuss it, but I'm +1 for neg_mse and for utils clarification. +0 for GMM, mostly because I'm unfamiliar. -1 for check_estimators and multiple metrics in this release; they must not be rushed. I thought of something else for 0.19: get_feature_names, which requires consensus.

@jnothman added the Moderate (Anything that requires some knowledge of conventions and best practices) label and removed the Easy (Well-defined and straightforward way to resolve) label on Sep 19, 2016
@jnothman reopened this on Sep 19, 2016
@raghavrv (Member) commented Nov 7, 2016

@maniteja123 Any updates? :)

@maniteja123 (Contributor)

Hi @raghavrv, really sorry I couldn't work on this. Please do let me know if it is urgent; then it would be better for someone else who is interested to take it up. I would definitely like to get it right this time, but at the same time I can't commit to doing it in the next ten days. I just wanted to know whether option 3 or option 1 is the preferred one among these, so that I can keep working on it whenever there is some time. Thanks!

@jnothman (Member)

Proposed design:

We should create a separate transformer, perhaps MissingMarker or MissingIndicator, which just returns the missingness mask. At a minimum, it should work in the axis=0 case and mask only features where missing values are seen in training. It would have a sparse={True, False, 'auto' (default)} parameter, where 'auto' copies the input type. (This means that FeatureUnions with the MissingIndicator won't make everything sparse unless the original data is.)

This MissingIndicator can be unioned with any imputer, but also avoids adding much complexity to the existing Imputer implementation. We can also, perhaps, add an add_indicators option to Imputer just as a helper (and because @amueller thinks this should be on by default), which uses the MissingIndicator.

The downside is the duplication of work: the mask is computed in both Imputer and MissingIndicator. I don't think that's a big deal.
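
A minimal sketch of what such a transformer could look like (the class name and behavior follow the proposal above; this is not the final scikit-learn API, and the sparse parameter is omitted):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MissingIndicator(BaseEstimator, TransformerMixin):
    def __init__(self, missing_values=np.nan):
        self.missing_values = missing_values

    def _get_mask(self, X):
        X = np.asarray(X, dtype=float)
        if np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def fit(self, X, y=None):
        # Remember which features had missing values during training;
        # only those get indicator columns at transform time.
        self.features_ = np.flatnonzero(self._get_mask(X).any(axis=0))
        return self

    def transform(self, X):
        return self._get_mask(X)[:, self.features_].astype(int)
```

It could then be combined with any imputer, e.g. FeatureUnion([("impute", Imputer()), ("indicate", MissingIndicator())]).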

Thoughts? Contributors?

@amueller (Member, Author)

@jnothman I'm not entirely sure I understand the motivation for the design. You want to keep the complexity in the Imputer small?

I don't know about all the issues the previous implementation had.
As long as adding the features is easy to do and doesn't require a FeatureUnion for the inexperienced user, I'm ok with adding a class.
It seems like a simple thing but I guess it is not?

Have we made a decision on the semantics? We add indicators for features that were missing in training and optionally either error or ignore if other features are missing?
I think the natural thing would be to have features either for the ones missing in train, for all features, or for a specified subset of features (missing_indicators in ["none", "all", "train", indices/mask]).
We were also discussing the option of having a global "anything missing" feature.

I'm not sure what axis=1 should mean.

@jnothman (Member)

I think part of the issue in the previous design was trying to handle all the cases in which Imputer worked, and ideally to do so without duplicating the get_mask operation. Given the surprising complexity of the code, I think modularity will make it much easier to implement, test and maintain, for an O(mn) cost.

Otherwise, the advantage of a separate transformer is mostly that users of external imputers can use it too.

I don't mind any of those bonus features. But I can more easily conceive of their design and documentation in a separate MissingIndicator than integrated into Imputer.

And axis=1 currently means that the missing values are set to a statistic derived from the non-missing values in their row. I suspect that the MissingIndicator should work the same independently of axis.

@amueller (Member, Author)

Ok, I'm fine with an additional class.

@maniteja123 (Contributor)

Hi, it's been a long time. I was thinking of giving this a try. AFAIU, the transformer now just returns the indicator mask for the missing values and can be used with any imputer. I was hoping to get clarity on the semantics of the new class:

  • the meaning of "none" for missing_indicators
  • what this option means for axis=1
  • whether there should be indicator features for features with all missing values

Please let me know if I misunderstood anything. Thanks!

@maniteja123 (Contributor)

I have attempted an initial implementation here. Please do have a look at your convenience and give feedback. Thanks!

@raghavrv (Member)

Could you raise it as a PR? It will be easier to review that way. Thanks for working on this again!

@maniteja123 (Contributor)

Yeah sure, but it was a very basic implementation and I wasn't sure I could complete it, so I thought of not raising a PR in case someone else wanted to take it up. Sorry for the delay though.

@jnothman (Member)

Unless I've forgotten something, I think the MissingIndicator's operation should be independent of axis. I also don't see the need for a copy parameter.

We'll probably integrate this into Imputer via an option, but initially, yes, let's build it as a separate class. Please add tests, and submit a PR.

@jnothman (Member)

Rather, please submit a PR and add tests.

@pshell1 commented May 11, 2017

There's an easy way to know which features were included, though it's slightly different from the design you were discussing. In fact, the imputer used to do this (the version that came with Anaconda 2.5): it returned a second value containing a vector of the indices of the columns that remained. The transform method still has this info; if you just change the return statement to:
return X, valid_statistics_indexes

that will do the trick. My code that uses the imputer relies on that second value. I'm not sure why it was removed.

@jnothman (Member) commented May 11, 2017 via email

amueller pushed a commit that referenced this issue Jul 16, 2018
MissingIndicator transformer for the missing values indicator mask.
see #6556 

#### What does this implement/fix? Explain your changes.
The current implementation returns an indicator mask for the missing values.

#### Any other comments?
It is a very initial attempt, and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!

- [x] Implementation
- [x] Documentation
- [x] Tests