
add indicator features in imputer #6556


Closed · amueller opened this issue Mar 16, 2016 · 55 comments

Labels: Moderate (Anything that requires some knowledge of conventions and best practices)

@amueller (Member)

The imputer should have an option to add a binary indicator feature (for each possibly missing feature) that indicates whether a feature was imputed or not (0 or 1).
In applications, missingness is often informative.

@amueller added the New Feature, Easy (Well-defined and straightforward way to resolve), and Need Contributor labels on Mar 16, 2016
@GaelVaroquaux (Member) commented Mar 16, 2016 via email

@giorgiop (Contributor)

+1. I end up implementing it myself all the time.

@maniteja123 (Contributor)

Hi, if I may, I would like to work on this. If I understand it right, only in the case of axis=0, the binary indicator of size n_features is needed to indicate whether a feature was imputed. Please correct me if something else is expected here. Thanks.

@maniteja123 (Contributor)

@amueller Sorry to disturb you, but since @rvraghav93 has also mentioned this, it would be helpful if you could give some insight into what kind of approach is expected here. Please do reply whenever possible. Thanks!

@raghavrv (Member)

If I understand it right, only in the case of axis=0, the binary indicator of size n_features is needed to indicate whether a feature was imputed.

No. It is applicable for both axis=0 and axis=1. The axis specification is only used to derive the imputed value (i.e. to specify whether we want the mean over all values of that particular feature or over all feature values of that particular sample).

What we want here is to generate a simple is_imputed indicator matrix of shape (n_samples x n_features). is_imputed[i, j] will be 1 if X[i, j] was missing_values before imputation and 0 otherwise.

Whether we need this matrix to be returned or to be concatenated to the original matrix should be discussed. @amueller @GaelVaroquaux

But you can start your work assuming it has to be concatenated to the original matrix without waiting for the discussion, as changing it to the other approach will not require much work.

EDIT: Basically the is_imputed mask is the return value of this existing helper function.
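
For concreteness, a minimal sketch of how such a mask could be computed (the is_imputed_mask helper below is illustrative, not an existing scikit-learn function):

```python
import numpy as np

def is_imputed_mask(X, missing_values=np.nan):
    # 1 where X held missing_values before imputation, 0 otherwise.
    if np.isnan(missing_values):
        return np.isnan(X).astype(int)
    return (X == missing_values).astype(int)

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])
print(is_imputed_mask(X))
# [[0 1]
#  [1 0]]
```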

@maniteja123 (Contributor)

Thanks for the clarification! I had commented before looking more into imputation methods. I understand now that axis just specifies whether to impute along columns (features) or rows (samples). I have one doubt though: in the transform method the returned matrix will be of shape (n_samples x n_features_new), but the indicator matrix will be of shape (n_samples x n_features), right? So how does the concatenation approach work here? Sorry if I misunderstood what you meant.

@raghavrv (Member)

but the indicator matrix will be of shape (n_samples x n_features), right? So how does the concatenation approach work here? Sorry if I misunderstood what you meant.

No, you understood it correctly! So the returned X will have a shape of (n_samples x (2*n_features)). Whether or not this is acceptable needs to be discussed.

I.e. the new X is np.hstack((X, is_imputed)).

One more thing to discuss is whether we would want to interleave X and is_imputed instead of hstack-ing them.
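
A rough illustration of the two layouts under discussion (all names are illustrative):

```python
import numpy as np

X_imp = np.array([[1., 2.],
                  [3., 4.]])        # data after imputation
mask = np.array([[0., 1.],
                 [1., 0.]])         # is_imputed indicator

stacked = np.hstack((X_imp, mask))  # column order: f0, f1, m0, m1

interleaved = np.empty((2, 4))
interleaved[:, 0::2] = X_imp        # column order: f0, m0, f1, m1
interleaved[:, 1::2] = mask
```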

+1. I end up implementing it myself all the time.

@giorgiop Could you elaborate on your implementation, please?

@jnothman (Member)

I think what's missing here for @maniteja123 is motivation (it was missing for me too until @rvraghav93 described the expected output more clearly): we want the estimator that is learning from the imputed data to be able to exploit the knowledge that some values come with a higher degree of error than others.

It would be nice to see this used effectively in an example. But I'm not sure what sort of data would show its utility.

Btw, is there any value in only adding indicator columns containing at least one 1, rather than redundant empty features?

@raghavrv (Member)

For the example, the census income dataset (the original one with the missing values) is a good one. You can see how people conveniently choose not to answer some questions ;)

The other is a bankruptcy dataset quoted in a paper by Ding and Simonoff on handling missing values in decision trees. I haven't found a link to that dataset. Their claim is that in the dataset the probability of a bank going bankrupt is higher when the bank chooses not to reveal data for certain questions (features).

For the second question: assuming I understood correctly that you are asking about columns with no missing data, yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

@giorgiop (Contributor)

The motivation is the following: imputation implicitly discards the information that certain examples (rows) originally had missing values. Therefore, before imputation, we add a column for each feature that has at least one missing value. The additional column is 0 everywhere, except that it is 1 when the corresponding value is missing.

There is no need to double the size of the input in general. Columns should only be added for features that have missing values. This avoids redundancies.
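
A minimal sketch of that pruned variant, assuming NaN marks missing values (the helper name is illustrative):

```python
import numpy as np

def add_missing_indicators(X):
    mask = np.isnan(X)
    # Indicator columns only for features with at least one missing value.
    with_missing = np.flatnonzero(mask.any(axis=0))
    return np.hstack([X, mask[:, with_missing].astype(float)]), with_missing

X = np.array([[1., np.nan, 3.],
              [4., 5., 6.]])
X_new, cols = add_missing_indicators(X)
# X_new has 4 columns: the 3 originals plus one indicator for column 1;
# cols == array([1]) records which features got indicators.
```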

@jnothman (Member)

yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

I don't think imputed_features_ = [1, 2, 4, 7] is cumbersome to create, or to use! Also, get_feature_names can help.

@giorgiop (Contributor)

For the second question: assuming I understood correctly that you are asking about columns with no missing data, yes, you are right that they won't add any value. But if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...

True. That's a case where feature names (#6425) would be very useful.

@jnothman (Member)

That's a case where feature names (#6425) would be very useful.

Snap!

@raghavrv (Member)

That's an interesting issue I totally missed ;)

@giorgiop (Contributor)

Snap!

Must be a kind of Australian lunch-time deja-vu effect.

@jnothman (Member)

Must be a kind of Australian lunch-time deja-vu effect.

;)

@maniteja123 (Contributor)

Thanks all for the motivation! I don't have enough exposure to the practical aspects to take an initial stance. That's why I thought it would be prudent to continue only after taking your advice; that's how I learn from most of the discussions here. My apologies if I am slow in my understanding.

So, the idea suggested here is to append, for each feature that had at least one missing value, a column which is 1 where the value was imputed. This would mean the output will be of shape (n_samples x (n_features_new + n_features_with_missing)). If we have get_feature_names it would be easy to map to the original features in the transformed output.

If I understood this correctly, then what about the features that were removed during transform because they had all missing data and axis=0? Please do let me know if I got it right so that I can proceed. Thanks!

@raghavrv (Member)

Joel... : I think what's missing here for @maniteja123 is motivation

By "motivation", Joel refers to the motive of adding this feature. ;)

Maniteja123: (n_samples x (n_features_new + n_features_with_missing)).

Why the n_features_new? Are any of the feature columns dropped at any point?

@jnothman (Member)

By "motivation", Joel refers to the motive of adding this feature. ;)

Yes: motivation in an argumentative sense, not an emotional one!

@maniteja123 (Contributor)

Thanks @rvraghav93, I just wanted to mention that this is where I get to learn the practical implications and the effort taken to satisfy various use cases, and that is the reason I ask for your valuable input :)

And regarding the shape, the documentation here says

When axis=0, columns which only contained missing values at fit are discarded upon transform.

That does mean features will get dropped when axis=0 and a feature has all missing values, right?

@raghavrv (Member)

That does mean features will get dropped when axis=0 and a feature has all missing values, right?

Ah, I didn't know that. Well, in that case yes, you are right: it will indeed be (n_samples x (n_features_with_no_all_missing + n_features_with_partial_missing))! I think you can go ahead with this design and send a PR!

@maniteja123 (Contributor)

Thanks for the clarification. Will try implementing and get back with doubts!

And sorry, Joel, if I seemed emotional. I understand what you implied :).

@maniteja123 (Contributor)

Just a small doubt: at the end of the transform method, the filling is done as follows:

if self.axis == 0:
    # Transpose the mask and reverse the (row, col) index tuple so the
    # coordinates come out in column-major order (see discussion below).
    coordinates = np.where(mask.transpose())[::-1]
else:
    coordinates = mask
X[coordinates] = values

Wouldn't something like X[mask] = values suffice here? It is not relevant to what needs to be done, just out of curiosity.

Also, regarding the new shape of X: the first problem is that it would break existing user code if we change the shape of the output, right? And if it is indeed fine to change, what should be the approach for adding the indicator columns?

  • Right next to the imputed column? (This would help in mapping to the corresponding feature.)

  • Stacked after the transformed matrix? (This would need some additional property to know which features had partial NaNs (when axis=0) or NaNs (when axis=1).)

    Thanks.

@raghavrv (Member)

Wouldn't something like X[mask] = values suffice here?

Try working with this toy problem to understand why that won't work: Imputer(axis=0, missing_values=0).fit_transform([[0, 1, 1, 99], [1, 5, 0, 1], [5, 8, 8, 1], [8, 0, 99, 1]])
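
A rough sketch of what the toy problem reveals, reconstructed from the old Imputer's transform (variable names are illustrative): with axis=0 the fill values are built column by column, while boolean indexing with X[mask] assigns row by row, so the two orderings disagree.

```python
import numpy as np

X = np.array([[0., 1., 1., 99.],
              [1., 5., 0., 1.],
              [5., 8., 8., 1.],
              [8., 0., 99., 1.]])
mask = (X == 0)  # missing_values=0

# Column means over non-missing entries, repeated once per missing
# entry in that column -- i.e. in column-major order.
col_means = np.array([X[~mask[:, j], j].mean() for j in range(X.shape[1])])
values = np.repeat(col_means, mask.sum(axis=0))

# X[mask] = values would walk the mask in row-major (C) order and
# assign column 1's fill value to the missing entry in column 2 (and
# vice versa). Transposing the mask and reversing the index tuple
# yields column-major coordinates that line up with `values`:
coordinates = np.where(mask.T)[::-1]
X[coordinates] = values
```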

I would prefer stacking and adding an attribute as Joel suggested to indicate which columns had missing values originally.

@maniteja123 (Contributor)

Oh thanks, I understand now: there can be rows with no missing values, but that can't be the case for columns.
EDIT: I didn't understand it then, but got it now by toying with the example. Idiomatic numpy still gets the better of me :)

@maniteja123 (Contributor)

I have created a PR with minimal changes in #6607. I am not sure it is the best approach, but I did spend time verifying that it works as expected. Please have a look at your convenience and give your suggestions. Thanks.

@amueller (Member, Author)

Where are the bugs reported (sorry, I'm kinda lost these days)? So there are features newly missing in transform, for which I saw the discussion in #6767. That's also the design bug, I guess, and there's a separate bug in the sparse case?

@jnothman (Member)

I'd been trying to fix up #6782, but I'm going to step back and rewrite the imputer test suite.

@jnothman (Member)

The main bug is #6767. I've just noted the sparse edge case here. I think it mostly needs the code to be simplified. But my point is that this feature is not releasable, and we have to choose whether we wait for it or revert it.

@amueller (Member, Author)

Depends a bit on the timeline for the release. Actually, I'd love your input on what should be in the release. You have a much better idea of what's going on and what's critical than I do at this point.

@jnothman (Member)

Actually, I'd love your input on what should be in the release. You have a much better idea of what's going on and what's critical than I do at this point.

I'm not sure about that. I've tried (somewhat unsuccessfully) to be particular about which threads I'm involved in. I've generally found that the algorithmic stuff progresses and gets merged without much input from me :)

This is already an enormous release. We just need to be careful to take advantage of model_selection for things like Label -> Group. We might be best checking with @raghavrv whether he thinks there were other forward-incompatible model_selection API changes worth adopting now. Residual things like finishing deprecation (#7134), better naming of new TimeSeriesCV, and either fixing or reverting Imputer.add_indicator_features are important. We should look for any other issues/PRs representing regression and broken-feature bugs and mark them as blockers. It would be nice to get max_n_classes in clustering deprecated because it's an inappropriate solution to a strange user problem, but frankly it can wait.

(The big API things we should work on for the next release are removing hard-coded exclusions from check_estimator... and multiple metrics support in grid search, which is the issue I began contributing to scikit-learn with :P)

@amueller (Member, Author)

I'd love to have the multiple metrics in. Is there a current PR? @raghavrv? I want to fix as many bugs as possible, and we should get the model selection stuff sorted out to a decent degree.

Do you want to remove the hard-coded exclusions in check_estimators for 0.18 or 0.19? That means implementing estimator tags, right?
I think we NEED to implement neg_mse (i.e. deprecate the mse scorer because it's confusing), and I think I'll work on making utils public. It would be nice to have the GMM stuff worked out because it's been broken for so long :-/

@jnothman (Member)

I know this isn't really the right place to discuss it, but I'm +1 for neg_mse and for utils clarification. +0 for GMM, mostly because I'm unfamiliar. -1 for check_estimators and multiple metrics in this release; they must not be rushed. I thought of something else for 0.19: get_feature_names, which requires consensus.

@jnothman added the Moderate (Anything that requires some knowledge of conventions and best practices) label and removed the Easy (Well-defined and straightforward way to resolve) label on Sep 19, 2016
@jnothman reopened this on Sep 19, 2016
@raghavrv (Member) commented Nov 7, 2016

@maniteja123 Any updates? :)

@maniteja123 (Contributor)

Hi @raghavrv, really sorry I couldn't work on this. Please do let me know if it is urgent; then it would be better for someone else who is interested to take it up. I would definitely like to get it right this time, but at the same time I can't commit to doing it in the next ten days. I just wanted to know whether option 3 or option 1 is the preferred one among these, so that I can keep working on it whenever there is some time. Thanks!

@jnothman (Member)

Proposed design:

We should create a separate transformer, perhaps MissingMarker or MissingIndicator, which just returns the missingness mask. At a minimum, it should work in the axis=0 case and mask only features where missing values are seen in training. It would have a sparse={True, False, 'auto' (default)} parameter, where 'auto' copies the input type. (This means that FeatureUnions with the MissingIndicator won't make everything sparse unless the original data is.)

This MissingIndicator can be unioned with any imputer, but also avoids adding much complexity to the existing Imputer implementation. We can also, perhaps, add an add_indicators option to Imputer just as a helper (and because @amueller thinks this should be on by default), which uses the MissingIndicator.

The downside is the duplication of work: the mask is computed in both Imputer and MissingIndicator. I don't think that's a big deal.
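
A minimal sketch of what such a transformer could look like (the class name and behavior follow the proposal above; this is not the final scikit-learn API, and the sparse parameter is omitted):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MissingIndicator(BaseEstimator, TransformerMixin):
    def __init__(self, missing_values=np.nan):
        self.missing_values = missing_values

    def _get_mask(self, X):
        X = np.asarray(X, dtype=float)
        if np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def fit(self, X, y=None):
        # Remember which features had missing values during training;
        # only those get indicator columns at transform time.
        self.features_ = np.flatnonzero(self._get_mask(X).any(axis=0))
        return self

    def transform(self, X):
        return self._get_mask(X)[:, self.features_].astype(int)
```

It could then be combined with any imputer, e.g. FeatureUnion([("impute", Imputer()), ("indicate", MissingIndicator())]).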

Thoughts? Contributors?

@amueller (Member, Author)

@jnothman I'm not entirely sure I understand the motivation for the design. You want to keep the complexity in the Imputer small?

I don't know about all the issues the previous implementation had.
As long as adding the features is easy to do and doesn't require a FeatureUnion for the inexperienced user, I'm ok with adding a class.
It seems like a simple thing but I guess it is not?

Have we made a decision on the semantics? We add indicators for features that were missing in training and optionally either error or ignore if other features are missing?
I think the natural thing would be to have features either for the ones missing in train, for all features, or for a specified subset of features (missing_indicators in ["none", "all", "train", indices/mask]).
We were also discussing the option of having a global "anything missing" feature.

I'm not sure what axis=1 should mean.

@jnothman (Member)

I think part of the issue in the previous design was trying to handle all the cases in which Imputer worked, and ideally to do so without duplicating the get_mask operation. Given the surprising complexity of the code, I think modularity will make it much easier to implement, test and maintain, for an O(mn) cost.

Otherwise, the advantage of a separate transformer is mostly that users of external imputers can use it too.

I don't mind any of those bonus features. But I can more easily conceive of their design and documentation in a separate MissingIndicator than integrated into Imputer.

And axis=1 currently means that the missing values are set to a statistic derived from the non-missing values in their row. I suspect that the MissingIndicator should work the same independently of axis.

@amueller (Member, Author)

Ok, I'm fine with an additional class.

@maniteja123 (Contributor)

Hi, it's been a long time. I was thinking of giving this a try. AFAIU, the transformer now just returns the indicator mask for the missing values and can be used with any imputer. I was hoping to get clarity on the semantics of the new class:

  • the meaning of "none" for missing_indicators
  • what this option means for axis=1
  • whether there should be indicator features for features with all missing values

Please let me know if I misunderstood anything. Thanks!

@maniteja123 (Contributor)

I have attempted an initial implementation here. Please do have a look at your convenience and give feedback. Thanks!

@raghavrv (Member)

Could you raise it as a PR? It will be easier to review that way. Thanks for working on this again!

@maniteja123 (Contributor)

Yeah sure, but it was a very basic implementation and I wasn't sure I could complete it, so I thought of not raising a PR in case someone else wanted to take it up. Sorry for the delay though.

@jnothman (Member)

Unless I've forgotten something, I think the MissingIndicator's operation should be independent of axis. I also don't see the need for a copy parameter.

We'll probably integrate this into Imputer via an option, but initially, yes, let's build it as a separate class. Please add tests, and submit a PR.

@jnothman (Member)

Rather, please submit a PR and add tests.

@pshell1 commented May 11, 2017

There's an easy way to know which features were included, though it's slightly different from the design you were discussing. In fact, the imputer used to do this (the version that came with Anaconda 2.5): it returned a second value containing a vector of the indices of the columns that remained. The transform method still has this info; if you just change the return statement to:
return X, valid_statistics_indexes

that will do the trick. My code that uses the imputer relies on that second value. I'm not sure why it was removed.

@jnothman (Member) commented May 11, 2017 via email

amueller pushed a commit that referenced this issue Jul 16, 2018
MissingIndicator transformer for the missing values indicator mask.
see #6556 

#### What does this implement/fix? Explain your changes.
The current implementation returns an indicator mask for the missing values.

#### Any other comments?
It is a very initial attempt, and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!

- [x] Implementation
- [x] Documentation
- [x] Tests