add indicator features in imputer #6556
What kind of API do you have in mind?
+1. I end up implementing it myself all the time.
Hi, if I may, I would like to work on this. If I understand it right, this applies only for one of the `axis` settings?
@amueller Sorry for disturbing you, but since @rvraghav93 has also mentioned this, it would be helpful if you could give some insight into what kind of approach is expected here. Please do reply whenever possible. Thanks!
No. It is applicable for both `axis` settings. What we want here is to generate a simple binary indicator matrix marking the missing entries. Whether we need this matrix to be returned or to be concatenated to the original matrix should be discussed. @amueller @GaelVaroquaux But you can start your work assuming it has to be concatenated to the original matrix without waiting for discussions, as changing it to the other way will not require much work. EDIT: Basically, the indicator matrix records which values were imputed.
Thanks for the clarification! I had commented that before I looked more into imputation methods. I understand now that `axis` just specifies whether to impute along columns (features) or rows (samples). I have one doubt, though, about what the returned matrix should look like.
No, you understood it correctly! So the returned matrix would be the imputed `X` with the indicator columns concatenated, i.e. the new `X` is `[X_imputed, indicator]`. One more thing to discuss would be whether we would want to interleave the indicator columns with the original feature columns rather than appending them at the end.
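A small sketch may help picture the two layouts under discussion. This is purely illustrative: the arrays, shapes, and column orderings are assumptions, not taken from any implementation.

```python
import numpy as np

# Illustrative only: two ways the indicator columns could be laid out.
X_imp = np.array([[1.0, 2.0],
                  [3.0, 4.0]])   # imputed feature matrix (2 features)
ind = np.array([[0, 1],
                [1, 0]])         # per-feature missingness indicators

# Appended at the end: [f1, f2, i1, i2]
appended = np.hstack([X_imp, ind])

# Interleaved with the features: [f1, i1, f2, i2]
interleaved = np.empty((2, 4))
interleaved[:, 0::2] = X_imp
interleaved[:, 1::2] = ind

print(appended[0])     # [1. 2. 0. 1.]
print(interleaved[0])  # [1. 0. 2. 1.]
```

Appending keeps the original feature indices intact, which matters for any downstream code that addresses columns by position.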
@giorgiop Could you elaborate on your implementation, please?
I think what's missing here for @maniteja123 is motivation (it was missing for me too until @rvraghav93 described the expected output more clearly): we want the estimator that is learning from the imputed data to be able to exploit the knowledge that some values come with a higher degree of error than others. It would be nice to see this used effectively in an example, but I'm not sure what sort of data would show its utility. Btw, is there any value in only adding indicator columns containing at least one 1, rather than redundant empty features?
For the example, the census income dataset (the original one with the missing values) is a good one. You can see how people conveniently choose not to answer some questions ;) The other one is a bankruptcy dataset which was cited in a paper by Ding and Simonoff on handling missing values in decision trees; I haven't found a link to that dataset. Their claim is that in the dataset the probability of a bank going bankrupt is higher when the bank chooses not to reveal data for certain questions (features). For the second question, assuming I understood correctly that your question is about columns with no missing data: yes, you are right that it won't add any value, but if we choose to ignore such columns in the indicator matrix, I'm afraid it would make it difficult for the user to match a column in the indicator matrix with the corresponding column in the feature matrix...
The motivation is the following: imputation implicitly discards the information that certain examples (rows) originally had missing values. Therefore, before imputation, we add a column for each feature that has at least one missing value. The additional column is 0 everywhere, except that it is 1 where the corresponding value is missing. There is no need to double the size of the input in general: columns should only be concatenated for features that have missing values. This avoids redundancy.
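The behaviour described above can be sketched in a few lines of NumPy. This is an illustration of the proposal, not scikit-learn's implementation, and the mean-imputation step is just a stand-in:

```python
import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

mask = np.isnan(X)                              # where values are missing
has_missing = mask.any(axis=0)                  # features with >= 1 missing value
indicator = mask[:, has_missing].astype(float)  # no redundant empty columns

# Stand-in imputation: fill missing entries with the column means.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(mask, col_means, X)

X_out = np.hstack([X_imputed, indicator])
print(X_out.shape)  # (3, 5): 3 original features + 2 indicator columns
```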
I don't think matching the indicator columns back to the original features will be easy in that case, though.
True. That's a case where feature names (#6425) would be very useful.
Snap!
That's an interesting issue I totally missed ;)
Must be a kind of Australian lunch-time deja-vu effect.
;)
Thanks all for the motivation! I don't have enough exposure to the practical aspects to take an initial stance; that's why I thought it prudent to continue only after taking your advice, and that's how I learn from most of the discussions here. My apologies if I am slow in my understanding. So, the idea suggested here is to append a column for each feature that had at least one missing value, containing 1 where the value was imputed. This would mean the output will be of shape `(n_samples, n_features + n_features_with_missing)`. If I understood this correctly, then what about the features that get removed during imputation?
By "motivation", Joel refers to the motive of adding this feature. ;)
Why the `n_features + n_features_with_missing` shape?
Yes: motivation in an argumentative sense, not an emotional one!
Thanks @rvraghav93, I just wanted to mention that this is where I get to learn the practical implications and the efforts taken to satisfy various use cases, and that is the reason I ask for your valuable input :) And regarding the shape, the documentation says that columns containing only missing values are discarded when transforming.
That does mean features will get dropped when all their values are missing, right?
Ah, I didn't know that. Well, in that case, yes, you are right: it will indeed be the number of retained features plus the number of indicator columns.
Thanks for the clarification. I will try implementing it and get back with doubts! And sorry, Joel, if I seemed emotional; I understand what you implied :).
Just a small doubt: in the `transform` method, at the end, the filling is done in place using the fitted statistics. Wouldn't something like computing the missing mask from the transformed output work just as well? Also, regarding the new shape of `X`: the first problem is that it would break existing user code if we change the shape of the output, right? And if it is indeed fine to change it, what should be the approach for adding the indicator columns?
Try working with a toy problem to understand why that won't work. I would prefer stacking, and adding an attribute as Joel suggested, to indicate which columns had missing values originally.
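One such toy problem (my own illustration, not necessarily the one this comment had in mind): once the NaNs have been filled in, the transformed matrix no longer tells you which entries were missing, so the mask has to be computed from the input before imputation.

```python
import numpy as np

X = np.array([[np.nan, 1.0],
              [2.0, np.nan]])

mask_before = np.isnan(X)     # True exactly where values were missing
X_imputed = np.where(mask_before, np.nanmean(X, axis=0), X)

mask_after = np.isnan(X_imputed)
print(mask_after.any())       # False: the missingness information is gone
```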
Oh thanks, I understand now.
I have created a PR with minimal changes in #6607. I am not sure it is the best approach, but I did spend time checking that it works as expected. Please have a look at your convenience and give your suggestions. Thanks.
Where are the bugs reported (sorry, I'm kinda lost these days)? So there are new missing-value indicator features in the `Imputer` now?
Having tried to fix up #6782, I'm going to step back and rewrite the imputer test suite.
The main bug is #6767. The edge case with sparse input I've just noted here. I think it mostly needs the code to be simplified. But my point is that this feature is not releasable, and we have to choose whether we wait for it or revert it.
Depends a bit on the timeline for the release. Actually, I'd love your input on what should be in the release; you have a much better idea of what's going on and what's critical than I do at this point.
I'm not sure about that. I've tried (somewhat unsuccessfully) to be particular about which threads I'm involved in. I've generally found that the algorithmic stuff progresses and gets merged without much input from me :) This is already an enormous release; we just need to be careful to take advantage of that. (The big API things we should work on for the next release are removing the hard-coded exclusions from `check_estimator`.)
I'd love to have the multiple metrics in. Is there a current PR, @raghavrv? I want to fix as many bugs as possible, and we should get the model selection stuff sorted out to a decent degree. Do you want to remove the hard-coded exclusions in `check_estimator`?
I know this isn't really the right place to discuss it, but I'm +1 for neg_mse and for util clarification, +0 for GMM mostly because I'm unfamiliar, and -1 for check_estimators and multiple metrics in this release: they must not be rushed. I've thought of something else for 0.19 as well.
@maniteja123 Any updates? :)
Hi @raghavrv, really sorry, I couldn't work on this. Please do let me know if it is urgent; in that case it would be better if someone else who is interested takes it up. I would definitely like to get it right this time, but at the same time I can't commit to doing it in the next ten days. I just wanted to know whether option 3 is the preferred one among these, or option 1, so that I can keep working on it whenever there is some time. Thanks!
Proposed design: we should create a separate `MissingIndicator` transformer. This transformer would return the binary indicator mask of the missing values, with an attribute recording which input features the mask columns correspond to. The downside is the duplication of work: the mask is identified both in the imputer and in the new transformer. Thoughts? Contributors?
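A minimal sketch of such a transformer might look as follows. This is a hypothetical illustration of the proposed design, not scikit-learn's eventual implementation; the class name, the `features` parameter, and the `features_` attribute are all assumptions.

```python
import numpy as np

class MissingIndicatorSketch:
    """Hypothetical sketch: return a binary mask of missing values."""

    def __init__(self, missing_values=np.nan, features="train"):
        self.missing_values = missing_values
        # "train": only report features that had missing values at fit time
        # "all": report every feature
        self.features = features

    def _mask(self, X):
        if isinstance(self.missing_values, float) and np.isnan(self.missing_values):
            return np.isnan(X)
        return X == self.missing_values

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # remember which features had missing values during fit
        self.features_ = np.flatnonzero(self._mask(X).any(axis=0))
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        mask = self._mask(X)
        if self.features == "train":
            mask = mask[:, self.features_]
        return mask.astype(int)

ind = MissingIndicatorSketch().fit([[np.nan, 1.0], [2.0, 3.0]])
print(ind.transform([[5.0, np.nan]]))  # [[0]] -- only column 0 is tracked
```

As noted above, an imputer used alongside this would recompute the same mask internally; avoiding that duplication would require a combined API.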
@jnothman I'm not entirely sure I understand the motivation for the design. You want to keep the complexity out of the `Imputer` class? I don't know about all the issues the previous implementation had. Have we made a decision on the semantics? We add indicators for features that were missing in training, and optionally either error or ignore if other features are missing? I'm not sure what the default behaviour should be.
I think part of the issue in the previous design was trying to handle all the cases in which the imputer drops features. Otherwise, the advantage of a separate transformer is mostly that users of external imputers can use it too. I don't mind any of those bonus features, but I can more easily conceive of their design and documentation in a separate MissingIndicator than integrated into Imputer. And it should keep the Imputer implementation simpler.
Ok, I'm fine with an additional class.
Hi, it's been a long time. I was thinking of giving this a try. AFAIU, the transformer is now just the indicator mask for the missing values and can be used with any imputer. I was hoping to get clarity on the semantics of the new class.
Please let me know if I misunderstood anything. Thanks!
I have attempted an initial implementation here. Please have a look at your convenience and give feedback. Thanks!
Could you raise it as a PR? It will be easier to review that way. Thanks for working on this again!
Yeah sure, but it was a very basic implementation and I wasn't sure I could complete it, so I thought of not raising a PR in case someone else wanted to take it up. Sorry for the delay, though.
Unless I've forgotten something, I think the separate transformer is the right way to go. We'll probably integrate this into the imputer as an option as well, eventually.
Rather, please submit a PR and add tests.
There's an easy way to know which features were included, though it's slightly different from the design you were discussing. In fact, the imputer used to do this (the version that came with Anaconda 2.5): it returned a second value containing a vector of the indices of the columns that remained. The transform method still has this info; if you just change the return statement to `return X, valid_statistics_indexes`, that will do the trick. My code which uses the imputer uses that second value. I'm not sure why it was removed.
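The idea can be illustrated with plain NumPy (a sketch of the behaviour described, not the actual `Imputer` code; note that NumPy emits a RuntimeWarning when taking the mean of an all-missing column):

```python
import numpy as np

X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, np.nan]])

# Column means; the all-missing column yields NaN (with a RuntimeWarning).
statistics = np.nanmean(X, axis=0)

# Indices of the columns that survive imputation.
valid = np.flatnonzero(~np.isnan(statistics))

X_kept = X[:, valid]
X_filled = np.where(np.isnan(X_kept), statistics[valid], X_kept)
print(valid)  # [0 2] -- column 1 was dropped entirely
```

Returning `valid` alongside the imputed matrix is what lets a caller map output columns back to input features.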
Which features were included is not the concern here. But I'm quite surprised about that statement regarding the Anaconda release.
MissingIndicator transformer for the missing values indicator mask (see #6556).

#### What does this implement/fix? Explain your changes.
The current implementation returns an indicator mask for the missing values.

#### Any other comments?
It is a very initial attempt and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!

- [x] Implementation
- [x] Documentation
- [x] Tests
The imputer should have an option to add a binary indicator feature (for each possibly missing feature) that indicates whether the value was imputed (0 or 1).
In applications, missingness is often informative.