FIX: Imputer fix for indicator_matrix in add_indicator_features option #6782
Conversation
@@ -474,11 +474,11 @@ then the shape of the transformed input is
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values='NaN',
        strategy='mean', verbose=0)
>>> print(imp_with_in.transform(X))  # doctest: +ELLIPSIS
[[ 4.          2.          1.          0.        ]
 [ 6.          3.66666667  0.          1.        ]
[[ 4.          2.          0.          0.        ]
What??? No, the indicator features need to correspond to the `transform` input. The `X.shape[0]` does not need to be the same for `fit` and `transform`.
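For readers following along, a minimal sketch of the point being made, written against today's `sklearn.impute.MissingIndicator` (an assumption on my part; that class postdates this PR). The row count tracks whatever is passed to `transform`, while the indicator columns are fixed at `fit` time:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X_fit = np.array([[1., np.nan],
                  [3., 4.],
                  [np.nan, 6.]])   # 3 rows used for fitting
X_new = np.array([[np.nan, 2.]])   # 1 row passed to transform

ind = MissingIndicator().fit(X_fit)
mask = ind.transform(X_new)
print(mask.shape)  # (1, 2): rows follow X_new; columns were fixed at fit
```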
Clearly, you were right to ask questions because you've still missed the point. Have you used scikit-learn's transformers? Do you understand their API design? Do you understand what this `add_indicator_features` is for?
Yes, we made an error in reviewing the first PR, but not one as glaring as to get this wrong!
Apologies!! This is really wrong. I accidentally pushed the wrong commit from my other machine. I totally understand the API design and that the fix is just about the shape of the transformed output being consistent for different inputs. I will revert everything correctly.
But one final question, please: now the indicator features will have a column for each feature, regardless of whether that feature has missing values in the current `transform` input. But `imputed_features_` will only contain features with at least one missing value. The reason is that I can't do `X[:, expected_imputer_features]`, because `imputed_features` won't have that feature while the indicator features still will. This is the only thing I suppose I didn't ask about.
Sorry that I don't know exactly how these indicator features help, but I understand that they provide information about the missing data and thus can help improve accuracy. In case you feel I am still not getting it right, please do let me know if it would be better for someone else to do this.
I'd like you to have a go, but showing what you've got early - and writing tests to show your understanding - helps make explicit checkpoints.
I was thinking a couple of hours ago about the case where a feature has missing values at `transform` time but not at `fit`. I do think this is a possible case, but I also think it's silly to add an indicator column for each of thousands of features not requiring imputation. We could actually allow the user to specify which features may require imputation (and hence both statistics and indicator matrix), with a default to infer them from the training data. What do you think?
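The trade-off sketched here is close to the design that later shipped in scikit-learn as `sklearn.impute.MissingIndicator`, whose `features` parameter chooses between inferring the relevant features from the training data and covering every feature (a per-feature index list was not part of that final API):

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1., 2., np.nan],
              [4., np.nan, 6.]])

# Default: indicator columns only for features with missing values at fit.
only_missing = MissingIndicator(features="missing-only").fit_transform(X)
print(only_missing.shape)  # (2, 2): columns 1 and 2 had missing values

# Alternative: one indicator column per input feature, missing or not.
everything = MissingIndicator(features="all").fit_transform(X)
print(everything.shape)    # (2, 3)
```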
Thank you, will definitely do that.
The idea of the user specifying seems good, but would it always be the case that the user is aware of the features with missing data in the whole training data? And even if they are, will it be consistent for all the samples?
Also, if we proceed to add indicator features for every feature irrespective of imputation, something like a variance threshold can remove the indicator features that correspond to features with no missing values, right? Or maybe give an option specifically for the indicator features to cover either all features or just the imputed features?
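The variance-threshold idea maps directly onto `sklearn.feature_selection.VarianceThreshold`; a small sketch with made-up data, pruning an all-zero indicator column:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical indicator block; the third column is constant (all zeros),
# as it would be for a feature that never had a missing value.
indicators = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [1, 1, 0]])

# threshold=0.0 (the default) drops zero-variance columns.
pruned = VarianceThreshold().fit_transform(indicators)
print(pruned.shape)  # (3, 2)
```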
Under shuffled cross-validation, you can't always be sure what's in your training set and what's in your test set. That's why allowing the user to specify features to Imputer may be important. However, to take back what I said, the indicator features won't be any use if they're not seen in training, so it's probably valid not to add a parameter to specify which features.
Yeah, I get the point of not having it, since it should be the case that you know what features are missing in the training and testing data. So is it better to proceed with adding indicator features for all the features?
Sorry, the comments went to the outdated diff. I hope to have at least done it somewhat in the correct direction. The tests will fail right now because [...]. Also there is a possibility of your case when the input to [...]
No, I think it's better to proceed as we were: indicator features only for [...]
If there is no missing value at fit time, but there is at transform, do we [...]
*to learn its meaning from [...]
The statistics are calculated at fit time, except for features with all missing values when axis=0. So if a feature has a missing value in transform, it does get imputed. But if I am not mistaken, depending on the input to transform, the shape of the indicator feature matrix will again vary if we go with something-but-not-everything.
Then I'm sure you're mistaken. Give me an example; a test case?
I'm not sure what you mean by your "except for", either.
Sorry, I meant it will be [...]
The golden rule that every call to transform must output the same number of features says you must be wrong. No, we add indicator features according to missing values at fit time.
Sorry, I won't bother you more. I will try to show an example of what I understand here.
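The example itself did not survive in this export. A reconstruction of the kind of check under discussion, written against today's `SimpleImputer(add_indicator=True)` (an assumption; this PR's `Imputer` spelled the option `add_indicator_features`):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X1 = np.array([[1., 2., np.nan],
               [4., np.nan, 6.],
               [7., 8., 9.]])
X2 = np.array([[np.nan, 2., 3.],
               [4., 5., 6.]])

imp = SimpleImputer(strategy="mean", add_indicator=True).fit(X1)

# Columns 1 and 2 had missing values at fit time, so every transform output
# has 3 imputed features + 2 indicator columns, whatever input it receives.
print(imp.transform(X1).shape)  # (3, 5)
print(imp.transform(X2).shape)  # (2, 5): X2's NaN in column 0 is imputed
                                # but gains no indicator column of its own
```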
Is this the expected imputation, since it depends on the missing values at fit time?
Yes, that seems to be right.
But if you look at the transformed output of X2, then the feature with no missing values (feature 3) also has an indicator feature, since it has a missing value in X1, the input for `fit`. And similarly for the second case: the transformed output for X1 has no indicator feature for feature 3, though it has a missing value, since X2 (the input for `fit`) does not. Is that okay, since we wanted only features with some missing values (but not everything)?
Yes, I think that's alright.
Thanks for clarifying. Will do the changes and push them in a short while once I get back home.
Hi, sorry for pinging, but I have done the changes to the best of my understanding. Please have a look to see whether it is the expected behavior. Thanks.
@@ -114,7 +114,8 @@ class Imputer(BaseEstimator, TransformerMixin):
        The imputation fill value for each feature if axis == 0.

    imputed_features_ : array of shape (n_features_with_missing, )
        The input features which have been imputed during transform.
        This attribute is available only when ``axis=0``.
Can you put this after the following sentence, please?
Sorry, the pep8 on my system was old, which I failed to observe. I have updated it and fixed almost everything except some lines with more than 79 characters. Right now, the shape of the output should be independent of the input shape at transform and depend on fit only. Please have a look at your convenience and let me know if there is anything else to be done. Thanks.
we do generally adhere to that requirement
dtype=X.dtype)
inv_mask_matrix.eliminate_zeros()  # removes explicit False entries
all_missing_ = inv_mask_matrix.sum(axis=0).A.nonzero()[1]
self.mask_matrix_ = self.mask_matrix_[:, self.imputed_features_]
we should not be setting any attributes at transform time
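A toy sketch (hypothetical class and attribute names) of the convention being invoked here: trailing-underscore attributes are learned state, assigned in `fit` only, while `transform` keeps its per-call computations local:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class IndicatorSketch(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learned state: which columns had any missing values at fit time.
        self.features_with_missing_ = np.flatnonzero(np.isnan(X).any(axis=0))
        return self

    def transform(self, X):
        # Per-call computation stays local; nothing is assigned to self here.
        mask = np.isnan(X)[:, self.features_with_missing_]
        return mask.astype(float)
```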
@@ -355,6 +355,8 @@ def _sparse_transform(self, X, statistics):

    if self.axis == 0:
        all_missing = np.where(mask_matrix.sum(axis=0) == X.shape[0])[0]
        if isinstance(all_missing, np.matrix):
Please comment that this is for old scipy
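Roughly what the requested comment might look like (the body of the `if` is cut off in this export, so the `.A1` flattening below is an assumption about the intent):

```python
# On old scipy versions, summing a sparse matrix returns an np.matrix,
# and np.where may then hand back a matrix rather than a 1-D array;
# flatten it so it can be used for plain indexing.
if isinstance(all_missing, np.matrix):
    all_missing = all_missing.A1
```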
Thanks @maniteja123. Apart from not being sure that it's sufficiently documented, this is looking pretty close to what we specified. Even so (sorry @mfeurer), I continue to be undecided on whether it's the correct approach. I'm now wondering, for instance: why, if we're only adding indicators for features with training-time missing values when axis=0, are we not doing the same for axis=1? Alternatively, we could just default to including all features in the indicator matrix, unless specified by providing a list of feature indices like [...]
@maniteja123, I'd like to make some changes to this myself, in part because I keep changing the spec, in part because I'm a little anxious to make sure this makes it into 0.18 (or else IMO we'll need to revert [...])
Sure Joel, please do take over the work in case you wish to. I understand that this needs to come to a conclusive state before the release. Sorry for being unable to complete it in time. Do let me know if I can do anything to assist.
Any updates on this? I need this for my benchmarks...
features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing "
              "values in fit but have non-missing values"
              "in transform " % features, RuntimeWarning,
space before in
features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing"
              " values in fit but have non-missing values in"
              "transform" % features, RuntimeWarning,
space before transform
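Both nitpicks come down to the same Python detail: adjacent string literals concatenate with no implicit space, so each fragment must carry its own separator. A corrected version of the warning, with a placeholder `features` value:

```python
import warnings

features = [2, 5]  # placeholder for illustration
warnings.warn("The features %s have all missing"
              " values in fit but have non-missing values"
              " in transform" % features, RuntimeWarning)
```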
This wasn't the rewrite?
My bad, this simply tries to address the issue raised by @mfeurer... Should we close this, then?
If this is a patch against the old implementation, which I think it is, then yes. @maniteja123 can you confirm?
Indeed it is.
Hi, yeah, it is based on the old implementation. I suppose it can be closed. Sorry for keeping it open.
Well, it can be reopened and fixed up, or a new PR opened. It's fine for it to [...]
@maniteja123 Are you up for it?
Hi @raghavrv, I would be happy to work on it.
Thanks, please ping me for a review when you have a PR :)
Reference Issue
Fixes #6767
What does this implement/fix? Explain your changes.
`Imputer`'s `add_indicator_features` depends on the `fit` input for the indicator matrix and should be irrespective of the `transform` input.
Any other comments?
I have made many changes in the code, but the tests are passing. So please let me know if I changed something incorrectly.