FIX: Imputer fix for indicator_matrix in add_indicator_features option by maniteja123 · Pull Request #6782 · scikit-learn/scikit-learn · GitHub

FIX: Imputer fix for indicator_matrix in add_indicator_features option #6782


Closed
wants to merge 5 commits

Conversation

maniteja123
Contributor

Reference Issue

Fixes #6767

What does this implement/fix? Explain your changes.

The indicator matrix produced by Imputer's add_indicator_features option should depend on the fit input and be independent of the transform input.

Any other comments?

I have made many changes to the code, but the tests are passing. Please let me know if I changed something incorrectly.
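To make the intended contract concrete, here is a minimal NumPy sketch (my own illustration, not the patch itself), assuming missing values are encoded as -1, dense input and axis=0; the fit-time column selection mirrors what the patch exposes as imputed_features_:

import numpy as np

MISSING = -1

def fit_imputed_features(X_fit):
    # Features with at least one, but not all, values missing at fit time;
    # these fix the number of indicator columns for every later transform.
    mask = (X_fit == MISSING)
    n_missing = mask.sum(axis=0)
    return np.where((n_missing > 0) & (n_missing < X_fit.shape[0]))[0]

def indicator_matrix(X_new, imputed_features):
    # Indicator columns are taken only for the fit-time features, so the
    # width of the output does not depend on where X_new has missing values.
    return (X_new[:, imputed_features] == MISSING).astype(float)

X_fit = np.array([[-1,  2,  3],
                  [ 4,  6, -1],
                  [ 8, 10, 11]])
X_new = np.array([[ 1, -1,  3],
                  [ 4,  6,  5]])             # missing value in a different feature

cols = fit_imputed_features(X_fit)           # array([0, 2])
print(indicator_matrix(X_fit, cols).shape)   # (3, 2)
print(indicator_matrix(X_new, cols).shape)   # (2, 2) -- same number of columns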

@maniteja123 maniteja123 changed the title Imputer fix Imputer fix for indicator_matrix in add_indicator_features option May 14, 2016
@@ -474,11 +474,11 @@ then the shape of the transformed input is
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values='NaN',
strategy='mean', verbose=0)
>>> print(imp_with_in.transform(X)) # doctest: +ELLIPSIS
[[ 4. 2. 1. 0. ]
[ 6. 3.66666667 0. 1. ]
[[ 4. 2. 0. 0. ]
Member

What??? No, the indicator features need to correspond to the transform input. The X.shape[0] does not need to be the same for fit and transform.

@jnothman jnothman May 14, 2016
Member

Clearly, you were right to ask questions because you've still missed the point. Have you used scikit-learn's transformers? Do you understand their API design? Do you understand what this add_indicator_features is for?

Yes, we made an error in reviewing the first PR, but not one as glaring as to get this wrong!

Contributor Author

Apologies!! This is really wrong. I accidentally pushed the wrong commit from my other machine. I totally understand the API design and that the fix is just about the shape of the transformed output being consistent for different inputs. I will revert everything correctly.

But one final question, please: right now the indicator matrix has a column for each feature, regardless of whether that feature has missing values in the current transform input, while imputed_features_ contains only the features with at least one missing value. The reason is that I can't do X[:, expected_imputer_features], because imputed_features_ won't contain such a feature while the indicator matrix will still have a column for it. This is the only thing I suppose I didn't ask about.

Sorry, I don't know exactly how these indicator features help, but I understand that they provide information about the missing data and thus can help improve accuracy. If you feel I am still not getting it right, please let me know whether it would be better for someone else to do this.

Member

I'd like you to have a go, but showing what you've got early - and writing tests to show your understanding - helps make explicit checkpoints.

I was thinking a couple of hours ago about the case where a feature has missing values at transform time but not at fit. I do think this is a possible case, but I also think it's silly to add an indicator column for each of thousands of features not requiring imputation. We could actually allow the user to specify which features may require imputation (and hence both statistics and indicator matrix), with a default to infer them from the training data. What do you think?

Contributor Author

Thank you, will definitely do that.

The idea of letting the user specify seems good, but would the user always know which features have missing data in the whole training data? And even if they do, will it be consistent for all the samples?

Also, if we proceed to add indicator features for every feature irrespective of imputation, something like VarianceThreshold could remove the indicator features that correspond to features with no missing values, right? Or maybe give an option specifically for the indicator features to cover either all features or just the imputed ones?

Member

Under shuffled cross-validation, you can't always be sure what's in your training set and what's in your test set. That's why allowing the user to specify them to Imputer may be important. However, to take back what I said, the indicator features won't be of any use if they're not seen in training, so it's probably valid not to add a parameter to specify which features.
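A small illustration of that point (my own example; ShuffleSplit is just used to produce the random splits, and the data are made up):

import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
X[0, 2] = np.nan    # only a single sample has feature 2 missing

# Depending on the split, that one missing value can land entirely in the
# test fold, so the training fold never sees feature 2 as "missing".
for train_idx, test_idx in ShuffleSplit(n_splits=5, test_size=0.2,
                                        random_state=0).split(X):
    print(np.isnan(X[train_idx, 2]).any(), np.isnan(X[test_idx, 2]).any())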

Contributor Author

Yeah, I get the point of not having it, since it assumes you know which features are missing in both the training and testing data. So is it better to proceed with adding indicator features for all the features?

@maniteja123
Contributor Author

Sorry, the comments went to the outdated diff. I hope I have at least taken this somewhat in the correct direction. The tests will fail right now because imputed_features_ holds only the imputed features, while the indicator matrix has a column for every feature irrespective of whether it is imputed or not.

Also, there is the case you mentioned, where the input to fit doesn't have a missing value for a feature but the transform input might. How should I proceed with this? Thanks.

@jnothman
Member

No, I think it's better to proceed as we were: indicator features only for those features with something (but not everything) missing at training time. Without those properties, the imputer is not of much value at transform time.


@jnothman
Member

If there is no missing value at fit time, but there is at transform, do we currently impute it? I think at this stage, we should not return an indicator column for that feature, on the basis that there is no training data to learn its from...


@jnothman
Member

*to learn its meaning from


@maniteja123
Contributor Author

The statistics are calculated at fit time, except for features with all missing values when axis=0. So if a feature has a missing value in transform, it does get imputed. But if I am not mistaken, the shape of the indicator matrix will again vary with the input to transform if we go with "something but not everything".

@jnothman
Member

Then I'm sure you're mistaken. Give me an example; a test case?

@jnothman
Member

I'm not sure what you mean by your "except for", either.

@maniteja123
Contributor Author

Sorry, I meant the statistic will be NaN for features with all missing values. As for a test case: if we take test_indicator_features_shape and suppose we fit on X1, then when X1 is transformed the shape is (5, 6) [5, 3+3], but transformed X2 will be (5, 5) [5, 3+2], right? Since we add indicator features only for features with some missing values, and in X2 feature 3 has no missing value. Am I understanding it wrong?

@jnothman
Member

The golden rule that "every call to transform must output the same number of features" says you must be wrong. No, we add indicator features according to missing values at fit time, not transform time.

@maniteja123
Contributor Author

Sorry, I won't bother you much more. I will try to show an example of what I understand here.

>>> X1 = np.array([
...     [-1,  -1,   2,   3],
...     [4,  -1,   6,  -1],
...     [8,  -1,  10,  11],
...     [12,  -1,  -1,  15],
...     [16,  -1,  18,  19]
... ])
>>> 
>>> X2 = np.array([
...     [-1,  -1,   1,   3],
...     [4,  -1,   0,  -1],
...     [8,  -1,   1,  0],
...     [0,  -1,   0,  15],
...     [16,  -1,   1,  19]
... ])

>>> imputer = Imputer(missing_values=-1, strategy='mean', axis=0, add_indicator_features=True)
>>> imputer.fit(X1)
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values=-1,
    strategy='mean', verbose=0)
>>> imputer.transform(X1)
array([[ 10.,   2.,   3.,   1.,   0.,   0.],
       [  4.,   6.,  12.,   0.,   0.,   1.],
       [  8.,  10.,  11.,   0.,   0.,   0.],
       [ 12.,   9.,  15.,   0.,   1.,   0.],
       [ 16.,  18.,  19.,   0.,   0.,   0.]])
>>> imputer.transform(X2)
array([[ 10.,   1.,   3.,   1.,   0.,   0.],
       [  4.,   0.,  12.,   0.,   0.,   1.],
       [  8.,   1.,   0.,   0.,   0.,   0.],
       [  0.,   0.,  15.,   0.,   0.,   0.],
       [ 16.,   1.,  19.,   0.,   0.,   0.]])

>>> imputer = Imputer(missing_values=-1, strategy='mean', axis=0, add_indicator_features=True)
>>> imputer.fit(X2)
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values=-1,
    strategy='mean', verbose=0)
>>> imputer.transform(X1)
array([[  7.  ,   2.  ,   3.  ,   1.  ,   0.  ],
       [  4.  ,   6.  ,   9.25,   0.  ,   1.  ],
       [  8.  ,  10.  ,  11.  ,   0.  ,   0.  ],
       [ 12.  ,   0.6 ,  15.  ,   0.  ,   0.  ],
       [ 16.  ,  18.  ,  19.  ,   0.  ,   0.  ]])
>>> imputer.transform(X2)
array([[  7.  ,   1.  ,   3.  ,   1.  ,   0.  ],
       [  4.  ,   0.  ,   9.25,   0.  ,   1.  ],
       [  8.  ,   1.  ,   0.  ,   0.  ,   0.  ],
       [  0.  ,   0.  ,  15.  ,   0.  ,   0.  ],
       [ 16.  ,   1.  ,  19.  ,   0.  ,   0.  ]])

Is this the expected imputation, since it depends on the missing values at fit time?

@jnothman
Member

Yes, that seems to be right.


@maniteja123
Contributor Author

But if you look at the transformed output of X2, then the feature with no missing values (feature 3) also has an indicator feature, since it has missing values in X1, the input for fit.

And similarly, in the second case the transformed output for X1 has no indicator feature for feature 3 even though it has a missing value, since X2 (the input for fit) didn't have a missing value there (this is the case you were suggesting earlier).

Is that okay, since we wanted only features with some missing values (but not everything)?

@jnothman
Member

Yes, I think that's alright.

@maniteja123
Contributor Author

Thanks for clarifying. I will make the changes and push them in a short while, once I get back home.

@maniteja123 maniteja123 changed the title Imputer fix for indicator_matrix in add_indicator_features option FIX: Imputer fix for indicator_matrix in add_indicator_features option May 15, 2016
@maniteja123
Contributor Author

Hi, sorry for pinging, but I have made the changes to the best of my understanding. Please have a look to see whether this is the expected behavior. Thanks.

@@ -114,7 +114,8 @@ class Imputer(BaseEstimator, TransformerMixin):
The imputation fill value for each feature if axis == 0.

imputed_features_ : array of shape (n_features_with_missing, )
The input features which have been imputed during transform.
This attribute is available only when ``axis=0``.
Member

Can you put this after the following sentence, please?

@maniteja123
Contributor Author

Sorry, the pep8 on my system was old, which I failed to notice. I have updated it and fixed almost everything, except for some lines that are longer than 79 characters. Right now, the shape of the output should be independent of the shape of the transform input and depend on fit only. Please have a look at your convenience and let me know if there is anything else to be done. Thanks.

@jnothman
Member

except some where there are more than 79 characters

we do generally adhere to that requirement

dtype=X.dtype)
inv_mask_matrix.eliminate_zeros() # removes explicit False entries
all_missing_ = inv_mask_matrix.sum(axis=0).A.nonzero()[1]
self.mask_matrix_ = self.mask_matrix_[:, self.imputed_features_]
Member

we should not be setting any attributes at transform time
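For reference, the convention being pointed at, in a minimal sketch (an illustrative toy transformer, not the Imputer code): everything learned from data is stored as a trailing-underscore attribute in fit, and transform only reads that state.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DropConstantColumns(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X)
        # fit is the only place that sets estimated (trailing underscore) state
        self.columns_to_keep_ = np.where(X.std(axis=0) > 0)[0]
        return self

    def transform(self, X):
        X = np.asarray(X)
        # transform uses the fitted state but keeps everything else in local
        # variables, so repeated transform calls cannot change the estimator
        return X[:, self.columns_to_keep_]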

@@ -355,6 +355,8 @@ def _sparse_transform(self, X, statistics):

if self.axis == 0:
all_missing = np.where(mask_matrix.sum(axis=0) == X.shape[0])[0]
if isinstance(all_missing, np.matrix):
Member

Please comment that this is for old scipy
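Roughly what that guard is about, as far as I can tell from the snippet: summing a scipy.sparse matrix along an axis returns a numpy.matrix, so the result has to be flattened before it can be used as a 1-d feature index. A sketch of that flattening, not the exact code in the diff:

import numpy as np
import scipy.sparse as sp

mask = sp.csc_matrix(np.array([[1, 0, 1],
                               [1, 0, 0]]))

col_counts = mask.sum(axis=0)
print(type(col_counts))                    # <class 'numpy.matrix'>, shape (1, 3)

# Flattening to a plain 1-d array before np.where gives ordinary index arrays
# regardless of the scipy/numpy version in use.
flat = np.asarray(col_counts).ravel()
all_missing = np.where(flat == mask.shape[0])[0]
print(all_missing)                         # array([0])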

@jnothman
Member

Thanks @maniteja123. Apart from my not being sure that it's sufficiently documented, this is looking pretty close to what we specified. Even so (sorry @mfeurer), I continue to be undecided on whether it's the correct approach.

I'm now wondering, for instance: why, if we're only adding indicators for features with training-time missing values when axis=0, are we not doing the same for axis=1? After all, that restriction is really about features being useless if they're constant at training time. Maybe the axis=1 case should use imputed_features_ identically to axis=0.

Alternatively, we could just default to including all features in the indicator matrix, unless specified by providing a list of feature indices like add_indicator_features=[0, 2, 3]. In that case, we have no need for imputed_features_.
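To spell out that alternative (purely hypothetical; this form of the parameter was never implemented), the indicator construction under something like add_indicator_features=[0, 2, 3] could look roughly like this in plain NumPy, with no need for an imputed_features_ attribute:

import numpy as np

def build_indicator(X, add_indicator_features, missing_values=-1):
    # True        -> an indicator column for every feature
    # list/array  -> indicator columns only for the listed features
    if add_indicator_features is True:
        features = np.arange(X.shape[1])
    else:
        features = np.asarray(add_indicator_features)
    return (X[:, features] == missing_values).astype(float)

X = np.array([[-1,  2,  3, -1],
              [ 4, -1,  6,  7]])
print(build_indicator(X, True).shape)         # (2, 4)
print(build_indicator(X, [0, 2, 3]).shape)    # (2, 3)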

@jnothman
Member

@maniteja123, I'd like to make some changes to this myself, in part because I keep changing the spec, in part because I'm a little anxious to make sure this makes it into 0.18 (or else IMO we'll need to revert add_indicator_features). It's now looking close, but there are still some bits and pieces that could be neatened up. Is it okay if I offer pull requests to your branch? Alternatively, I could make a new PR from my own branch, and take ownership of the process of fixing it up for merge (either way we'll need two others to review).

@maniteja123
Contributor Author

Sure Joel, please do take over the work in case you wish to. I understand that this needs to come to a conclusive state before the release. Sorry for being unable to complete it in time. Do let me know if I can do anything to assist.

@raghavrv
Member

Any updates on this? I need this for my benchmarks...

features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing "
"values in fit but have non-missing values"
"in transform " % features, RuntimeWarning,
Member

space before in

features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing"
" values in fit but have non-missing values in"
"transform" % features, RuntimeWarning,
Member

space before transform
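With both missing spaces added, the concatenated message reads as intended; a sketch of the corrected call (the feature indices are placeholders for illustration):

import warnings
import numpy as np

features = np.array([2, 5])   # placeholder indices, just for illustration
warnings.warn("The features %s have all missing values in fit "
              "but have non-missing values in transform" % features,
              RuntimeWarning)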

@amueller
Member

@raghavrv wait, is this PR still relevant? The missing feature support was completely removed from master by @jnothman and needs a rewrite.

@raghavrv
Member

This wasn't the rewrite?

@raghavrv
Member

My bad, this simply tries to address the issue raised by @mfeurer... Should we close this then?

@amueller
Member

If this is a patch against the old implementation, which I think it is, then yes. @maniteja123 can you confirm?

@raghavrv
Member

Indeed it is.

@maniteja123
Contributor Author

Hi, yeah, it is based on the old implementation. I suppose it can be closed. Sorry for keeping it open.

@jnothman
Member

Well, it can be reopened and fixed up, or a new PR opened. It's fine for it to be based on the old implementation from a code perspective: GitHub will take the appropriate diff. But the discussion here is less relevant.


@raghavrv
Member

@maniteja123 Are you up for it?

@maniteja123
Contributor Author

Hi @raghavrv, I would be happy to work on it.

@raghavrv
Member

Thanks please ping me for a review when you have a PR :)

Successfully merging this pull request may close these issues.

Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform()