FIX: Imputer fix for indicator_matrix in add_indicator_features option by maniteja123 · Pull Request #6782 · scikit-learn/scikit-learn · GitHub

FIX: Imputer fix for indicator_matrix in add_indicator_features option #6782


Closed
wants to merge 5 commits

Conversation

maniteja123
Contributor

Reference Issue

Fixes #6767

What does this implement/fix? Explain your changes.

The indicator matrix produced by Imputer's add_indicator_features option should depend on the fit input and be independent of the transform input.

Any other comments?

I have made many changes to the code, but the tests are passing. Please let me know if I changed something incorrectly.
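To make the intended contract concrete, here is a minimal NumPy sketch (my own illustration, not the patch itself), assuming missing values are encoded as -1, dense input and axis=0; the fit-time column selection mirrors what the patch exposes as imputed_features_:

import numpy as np

MISSING = -1

def fit_imputed_features(X_fit):
    # Features with at least one, but not all, values missing at fit time;
    # these fix the number of indicator columns for every later transform.
    mask = (X_fit == MISSING)
    n_missing = mask.sum(axis=0)
    return np.where((n_missing > 0) & (n_missing < X_fit.shape[0]))[0]

def indicator_matrix(X_new, imputed_features):
    # Indicator columns are taken only for the fit-time features, so the
    # width of the output does not depend on where X_new has missing values.
    return (X_new[:, imputed_features] == MISSING).astype(float)

X_fit = np.array([[-1,  2,  3],
                  [ 4,  6, -1],
                  [ 8, 10, 11]])
X_new = np.array([[ 1, -1,  3],
                  [ 4,  6,  5]])             # missing value in a different feature

cols = fit_imputed_features(X_fit)           # array([0, 2])
print(indicator_matrix(X_fit, cols).shape)   # (3, 2)
print(indicator_matrix(X_new, cols).shape)   # (2, 2) -- same number of columns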

@maniteja123 maniteja123 changed the title Imputer fix Imputer fix for indicator_matrix in add_indicator_features option May 14, 2016
@@ -474,11 +474,11 @@ then the shape of the transformed input is
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values='NaN',
strategy='mean', verbose=0)
>>> print(imp_with_in.transform(X)) # doctest: +ELLIPSIS
[[ 4. 2. 1. 0. ]
[ 6. 3.66666667 0. 1. ]
[[ 4. 2. 0. 0. ]
Member

What??? No, the indicator features need to correspond to the transform input. The X.shape[0] does not need to be the same for fit and transform.

@jnothman jnothman May 14, 2016
Member

Clearly, you were right to ask questions because you've still missed the point. Have you used scikit-learn's transformers? Do you understand their API design? Do you understand what this add_indicator_features is for?

Yes, we made an error in reviewing the first PR, but not one as glaring as to get this wrong!

Contributor Author

Apologies!! This is really wrong. I accidentally pushed the wrong commit from my other machine. I totally understand the API design and that the fix is just about the shape of the transformed output being consistent for different inputs. I will revert everything correctly.

But one final question, please: right now the indicator matrix has a column for each feature, regardless of whether that feature has missing values in the current transform input, while imputed_features_ contains only the features with at least one missing value. The reason is that I can't do X[:, expected_imputer_features], because imputed_features_ won't contain such a feature while the indicator matrix will still have a column for it. This is the only thing I suppose I didn't ask about.

Sorry, I don't know exactly how these indicator features help, but I understand that they provide information about the missing data and thus can help improve accuracy. If you feel I am still not getting it right, please let me know whether it would be better for someone else to do this.

Member

I'd like you to have a go, but showing what you've got early - and writing tests to show your understanding - helps make explicit checkpoints.

I was thinking a couple of hours ago about the case where a feature has missing values at transform time but not at fit. I do think this is a possible case, but I also think it's silly to add an indicator column for each of thousands of features not requiring imputation. We could actually allow the user to specify which features may require imputation (and hence both statistics and indicator matrix), with a default to infer them from the training data. What do you think?

Contributor Author

Thank you, will definitely do that.

The idea of letting the user specify seems good, but would the user always know which features have missing data in the whole training data? And even if they do, will it be consistent for all the samples?

Also, if we proceed to add indicator features for every feature irrespective of imputation, something like VarianceThreshold could remove the indicator features that correspond to features with no missing values, right? Or maybe give an option specifically for the indicator features to cover either all features or just the imputed ones?

Member

Under shuffled cross-validation, you can't always be sure what's in your training set and what's in your test set. That's why allowing the user to specify them to Imputer may be important. However, to take back what I said, the indicator features won't be of any use if they're not seen in training, so it's probably valid not to add a parameter to specify which features.
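A small illustration of that point (my own example; ShuffleSplit is just used to produce the random splits, and the data are made up):

import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
X[0, 2] = np.nan    # only a single sample has feature 2 missing

# Depending on the split, that one missing value can land entirely in the
# test fold, so the training fold never sees feature 2 as "missing".
for train_idx, test_idx in ShuffleSplit(n_splits=5, test_size=0.2,
                                        random_state=0).split(X):
    print(np.isnan(X[train_idx, 2]).any(), np.isnan(X[test_idx, 2]).any())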

Contributor Author

Yeah, I get the point of not having it, since it assumes you know which features are missing in both the training and testing data. So is it better to proceed with adding indicator features for all the features?

@maniteja123
Contributor Author

Sorry, the comments went to the outdated diff. I hope I have at least taken this somewhat in the correct direction. The tests will fail right now because imputed_features_ holds only the imputed features, while the indicator matrix has a column for every feature irrespective of whether it is imputed or not.

Also, there is the case you mentioned, where the input to fit doesn't have a missing value for a feature but the transform input might. How should I proceed with this? Thanks.

@jnothman
Member

No, I think it's better to proceed as we were: indicator features only for those features with something (but not everything) missing at training time. Without those properties, the imputer is not of much value at transform time.


@jnothman
Member

If there is no missing value at fit time, but there is at transform, do we currently impute it? I think at this stage, we should not return an indicator column for that feature, on the basis that there is no training data to learn its from...


@jnothman
Member

*to learn its meaning from


@maniteja123
Contributor Author

The statistics are calculated at fit time, except for features with all missing values when axis=0. So if a feature has a missing value in transform, it does get imputed. But if I am not mistaken, the shape of the indicator matrix will again vary with the input to transform if we go with "something but not everything".

@jnothman
Member

Then I'm sure you're mistaken. Give me an example; a test case?

@jnothman
Member

I'm not sure what you mean by your "except for", either.

@maniteja123
Contributor Author

Sorry, I meant the statistic will be NaN for features with all missing values. As for a test case: if we take test_indicator_features_shape and suppose we fit on X1, then when X1 is transformed the shape is (5, 6) [5, 3+3], but transformed X2 will be (5, 5) [5, 3+2], right? Since we add indicator features only for features with some missing values, and in X2 feature 3 has no missing value. Am I understanding it wrong?

@jnothman
Member

The golden rule that "every call to transform must output the same number of features" says you must be wrong. No, we add indicator features according to missing values at fit time, not transform time.

@maniteja123
Contributor Author

Sorry, I won't bother you much more. I will try to show an example of what I understand here.

>>> X1 = np.array([
...     [-1,  -1,   2,   3],
...     [4,  -1,   6,  -1],
...     [8,  -1,  10,  11],
...     [12,  -1,  -1,  15],
...     [16,  -1,  18,  19]
... ])
>>> 
>>> X2 = np.array([
...     [-1,  -1,   1,   3],
...     [4,  -1,   0,  -1],
...     [8,  -1,   1,  0],
...     [0,  -1,   0,  15],
...     [16,  -1,   1,  19]
... ])

>>> imputer = Imputer(missing_values=-1, strategy='mean', axis=0, add_indicator_features=True)
>>> imputer.fit(X1)
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values=-1,
    strategy='mean', verbose=0)
>>> imputer.transform(X1)
array([[ 10.,   2.,   3.,   1.,   0.,   0.],
       [  4.,   6.,  12.,   0.,   0.,   1.],
       [  8.,  10.,  11.,   0.,   0.,   0.],
       [ 12.,   9.,  15.,   0.,   1.,   0.],
       [ 16.,  18.,  19.,   0.,   0.,   0.]])
>>> imputer.transform(X2)
array([[ 10.,   1.,   3.,   1.,   0.,   0.],
       [  4.,   0.,  12.,   0.,   0.,   1.],
       [  8.,   1.,   0.,   0.,   0.,   0.],
       [  0.,   0.,  15.,   0.,   0.,   0.],
       [ 16.,   1.,  19.,   0.,   0.,   0.]])

>>> imputer = Imputer(missing_values=-1, strategy='mean', axis=0, add_indicator_features=True)
>>> imputer.fit(X2)
Imputer(add_indicator_features=True, axis=0, copy=True, missing_values=-1,
    strategy='mean', verbose=0)
>>> imputer.transform(X1)
array([[  7.  ,   2.  ,   3.  ,   1.  ,   0.  ],
       [  4.  ,   6.  ,   9.25,   0.  ,   1.  ],
       [  8.  ,  10.  ,  11.  ,   0.  ,   0.  ],
       [ 12.  ,   0.6 ,  15.  ,   0.  ,   0.  ],
       [ 16.  ,  18.  ,  19.  ,   0.  ,   0.  ]])
>>> imputer.transform(X2)
array([[  7.  ,   1.  ,   3.  ,   1.  ,   0.  ],
       [  4.  ,   0.  ,   9.25,   0.  ,   1.  ],
       [  8.  ,   1.  ,   0.  ,   0.  ,   0.  ],
       [  0.  ,   0.  ,  15.  ,   0.  ,   0.  ],
       [ 16.  ,   1.  ,  19.  ,   0.  ,   0.  ]])

Is this the expected imputation, since it depends on the missing values at fit time?

@jnothman
Member

Yes, that seems to be right.


@maniteja123
Contributor Author

But if you look at the transformed output of X2, then the feature with no missing values (feature 3) also has an indicator feature, since it has missing values in X1, the input for fit.

And similarly, in the second case the transformed output for X1 has no indicator feature for feature 3 even though it has a missing value, since X2 (the input for fit) didn't have a missing value there (this is the case you were suggesting earlier).

Is that okay, since we wanted only features with some missing values (but not everything)?

@jnothman
Member

Yes, I think that's alright.

@maniteja123
Contributor Author

Thanks for clarifying. I will make the changes and push them in a short while, once I get back home.

@maniteja123 maniteja123 changed the title Imputer fix for indicator_matrix in add_indicator_features option FIX: Imputer fix for indicator_matrix in add_indicator_features option May 15, 2016
@maniteja123
Contributor Author

Hi, sorry for pinging, but I have made the changes to the best of my understanding. Please have a look to see whether this is the expected behavior. Thanks.

@@ -114,7 +114,8 @@ class Imputer(BaseEstimator, TransformerMixin):
The imputation fill value for each feature if axis == 0.

imputed_features_ : array of shape (n_features_with_missing, )
The input features which have been imputed during transform.
This attribute is available only when ``axis=0``.
Member

Can you put this after the following sentence, please?

@maniteja123
Contributor Author

Sorry, the pep8 on my system was old, which I failed to notice. I have updated it and fixed almost everything, except for some lines that are longer than 79 characters. Right now, the shape of the output should be independent of the shape of the transform input and depend on fit only. Please have a look at your convenience and let me know if there is anything else to be done. Thanks.

@jnothman
Member

except some where there are more than 79 characters

we do generally adhere to that requirement

dtype=X.dtype)
inv_mask_matrix.eliminate_zeros() # removes explicit False entries
all_missing_ = inv_mask_matrix.sum(axis=0).A.nonzero()[1]
self.mask_matrix_ = self.mask_matrix_[:, self.imputed_features_]
Member

we should not be setting any attributes at transform time
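For reference, the convention being pointed at, in a minimal sketch (an illustrative toy transformer, not the Imputer code): everything learned from data is stored as a trailing-underscore attribute in fit, and transform only reads that state.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DropConstantColumns(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X)
        # fit is the only place that sets estimated (trailing underscore) state
        self.columns_to_keep_ = np.where(X.std(axis=0) > 0)[0]
        return self

    def transform(self, X):
        X = np.asarray(X)
        # transform uses the fitted state but keeps everything else in local
        # variables, so repeated transform calls cannot change the estimator
        return X[:, self.columns_to_keep_]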

@@ -355,6 +355,8 @@ def _sparse_transform(self, X, statistics):

if self.axis == 0:
all_missing = np.where(mask_matrix.sum(axis=0) == X.shape[0])[0]
if isinstance(all_missing, np.matrix):
Member

Please comment that this is for old scipy
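Roughly what that guard is about, as far as I can tell from the snippet: summing a scipy.sparse matrix along an axis returns a numpy.matrix, so the result has to be flattened before it can be used as a 1-d feature index. A sketch of that flattening, not the exact code in the diff:

import numpy as np
import scipy.sparse as sp

mask = sp.csc_matrix(np.array([[1, 0, 1],
                               [1, 0, 0]]))

col_counts = mask.sum(axis=0)
print(type(col_counts))                    # <class 'numpy.matrix'>, shape (1, 3)

# Flattening to a plain 1-d array before np.where gives ordinary index arrays
# regardless of the scipy/numpy version in use.
flat = np.asarray(col_counts).ravel()
all_missing = np.where(flat == mask.shape[0])[0]
print(all_missing)                         # array([0])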

@jnothman
Member

Thanks @maniteja123. Apart from my not being sure that it's sufficiently documented, this is looking pretty close to what we specified. Even so (sorry @mfeurer), I continue to be undecided on whether it's the correct approach.

I'm now wondering, for instance: why, if we're only adding indicators for features with training-time missing values when axis=0, are we not doing the same for axis=1? After all, that restriction is really about features being useless if they're constant at training time. Maybe the axis=1 case should use imputed_features_ identically to axis=0.

Alternatively, we could just default to including all features in the indicator matrix, unless specified by providing a list of feature indices like add_indicator_features=[0, 2, 3]. In that case, we have no need for imputed_features_.
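To spell out that alternative (purely hypothetical; this form of the parameter was never implemented), the indicator construction under something like add_indicator_features=[0, 2, 3] could look roughly like this in plain NumPy, with no need for an imputed_features_ attribute:

import numpy as np

def build_indicator(X, add_indicator_features, missing_values=-1):
    # True        -> an indicator column for every feature
    # list/array  -> indicator columns only for the listed features
    if add_indicator_features is True:
        features = np.arange(X.shape[1])
    else:
        features = np.asarray(add_indicator_features)
    return (X[:, features] == missing_values).astype(float)

X = np.array([[-1,  2,  3, -1],
              [ 4, -1,  6,  7]])
print(build_indicator(X, True).shape)         # (2, 4)
print(build_indicator(X, [0, 2, 3]).shape)    # (2, 3)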

@jnothman
Member

@maniteja123, I'd like to make some changes to this myself, in part because I keep changing the spec, in part because I'm a little anxious to make sure this makes it into 0.18 (or else IMO we'll need to revert add_indicator_features). It's now looking close, but there are still some bits and pieces that could be neatened up. Is it okay if I offer pull requests to your branch? Alternatively, I could make a new PR from my own branch, and take ownership of the process of fixing it up for merge (either way we'll need two others to review).

@maniteja123
Contributor Author

Sure Joel, please do take over the work in case you wish to. I understand that this needs to come to a conclusive state before the release. Sorry for being unable to complete it in time. Do let me know if I can do anything to assist.

@raghavrv
Member

Any updates on this? I need this for my benchmarks...

features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing "
"values in fit but have non-missing values"
"in transform " % features, RuntimeWarning,
Member

space before in

features = np.setdiff1d(self.all_missing_, all_missing)
warnings.warn("The features %s have all missing"
" values in fit but have non-missing values in"
"transform" % features, RuntimeWarning,
Member

space before transform
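With both missing spaces added, the concatenated message reads as intended; a sketch of the corrected call (the feature indices are placeholders for illustration):

import warnings
import numpy as np

features = np.array([2, 5])   # placeholder indices, just for illustration
warnings.warn("The features %s have all missing values in fit "
              "but have non-missing values in transform" % features,
              RuntimeWarning)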

@amueller
Member

@raghavrv wait, is this PR still relevant? The missing feature support was completely removed from master by @jnothman and needs a rewrite.

@raghavrv
Member

This wasn't the rewrite?

@raghavrv
Member

My bad, this simply tries to address the issue raised by @mfeurer... Should we close this then?

@amueller
Member

If this is a patch against the old implementation, which I think it is, then yes. @maniteja123 can you confirm?

@raghavrv
Member

Indeed it is.

@maniteja123
Contributor Author

Hi, yeah, it is based on the old implementation. I suppose it can be closed. Sorry for keeping it open.

@jnothman
Member

Well, it can be reopened and fixed up, or a new PR opened. It's fine for it to be based on the old implementation from a code perspective: GitHub will take the appropriate diff. But the discussion here is less relevant.


@raghavrv
Member

@maniteja123 Are you up for it?

@maniteja123
Contributor Author

Hi @raghavrv, I would be happy to work on it.

@raghavrv
Member

Thanks please ping me for a review when you have a PR :)

Successfully merging this pull request may close these issues.

Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform()