Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform() · Issue #6767 · scikit-learn/scikit-learn
Status: Closed
@mfeurer opened this issue May 6, 2016 · 60 comments
Labels: Bug, Easy
Milestone: 0.18

Comments

@mfeurer (Contributor) commented May 6, 2016
from sklearn.preprocessing import Imputer
import numpy as np

# The second feature of X_train has a missing value; X_test has none.
X_train = np.array([[1, np.NaN], [1, 1]])
X_test = np.array([[1, 1], [1, 1]])

imputer = Imputer(add_indicator_features=True)
X_train_imp = imputer.fit_transform(X_train)
print(X_train_imp.shape)  # indicator column appended

X_test_imp = imputer.transform(X_test)
print(X_test_imp.shape)  # no indicator column: the width differs

Expected Results

(2, 3)
(2, 3)

Actual Results

(2, 3)
(2, 2)

Versions

Linux-3.13.0-54-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4]
NumPy 1.11.0
SciPy 0.17.0
Scikit-Learn 0.18.dev0

@maniteja123 (Contributor) commented

Hi, thanks for the report. The documentation for the imputer says:

If True, the transformed X will have binary indicator features appended. These correspond to input features with at least one missing value, marking which elements have been imputed.

So in the present case: in the first call there is one feature with a missing value, hence an additional column is appended; in the second call no feature has a missing value, so no column is appended.

Hope that helps, and please do share any suggestions for documenting this behaviour more explicitly. Thanks.

@mfeurer (Contributor, Author) commented May 6, 2016

Sorry, the actual problem is not clearly visible from the way I posted the issue.

The Python code I posted isn't two use cases; it is a single use case, namely fitting a machine learning pipeline and later using it to predict for unseen data. What happens is that after calling fit_transform() I have more features than after calling transform(), which breaks a classifier used afterwards.

Actually, you are doing the very same thing in the examples you have checked in:

estimator = Pipeline([("imputer", Imputer(missing_values=0,

Therefore, I think this behaviour is wrong, and transform() should always output the same number of features as fit_transform().

@maniteja123 (Contributor) commented

Sorry if I am understanding it wrong here, but fit_transform is called on X_train while transform is called on X_test, at least in the snippet above. Is there a case where fit_transform and transform give different shapes for the same input? If so, that is definitely wrong; could you kindly give a reproducible example demonstrating that behaviour? Thanks.

@maniteja123 (Contributor) commented

And sorry for treating them as separate use cases. I understand that fit and transform are part of a pipeline, and the example was along those lines. But the part I was referring to is that the inputs being transformed are different in the two imputations being performed. Hope I have it right now.

@mfeurer (Contributor, Author) commented May 6, 2016

No, there is no case where the same input leads to different output shapes. After looking at the code, and the use case I see for the Imputer, what I'm trying to say is that the array imputed_features_ should be set in the fit method and then be re-used in the transform and fit_transform methods. This ensures that each call to transform, for different inputs X, results in the same number of features. What the code is doing right now is learning which value to use for imputation, but adding the indicator features only if an imputation actually takes place. In order to use it in a pipeline, I would have to write custom code to ensure the output array always has the same width.

You can also reproduce the problem with the example missing_values.py by changing the missing rate to 0.01.

To help me understand your point of view, could you please give me an example in which it is useful to have a different number of features in transform depending on the given input?
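To illustrate the behaviour being proposed, here is a minimal hedged sketch (the class name and helper logic are hypothetical, not the scikit-learn implementation; only the imputed_features_ attribute name comes from the discussion above), in which the set of indicator columns is fixed once at fit time:

import numpy as np

class ShapeStableIndicator:
    # Sketch only: record at fit time which features contained missing
    # values, then always append indicators for exactly those features.

    def fit(self, X):
        # Decide the indicator columns once, from the training data.
        self.imputed_features_ = np.where(np.isnan(X).any(axis=0))[0]
        return self

    def transform(self, X):
        # Reuse the fit-time set, so the output width never varies.
        # (No actual imputation is performed in this sketch.)
        indicators = np.isnan(X[:, self.imputed_features_]).astype(X.dtype)
        return np.hstack([X, indicators])

    def fit_transform(self, X):
        return self.fit(X).transform(X)

X_train = np.array([[1.0, np.nan], [1.0, 1.0]])
X_test = np.array([[1.0, 1.0], [1.0, 1.0]])
t = ShapeStableIndicator()
print(t.fit_transform(X_train).shape)  # (2, 3)
print(t.transform(X_test).shape)       # (2, 3), same width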

@mfeurer changed the title from "Imputer(with add_indicator_features=True) can create arrays of different shape in transform()" to "Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform()" on May 6, 2016
@mfeurer (Contributor, Author) commented May 6, 2016

EDIT: I updated the title to make the issue more descriptive.

@maniteja123 (Contributor) commented

Thank you for the clarification. I understand that custom code would need to be written, since the output shape currently depends on the array being transformed rather than the one used for fit. As you rightly pointed out, the array used for fit is only used to calculate the statistics, while the array passed to transform decides the shape of the output, since it is the one being imputed. That was the approach taken in the implementation. Apologies for my ignorance, but I don't have a convincing example to put across my point, since the feature was suggested as an improvement by the core devs.

@jnothman (Member) commented May 7, 2016

Indeed, the output shape needs to depend on the array it was fit on, not the array being transformed. Thank you for the bug report @mfeurer, and sorry we somehow missed this in the implementation of add_indicator_features. And apologies to @maniteja123 if you were led to believe this was correct operation.

@jnothman added the Bug and Easy labels on May 7, 2016
@mfeurer (Contributor, Author) commented May 7, 2016

@jnothman No problem. Should I work on this on Monday?

@jnothman (Member) commented May 7, 2016

If you would like to. It may be easier for @maniteja123 to fix it up as the original implementer.

@maniteja123 (Contributor) commented

Hi, thanks @jnothman for clarifying. It is me who needs to apologise, for misinterpreting the function of the imputed-feature indicator. Thanks @mfeurer for the detailed explanation of the correct implementation. Please feel free to fix the issue, but I would be glad to get it done if that is fine by you. Thanks.

@mfeurer (Contributor, Author) commented May 7, 2016

I'm totally fine if you fix this, @maniteja123.

@jnothman (Member) commented May 7, 2016

In general, Maniteja, the number of features output by transform must be the same as fit_transform.

@maniteja123 (Contributor) commented

Thanks @mfeurer and @jnothman for giving clarity on the issue. Basically, IIUC, the statistics, along with the indicator_features mask, should be evaluated at fit time. Is it fine if I fix it by tomorrow?

@jnothman (Member) commented May 8, 2016

That'd be great, thanks!


@maniteja123 (Contributor) commented

Hi, I just have a question regarding the case when axis=1. Since the statistics are calculated at transform time rather than at fit time, what approach should be taken to build the indicator matrix in this case? Thanks.

@jnothman (Member) commented May 9, 2016

Unless I'm mistaken, in the axis=1 case, the existing imputer already does nothing with the fit-time statistics. They tell us nothing about the test data.

@maniteja123 (Contributor) commented

Yes, you are right. Sorry for not being clear: the statistics are used to build the mask. But how can I do that at fit time if the statistics are evaluated during transform?

@jnothman (Member) commented

You don't do it at fit time in the axis=1 case.
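For context, a hedged sketch of what axis=1 (row-wise) imputation amounts to; each sample's statistic comes from the sample itself, which is why there is nothing useful to collect at fit time:

import numpy as np

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, 6.0]])

# Each row is imputed with its own mean, computed from the array being
# transformed rather than from anything learned during fit.
row_means = np.nanmean(X, axis=1)
X_imputed = np.where(np.isnan(X), row_means[:, None], X)
print(X_imputed)  # [[1. 2. 3.] [4. 5. 6.]]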

@maniteja123 (Contributor) commented

Sorry for pestering you, but just one last question: the shape of the transformed output in the axis=1 case will be independent of the input during fit but dependent on the input during transform, right?

@jnothman (Member) commented

Hmm... Interesting question. The shape of transformed output needs to be invariant from transform to transform. So I guess you must add an indicator for every input column, regardless of whether you've seen values missing there on the current transform. There should be a test to that effect.
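A hedged sketch of such a test (hypothetical test name, written against the desired behaviour rather than the buggy one):

import numpy as np
from sklearn.preprocessing import Imputer

def test_transform_width_matches_fit_transform():
    # The output width of transform must match fit_transform, even when
    # the new data has missing values in different (or no) columns.
    X_fit = np.array([[1.0, np.nan], [1.0, 1.0]])
    X_new = np.array([[np.nan, 1.0], [1.0, 1.0]])
    imputer = Imputer(add_indicator_features=True)
    n_out = imputer.fit_transform(X_fit).shape[1]
    assert imputer.transform(X_new).shape[1] == n_out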

@jnothman (Member) commented

I don't really get why we're supporting axis=1 imputation anyway, but let's go with it.

@maniteja123 (Contributor) commented May 10, 2016

Thanks for clarifying. So IIUC the size of the imputer matrix is always (n_samples, n_features) then, right?

@jnothman (Member) commented

not sure what "imputer ma" is, but probably...

@maniteja123 (Contributor) commented

Sorry, typing from my phone; typo!

@maniteja123 (Contributor) commented

Thanks, I will change the behaviour, modify the existing tests, and add new tests where necessary. I will try to send a PR as soon as possible. Hope that is okay.

@maniteja123 (Contributor) commented

Sorry, one more question: what should imputed_features_ represent now? Thanks.

@jnothman (Member) commented

... then you want to keep current behaviour: drop the column.

@maniteja123 (Contributor) commented

Thanks, but then the column will be dropped during transform, while the approach we want is to fit the indicator features during fit. That was my confusion here.

@jnothman (Member) commented

Yes, it will be dropped during transform, and it will have no indicator feature column. I think there should be a RuntimeWarning if there are any non-missing values in that column at transform time.

I hope you're writing test cases along with asking these questions.
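A minimal sketch of the proposed warning (the helper name is hypothetical; in practice this check would live inside the imputer's transform):

import warnings
import numpy as np

def warn_on_observed_values_in_dropped_columns(X, dropped_idx):
    # Columns that were entirely missing at fit time are dropped at
    # transform time; warn if they now contain observed values.
    for j in dropped_idx:
        if not np.isnan(X[:, j]).all():
            warnings.warn(
                "Column %d had only missing values at fit time and will "
                "be dropped, but contains non-missing values at "
                "transform time." % j, RuntimeWarning)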

@maniteja123 (Contributor) commented

Thanks for patiently answering, and apologies for the many questions. I needed to clarify this because the method for generating the indicator features will now change: at fit time, the column with all missing values will still exist in the input array.

I will make sure to cover all these scenarios in the test cases.

@jnothman (Member) commented

the method for generating the indicator features will now change: at fit time, the column with all missing values will still exist in the input array

I don't get what you mean. The column with all missing values should be dropped in fit_transform, or in transform.

@maniteja123 (Contributor) commented May 14, 2016

Sorry, I meant the case where the feature has non-missing values at transform time, for which you said there should be a RuntimeWarning. If there are non-missing values, features_with_missing_values will contain that column, since it is computed from the mask of X while X still has the missing values (it is now evaluated during fit), but valid_idx won't contain it, since valid_idx depends on statistics, which discard any feature with no observed values. That line will therefore raise an IndexError. Should I make features_with_missing_values discard the feature in such a case?
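A hedged sketch of the mismatch being described, with variable names mirroring the ones above (illustrative only, not the actual implementation):

import numpy as np

X_fit = np.array([[np.nan, 1.0], [np.nan, 2.0]])  # feature 0 is all-missing

# Computed at fit time, before the all-missing column is dropped:
features_with_missing_values = np.where(np.isnan(X_fit).any(axis=0))[0]  # [0 1]

# The statistic for the all-missing feature is NaN (np.nanmean warns about
# the empty slice), so that feature is excluded from the valid indices:
statistics = np.nanmean(X_fit, axis=0)            # [nan 1.5]
valid_idx = np.where(~np.isnan(statistics))[0]    # [1]

# The two index sets disagree about feature 0; mapping one through the
# other is what produces the IndexError described above.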

@maniteja123 (Contributor) commented

I have pushed the changes temporarily at maniteja123@e015eca. As of now, valid_idx is generated from statistics, which discard the feature with all missing values, but features_with_missing_values_ does not take that into account, since it is computed during fit, when the column has not yet been dropped from X. Hope I have conveyed it correctly now.

@jnothman (Member) commented

Please just submit a PR. Unless I'm mistaken, of course you should make features_with_missing_values discard that feature in such a case.

Again, the principle is: the data input to fit and transform will look similar (same shape[1]), and the data output from fit_transform and every subsequent transform will look similar. You collect whatever statistics you need in fit (although this happens to be nothing in the axis=1 case). Almost all of these decisions derive from this basic definition of Transformer operation.

@maniteja123 (Contributor) commented

Thanks again for patiently bearing with my questions, and apologies for taking up so much of your time. The changes required me to modify many parts of the code, and I was hoping to do so with minimal digression from the present behaviour and code flow. I have submitted a PR at #6782. Kindly have a look at it and let me know whether it fixes this bug and functions as expected. Thanks.

@mfeurer (Contributor, Author) commented Aug 9, 2016

PING: any news on this? Can I help you guys fix this issue?

@jnothman added this to the 0.18 milestone on Aug 9, 2016
@jnothman (Member) commented Aug 9, 2016

Thanks for the ping, @mfeurer. I think this is an important one to get into 0.18. @maniteja123 are you still working on this?

@maniteja123 (Contributor) commented

Hi, sorry for the delay. I would be glad to complete the work on it. The tests are failing for the sparse case on some older versions; I would be thankful if someone has more insight into the reason. I will try to fix the errors. Besides the errors, it would also help to know whether the functionality is now as expected. Thanks!

@jnothman (Member) commented

A design decision. When a feature has missing values in transform but not in fit should we:

  1. have indicator features only for those features with missing values in training
  2. have an indicator for all features
  3. allow the user to specify the set of features for an indicator and otherwise default to 1 or 2
  4. have a further "any feature has missing" indicator feature to handle this case

In a traditional fit/predict pipeline, some indicator features created by {2, 3} may be constantly 0 at fit time, and so have no impact on the fitted model. However, these could be of benefit in a partial_fit approach... Option 4 actually allows the model to exploit knowledge of data absence in general, and indeed could be offered alongside any of the prior options.

@mfeurer, @amueller, your opinions?
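To make the four options concrete, a hedged sketch of how each scheme might build its indicator columns (hypothetical helper, not a proposed API):

import numpy as np

def build_indicators(X, fit_missing_idx, option, user_features=None):
    # Illustration of the four schemes discussed above.
    mask = np.isnan(X)
    if option == 1:  # only features that had missing values in training
        return mask[:, fit_missing_idx].astype(float)
    if option == 2:  # one indicator per input feature
        return mask.astype(float)
    if option == 3:  # user-specified set of features
        return mask[:, user_features].astype(float)
    if option == 4:  # a single "any feature is missing" column
        return mask.any(axis=1).astype(float)[:, None]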

@mfeurer (Contributor, Author) commented Aug 10, 2016

I agree with your point about why Option 1 is a bad idea, and I think Option 4 is a kind of 'meta' information that shouldn't be added by the imputer. My favourite is Option 3, with a fallback to Option 2.

@mfeurer (Contributor, Author) commented Aug 10, 2016

Deleted

@jnothman (Member) commented

I don't particularly like option 2, because certainly in the dense data case, you can waste a lot of space when only a few features are liable to be missing.

I actually do quite like option 4 (I think its benefits from a learning perspective are the clearest), but maybe it needs another parameter, add_global_indicator_feature (yuck name!), to switch it on.

Practically, I'd be happy to have option 1 or 2 implemented now by default in #6782, and then 3 and 4 can come through additional parameters later.

I don't know which test failures you're referring to.

@mfeurer (Contributor, Author) commented Aug 10, 2016

Five hours later, Option 1 makes more sense than Option 2 :)

Option 3 would then override Option 1, right?

I also deleted the previous comment, sorry for that.

@jnothman (Member) commented

Well, it seems that I ask questions and then work out the answers for myself. But even better if there's consensus!

@amueller (Member) commented

How about failing on transform if indicator features are enabled, new features have missing values, and no set was specified?

@amueller (Member) commented

I think we can leave out the global one for now. Too many options :(

@raghavrv (Member) commented

Option 4 is a kind of 'meta' information that shouldn't be added by the imputer.

I agree; option 4 is basically a compound feature. Not sure we want to do that in the imputer?

Option 1 feels implicit...

Option 3 (provisioned via a list attribute missing_features) with a fallback to Option 2 (when missing_features is None) looks good to me. It is more explicit.

When a feature is a constant 0, most estimators ignore it. Maybe, if you want it removed, we could add a RemoveConstantFeature transformer?

Option 4 is a separate feature request. Maybe it should be raised as a separate issue?
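For reference, a constant all-zero indicator column can already be removed with scikit-learn's VarianceThreshold, whose default threshold drops zero-variance features; a minimal example:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0], [0.0, 2.0]])        # first column is constant
print(VarianceThreshold().fit_transform(X))   # keeps only the second column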

@raghavrv (Member) commented

How about failing on transform if indicator features are enabled, new features have missing values, and no set was specified?

With no other option implemented to suppress that error, I think simply raising an error would give users no control other than adding dummy rows (to their fit data) with missing values in the features that may go missing at transform.

@MechCoder (Member) commented

Pretty late reaction, but ouch!

@jnothman (Member) commented

Having reverted the feature, this is no longer an issue. Thanks for the input; we'll work on building it better next time.

@jnothman (Member) commented

And thanks so much to @mfeurer for bringing this to our attention.
