Imputer(with add_indicator_features=True) can create arrays of different shape in transform() than in fit_transform() #6767
Hi, thanks for the report. In the documentation for the imputer, it is mentioned that an indicator column is appended for each feature that contains missing values in the data being transformed. So in the present use case: in the first case there is one feature with a missing value, hence an additional column is appended; in the second case, since there is no missing value in any feature, no new column is appended. Hope this helps, and please do share if you have any input on documenting this behaviour more explicitly. Thanks.
Sorry, the actual problem is not clearly visible from the way I posted the issue. The Python snippets I posted aren't two use cases; they are the same use case, namely fitting a machine learning pipeline and later using it to predict for unseen values. What's happening is that after calling fit_transform() I have more features than after calling transform(), which breaks a classifier used afterwards. Actually, you are doing the very same thing in the examples you have checked in: scikit-learn/examples/missing_values.py, line 81 at 18396be.
Sorry if I am understanding it wrong here, but fit_transform is called on X_train while transform is called on X_test, at least in the above-mentioned snippet. Is there a case where fit_transform and transform give different shapes for the same input? If so, that is definitely wrong; could you kindly give a reproducible example demonstrating that behaviour? Thanks.
And sorry if I treated them as separate use cases. I understand that fit and transform are part of the pipeline, and the example was along those lines. But the part I was referring to is that the inputs being transformed are different in the two imputations being performed. Hope I got it right now.
No, there is no case where the same input leads to different output shapes. After looking at the code, and at the use case I see for the Imputer, what I'm trying to say is that the shape of the output should be determined by the array passed to fit(), not by the one passed to transform(). You can also reproduce the problem I have with the example script mentioned above. To help me understand your point of view, could you please give me an example in which it is useful to have a different number of features in fit_transform() and transform()?
EDIT: I updated the title to make the issue more descriptive.
Thank you so much for the clarification. I understand that custom code needs to be written, since the output shape currently depends on the array being transformed rather than the one used for fit. As you rightly pointed out, the array used for fit is only used to calculate the statistics, and the array used for transform decides the shape of the output, since it is the one being imputed. This was the approach taken in the implementation. Apologies for my ignorance, but I don't have a convincing example to put across my point, since it was suggested as an improvement by the core devs.
Indeed, the output shape needs to depend on the array it was fit on, not the array being transformed. Thank you for the bug report @mfeurer, and sorry we somehow missed this in the implementation of add_indicator_features.
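A minimal sketch of that principle (a hypothetical FitTimeIndicator helper, purely illustrative rather than scikit-learn's API): the indicator columns are chosen at fit time, so transform always emits the same width.

```python
import numpy as np

class FitTimeIndicator:
    """Illustrative only: indicator columns are fixed at fit time."""

    def fit(self, X):
        # remember which features had missing values in the fit data
        self.indicator_cols_ = np.flatnonzero(np.isnan(X).any(axis=0))
        return self

    def transform(self, X):
        # indicator width depends only on what was seen at fit time
        indicators = np.isnan(X[:, self.indicator_cols_]).astype(float)
        return np.hstack([X, indicators])

X_train = np.array([[1.0, 2.0], [np.nan, 3.0]])
X_test = np.array([[4.0, 5.0], [6.0, 7.0]])
ind = FitTimeIndicator().fit(X_train)
print(ind.transform(X_train).shape)  # (2, 3)
print(ind.transform(X_test).shape)   # (2, 3): same width, as desired
```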
@jnothman No problem. Should I work on this on Monday?
If you would like to. It may be easier for @maniteja123 to fix it up as the original implementer.
Hi, thanks @jnothman for clarifying. It is me who needs to apologise for misinterpreting the function of the imputed-feature indicator. Thanks @mfeurer for the detailed explanation of the correct implementation. Please feel free to fix the issue, but I would be glad to get it done if that is fine by you. Thanks.
I'm totally fine if you fix this @maniteja123
In general, Maniteja, the number of features output by transform() needs to be fixed once the imputer is fit; it must not vary with the data passed to transform().
That'd be great, thanks!
Hi, just have a question regarding the case when axis=1.
Unless I'm mistaken, in the axis=1 case the statistics aren't computed at fit time at all; they are computed during transform.
Yeah yeah, you are right. Sorry for not being clear, but the statistics are used to build the mask. How can I do that at fit time if the statistics are evaluated during transform?
You don't do it at fit time in the axis=1 case.
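To see why (a small illustration; with axis=1 the statistic belongs to each row being transformed, so it cannot be precomputed from the fit data):

```python
import numpy as np

# axis=1 imputation fills each sample from its own row statistic,
# which only exists once the row to be transformed is available.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
row_means = np.nanmean(X, axis=1)                     # per-sample means
X_imputed = np.where(np.isnan(X), row_means[:, None], X)
print(X_imputed)   # [[1.  2.  3. ]
                   #  [4.  5.  4.5]]
```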
Sorry for pestering you, but just one last question: the shape of the transformed output with axis=1 will be independent of the input during fit but dependent on the input during transform, right?
Hmm... interesting question. The shape of the transformed output needs to be predictable in any case; for axis=1 it follows the shape of the input being transformed.
I don't really get why we're supporting axis=1 imputation anyway, but let's keep its current behaviour for now.
Thanks for clarifying. So IIUC the size of the imputer matrix is always (n_samples, n_features) then, right?
not sure what "imputer ma" is, but probably...
Sorry, typing from my phone; typo!
Thanks, I will change the behaviour, modify the tests and also add new tests if necessary. I will try to send a PR as soon as possible. Hope that is okay.
Sorry, one more question: what should happen when all of a feature's values are missing during fit?
... then you want to keep current behaviour: drop the column.
Thanks, but then the column will be dropped during transform, whereas the approach we want is to determine the indicator features during fit. That was my confusion here.
Yes, it will be dropped during transform, and it will have no indicator column. I hope you're writing test cases along with asking these questions.
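For context, the drop behaviour looks like this (a sketch against the 0.x-era Imputer; the mean of an all-NaN column is undefined, so the column is removed):

```python
import numpy as np
from sklearn.preprocessing import Imputer  # 0.x-era API, removed later

# the first feature is entirely missing at fit time
X_fit = np.array([[np.nan, 1.0],
                  [np.nan, 2.0]])
imp = Imputer(strategy='mean')
print(imp.fit_transform(X_fit).shape)  # (2, 1): the all-NaN column is dropped
```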
Thanks for patiently answering, and apologies for so many questions. I think I understand now, and I will make sure to cover all these scenarios in the test cases.
I don't get what you mean. The column with all values missing should be dropped in transform().
Sorry, I meant the case where the feature has non-missing values during transform: you said it should raise a RuntimeWarning. But if there are non-missing values, the column would still be dropped even though it now carries data, right?
I have pushed the changes temporarily at maniteja123@e015eca. As of now, the indicator features are determined by the data seen during fit.
Please just submit a PR; unless I'm mistaken, of course you should make it a work-in-progress PR if it's not finished. Again, the principle is: the data input to fit, not the data input to transform, determines the shape of the output.
Thanks again for patiently bearing with my questions, and apologies for taking so much of your time. The fix required changes to many parts of the code, and I was hoping to do it with minimal digression from the present behaviour and code flow. I have submitted a PR at #6782. Kindly have a look at it and let me know if it fixes this bug and functions as expected. Thanks.
PING - any news on this? Can I help you guys in fixing this issue?
Thanks for the ping, @mfeurer. I think this is an important one to get into 0.18. @maniteja123, are you still working on this?
Hi, sorry for the delay. I would be glad to complete the work on it. The tests are failing for the sparse case on some previous versions; I would be thankful if someone has more insight into the reason, and I will try to fix the errors. Besides the errors, it would be helpful to know whether the functionality is now as expected. Thanks!
A design decision. When a feature has missing values in transform but not in fit, what should the indicator output be? Roughly, the options are: (1) add indicators only for features that had missing values at fit time, silently ignoring missingness that first appears at transform time; (2) add an indicator column for every feature; (3) add indicators for a user-specified set of features; (4) add a single aggregate feature summarising the missingness in each sample.
In a traditional fit/predict pipeline, some indicator features created by {2,3} may be constantly 0 at fit time, and so have no impact on the fitted model. However, these could become useful if the model is later refit on data where those features do go missing.
I agree with your point about why Option 1 is a bad idea, and I think Option 4 is some kind of 'meta'-information which shouldn't be added by the imputer. My favourite is Option 3, with a fallback to Option 2.
(Deleted comment)
I don't particularly like option 2, because certainly in the dense data case you can waste a lot of space when only a few features are liable to be missing. I actually do quite like option 4 -- I think its benefits from a learning perspective are the most clear -- but maybe it needs another parameter to enable it. Practically, I'd be happy to have option 1 or 2 implemented now by default in #6782, and then 3 and 4 can come through additional parameters later. I don't know which test failures you're referring to.
Five hours later, Option 1 makes more sense than Option 2 :) Option 3 would then override Option 1, right? I also deleted the previous comment, sorry for that.
Well, it seems that I ask questions only to then work out the answer for myself. But even better if there's consensus!
How about failing on transform if indicator features are enabled, new features have missing values, and no set was specified?
I think we can leave the global one for now. Too many options :(
I agree, option 4 is basically a compounded feature; not sure if we want to do that in the imputer. Option 1 feels implicit... Option 3 (provisioned via a list attribute of feature indices) sounds workable. When a feature is a constant 0, most estimators ignore it; maybe if you want it removed, we could add an option for that later. Option 4 is a separate feature request. Maybe it should be raised as a separate issue?
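A rough sketch of what Option 3 could look like (class and parameter names are purely illustrative, not a proposed API): the user names the features that get indicators, so the output width is fixed up front.

```python
import numpy as np

class ListedIndicator:
    """Illustrative Option 3: indicator set chosen by the user."""

    def __init__(self, indicator_features):
        self.indicator_features = list(indicator_features)

    def fit(self, X):
        return self  # nothing to learn; the set is user-specified

    def transform(self, X):
        indicators = np.isnan(X[:, self.indicator_features]).astype(float)
        return np.hstack([X, indicators])

X = np.array([[1.0, np.nan, 3.0]])
print(ListedIndicator([1, 2]).fit(X).transform(X).shape)  # (1, 5)
```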
With no other option implemented to suppress that error, I think simply raising an error would give users no control other than to add dummy rows to their training data.
Pretty late reaction, but ouch!
Having reverted the feature, this is no longer an issue. Thanks for the input; we'll work on building it better next time.
And thanks so much to @mfeurer for bringing this to our attention.
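Steps to Reproduce
The shapes below come from a snippet along these lines (a reconstructed sketch, since the original code block was lost; it assumes the 0.18 development Imputer with the add_indicator_features flag):

```python
import numpy as np
from sklearn.preprocessing import Imputer  # 0.18.dev0; flag since reverted

X_train = np.array([[1.0, 2.0], [np.nan, 3.0]])  # one feature has a NaN
X_test = np.array([[4.0, 5.0], [6.0, 7.0]])      # no NaNs anywhere

imp = Imputer(strategy='mean', add_indicator_features=True)
print(imp.fit_transform(X_train).shape)  # indicator appended for feature 0
print(imp.transform(X_test).shape)       # should match fit_transform's width
```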
Expected Results
(2, 3)
(2, 3)
Actual Results
(2, 3)
(2, 2)
Versions
Linux-3.13.0-54-generic-x86_64-with-Ubuntu-14.04-trusty
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4]
NumPy 1.11.0
SciPy 0.17.0
Scikit-Learn 0.18.dev0