-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
ENH: handle missing values in OneHotEncoder #12025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -540,6 +540,15 @@ columns for this feature will be all zeros | |||
array([[1., 0., 0., 0., 0., 0.]]) | |||
|
|||
|
|||
Missing categorical features in the training data can be handled by specifying what happens to them using the ``handle_missing`` parameter. The values for this can be one of : | |||
|
|||
`all-missing`: This will replace all missing rows with NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of operational description belongs in the class docstring. Here you would focus on the benefits or use-cases of one or another approach.
sklearn/preprocessing/_encoders.py
Outdated
handle_missing : all-missing, all-zero or category | ||
What should be done to missing values. Should be one of: | ||
|
||
all-missing: Replace with a row of NaNs as above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better rendering, use restructured text definition list:
all-missing
Replace with a row of NaNs as above
all-zero
Replace with a row of zeros
What do you mean by "as above"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the last commit.
as above
is a typo.
|
||
category: Represent with a separate one-hot column | ||
|
||
missing_values: NaN or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.
+1
sklearn/preprocessing/_encoders.py
Outdated
return _transform_selected(X, self._legacy_transform, self.dtype, | ||
if not self.missing_values: | ||
if self._legacy_mode: | ||
return _transform_selected(X, self._legacy_transform, self.dtype, | ||
self._categorical_features, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation
sklearn/preprocessing/_encoders.py
Outdated
@@ -567,12 +581,30 @@ def transform(self, X): | |||
X_out : sparse matrix if sparse=True else a 2-d array | |||
Transformed input. | |||
""" | |||
if self._legacy_mode: | |||
return _transform_selected(X, self._legacy_transform, self.dtype, | |||
if not self.missing_values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean by if not self.missing_values
sklearn/preprocessing/_encoders.py
Outdated
@@ -260,13 +272,15 @@ class OneHotEncoder(_BaseEncoder): | |||
|
|||
def __init__(self, n_values=None, categorical_features=None, | |||
categories=None, sparse=True, dtype=np.float64, | |||
handle_unknown='error'): | |||
handle_unknown='error', missing_values=None, handle_missing=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make handle_missing='all-missing'
the default
# present during transform. | ||
oh = OneHotEncoder(handle_unknown='error', handle_missing='abcde') | ||
oh.fit(X) | ||
assert_raises(ValueError, oh.transform, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please test that an appropriate error message is raised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on the tests now.
@Olamyy Do you have time to update this PR? |
take |
@nilichen There is a bunch of discussion surrounding this issue. Please look at #11996 for details. Especially look at #11996 (comment) Since the |
@thomasjpfan Thanks for the info! There seems to be some discussion around implementing EDIT: And if I understand correctly, impute first with constant value then My view would be to have |
To be clear, I was referring to only supporting NaN as a representation of
missing values in the input to OneHotEncoder.
@amueller and @ogrisel's comments regard the *output* of
OneHotEncoder.transform, saying that handle_missing='all-missing' is not a
useful option. I think we're all happy to exclude that option.
|
That makes much more sense. Thank you for the clarification!
…On Tue, Mar 17, 2020 at 4:06 PM Joel Nothman ***@***.***> wrote:
To be clear, I was referring to only supporting NaN as a representation of
missing values in the input to OneHotEncoder.
@amueller and @ogrisel's comments regard the *output* of
OneHotEncoder.transform, saying that handle_missing='all-missing' is not a
useful option. I think we're all happy to exclude that option.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12025 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACFAWH5PMIFV2UTY5XCZ5B3RH766LANCNFSM4FTSR4GQ>
.
|
Closing because this was fixed in #17317 |
Reference Issues/PRs
Fixes #11996
What does this implement/fix? Explain your changes.
Currently contains 3 edits: