8000 ENH: handle missing values in OneHotEncoder by Olamyy · Pull Request #12025 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH: handle missing values in OneHotEncoder #12025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

Olamyy
Copy link
@Olamyy Olamyy commented Sep 6, 2018

Reference Issues/PRs

Fixes #11996

What does this implement/fix? Explain your changes.

Currently contains 3 edits:

  1. Tests to check handle_missing and missing_values are passed and have correct values
  2. An update to the OneHotEncoder docstring and the preprocessing module.
  3. An initial implementation logic of the required features as stated by Handle missing values in OneHotEncoder #11996

@@ -540,6 +540,15 @@ columns for this feature will be all zeros
array([[1., 0., 0., 0., 0., 0.]])


Missing categorical features in the training data can be handled by specifying what happens to them using the ``handle_missing`` parameter. The values for this can be one of :

`all-missing`: This will replace all missing rows with NaN.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of operational description belongs in the class docstring. Here you would focus on the benefits or use-cases of one or another approach.

handle_missing : all-missing, all-zero or category
What should be done to missing values. Should be one of:

all-missing: Replace with a row of NaNs as above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better rendering, use restructured text definition list:

all-missing
    Replace with a row of NaNs as above
all-zero
    Replace with a row of zeros

What do you mean by "as above"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the last commit.

as above is a typo.


category: Represent with a separate one-hot column

missing_values: NaN or None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

+1

return _transform_selected(X, self._legacy_transform, self.dtype,
if not self.missing_values:
if self._legacy_mode:
return _transform_selected(X, self._legacy_transform, self.dtype,
self._categorical_features,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation

@@ -567,12 +581,30 @@ def transform(self, X):
X_out : sparse matrix if sparse=True else a 2-d array
Transformed input.
"""
if self._legacy_mode:
return _transform_selected(X, self._legacy_transform, self.dtype,
if not self.missing_values:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by if not self.missing_values

@@ -260,13 +272,15 @@ class OneHotEncoder(_BaseEncoder):

def __init__(self, n_values=None, categorical_features=None,
categories=None, sparse=True, dtype=np.float64,
handle_unknown='error'):
handle_unknown='error', missing_values=None, handle_missing=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make handle_missing='all-missing' the default

# present during transform.
oh = OneHotEncoder(handle_unknown='error', handle_missing='abcde')
oh.fit(X)
assert_raises(ValueError, oh.transform, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please test that an appropriate error message is raised

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on the tests now.

@jorisvandenbossche jorisvandenbossche changed the title Handlemissing onehotencoder ENH: handle missing values in OneHotEncoder Sep 6, 2018
@jorisvandenbossche
Copy link
Member

@Olamyy Do you have time to update this PR?
To start with, actual tests for the new behaviours are needed.

@nilichen
Copy link
Contributor

take

@thomasjpfan
Copy link
Member

@nilichen There is a bunch of discussion surrounding this issue. Please look at #11996 for details. Especially look at #11996 (comment)

Since the OneHotEncoder can handle many dtypes as input (floats, strings, ints), handling missing values can get a little involved.

@nilichen
Copy link
Contributor
nilichen commented Mar 17, 2020

@thomasjpfan Thanks for the info! There seems to be some discussion around implementing handle_missing='all-missing' and it would actually be nice to have some clarification. According to @jnothman
I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky. However, @amueller and @ogrisel seem to have different opinions according to #11996 (comment) and #11996 (comment).
Wondering which direction I should go with.

EDIT: And if I understand correctly, impute first with constant value then OneHotEncoder should have the same behaviour as handle_missing='category'.

My view would be to have handle_missing='category' as default. I have been reading Statistical Rethinking 2nd recently where it discussed encoding categories in Chapter 5.3. Say we have female, male and missing, if encode missing as [0, 0] instead of [0, 0, 1], we are assuming there's more uncertainty among categories female and male than missing. The difference is subtle but something to consider. Taken from the draft (it used to be online and now it's take off, hope it's OK to share here):
image

@jnothman
Copy link
Member
jnothman commented Mar 17, 2020 via email

@nilichen
Copy link
Contributor
nilichen commented Mar 18, 2020 via email

Base automatically changed from master to main January 22, 2021 10:50
@thomasjpfan
Copy link
Member

Closing because this was fixed in #17317

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle missing values in OneHotEncoder
6 participants
0