ENH: handle missing values in OneHotEncoder by Olamyy · Pull Request #12025 · scikit-learn/scikit-learn · GitHub

ENH: handle missing values in OneHotEncoder #12025

Closed

Olamyy wants to merge 6 commits into scikit-learn:main from Olamyy:handlemissing-onehotencoder

Olamyy commented

Reference Issues/PRs

Fixes #11996

What does this implement/fix? Explain your changes.

Currently contains 3 edits:

Tests to check that handle_missing and missing_values are passed and have correct values.
An update to the OneHotEncoder docstring and the preprocessing module documentation.
An initial implementation of the required features, as described in Handle missing values in OneHotEncoder #11996.

olamilekan added 2 commits

September 6, 2018 10:59


          Docstring Update, Modules docs update and initial logic implementation

9b92970


          Merge branch 'master' of https://github.com/scikit-learn/scikit-learn

e14bd48

Olamyy mentioned this pull request

Fixes #11996 : Handle missing values in OneHotEncoder #12017

Closed

jnothman reviewed

doc/modules/preprocessing.rst

@@ -540,6 +540,15 @@ columns for this feature will be all zeros
                   array([[1., 0., 0., 0., 0., 0.]])
+              Missing categorical features in the training data can be handled by specifying what happens to them using the ``handle_missing`` parameter. The values for this can be one of :
+              `all-missing`: This will replace all missing rows with NaN.

Member

jnothman

This kind of operational description belongs in the class docstring. Here you would focus on the benefits or use-cases of one or another approach.

sklearn/preprocessing/_encoders.py Outdated

+                  handle_missing : all-missing, all-zero or category
+                      What should be done to missing values. Should be one of:
+                      all-missing: Replace with a row of NaNs as above

Member

jnothman

For better rendering, use a reStructuredText definition list:

all-missing
    Replace with a row of NaNs as above
all-zero
    Replace with a row of zeros

What do you mean by "as above"?

Author

Olamyy

Fixed in the last commit.

"as above" is a typo.

sklearn/preprocessing/_encoders.py


		category: Represent with a separate one-hot column

		missing_values: NaN or None

Member

jnothman

I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

Member

jorisvandenbossche

> I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

+1
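To illustrate why supporting both NaN and None is harder than supporting NaN alone, here is a minimal sketch. The helpers `is_missing` and `isnan_supported` are hypothetical names made up for this illustration and are not part of scikit-learn. On a float column `np.isnan` is enough, while a mixed object column needs an element-wise check.

```python
import numpy as np


def is_missing(column):
    """Hypothetical helper: mask of entries that are None or NaN."""
    column = np.asarray(column, dtype=object)
    # None has to be tested by identity; NaN is the only value for which
    # `v != v` is true, so the same test is safe for strings and ints.
    return np.array([v is None or v != v for v in column], dtype=bool)


def isnan_supported(column):
    """Report whether np.isnan can be applied to this array at all."""
    try:
        np.isnan(column)
    except TypeError:
        return False
    return True


floats = np.array([1.0, np.nan, 3.0])
mixed = np.array(["a", np.nan, None, "b"], dtype=object)

# NaN-only input in a float column: a single vectorised call suffices.
nan_only_mask = np.isnan(floats)

# Mixed object column: np.isnan is not available, so fall back to the
# slower element-wise helper.
mixed_mask = is_missing(mixed)
```

The object-dtype path is the one that complicates both the implementation and the review, which is consistent with the suggestion to start with NaN only.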

sklearn/preprocessing/_encoders.py Outdated

-                          return _transform_selected(X, self._legacy_transform, self.dtype,
+                      if not self.missing_values:
+                          if self._legacy_mode:
+                              return _transform_selected(X, self._legacy_transform, self.dtype,
                                                      self._categorical_features,

Member

jnothman

Incorrect indentation

sklearn/preprocessing/_encoders.py Outdated

@@ -567,12 +581,30 @@ def transform(self, X):
                       X_out : sparse matrix if sparse=True else a 2-d array
                           Transformed input.
                       """
-                      if self._legacy_mode:
-                          return _transform_selected(X, self._legacy_transform, self.dtype,
+                      if not self.missing_values:

Member

jnothman

I'm not sure what you mean by if not self.missing_values
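One possible reading of the concern is that a truthiness test on a missing-value marker is ambiguous. The sketch below is only an illustration: `guard_taken` mirrors the condition from the diff, and `missing_handling_disabled` is a hypothetical explicit alternative that assumes None means "no missing-value handling".

```python
nan = float("nan")


def guard_taken(missing_values):
    """Mirror the condition in the diff: `if not self.missing_values`."""
    return not missing_values


def missing_handling_disabled(missing_values):
    """Hypothetical explicit check: only None disables the feature."""
    return missing_values is None


# NaN is truthy, so the guard is skipped for it, while falsy markers such
# as 0 or "" take the same branch as None.
for marker in (None, nan, 0, "", "NA"):
    print(repr(marker), guard_taken(marker), missing_handling_disabled(marker))
```

With the explicit identity check, a marker such as 0 or an empty string is no longer confused with "feature disabled".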

sklearn/preprocessing/_encoders.py Outdated

@@ -260,13 +272,15 @@ class OneHotEncoder(_BaseEncoder):
                   def __init__(self, n_values=None, categorical_features=None,
                                categories=None, sparse=True, dtype=np.float64,
-                               handle_unknown='error'):
+                               handle_unknown='error', missing_values=None, handle_missing=None):

Member

jnothman

I think we can make handle_missing='all-missing' the default

sklearn/preprocessing/tests/test_encoders.py

+                  # present during transform.
+                  oh = OneHotEncoder(handle_unknown='error', handle_missing='abcde')
+                  oh.fit(X)
+                  assert_raises(ValueError, oh.transform, y)

Member

jnothman

please test that an appropriate error message is raised
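A sketch of what such a test could look like, using only the standard library. The validator `check_handle_missing` and its error message are hypothetical stand-ins for whatever validation the encoder would perform; the three option names are taken from the docstring in this PR.

```python
import unittest

# Option names as listed in the proposed docstring.
ALLOWED = ("all-missing", "all-zero", "category")


def check_handle_missing(value):
    """Hypothetical validator for the proposed `handle_missing` parameter."""
    if value not in ALLOWED:
        raise ValueError(
            "handle_missing should be one of %r, got %r" % (ALLOWED, value)
        )
    return value


def test_invalid_handle_missing_message():
    case = unittest.TestCase()
    # Check the message as well as the exception type, so that an
    # unrelated ValueError cannot make the test pass by accident.
    with case.assertRaisesRegex(ValueError, "handle_missing should be one of"):
        check_handle_missing("abcde")


test_invalid_handle_missing_message()
```

In a pytest-based suite the same check is usually written with `pytest.raises(ValueError, match=...)`.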

Author

Olamyy

Working on the tests now.

jorisvandenbossche changed the title ~~Handlemissing onehotencoder~~ ENH: handle missing values in OneHotEncoder

olamilekan added 4 commits

September 8, 2018 10:45


          Fixed line length and updated docstring

91e0b43


          Fixed line length and updated docstring

3fa635a


          Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

6f89785

…into handlemissing-onehotencoder


          Updated logic

9ed9b7d

Member

jorisvandenbossche commented

@Olamyy Do you have time to update this PR?
To start with, actual tests for the new behaviours are needed.

jnothman mentioned this pull request

Handle missing values in OneHotEncoder #11996

Closed

jnothman added Stalled help wanted labels

amueller added the Needs work label

github-actions bot added the module:preprocessing label

Contributor

nilichen commented

take

github-actions bot assigned nilichen

github-actions bot removed the help wanted label

Member

thomasjpfan commented

@nilichen There is a bunch of discussion surrounding this issue. Please look at #11996 for details. Especially look at #11996 (comment)

Since the OneHotEncoder can handle many dtypes as input (floats, strings, ints), handling missing values can get a little involved.

Contributor

nilichen commented

@thomasjpfan Thanks for the info! There seems to be some discussion around implementing handle_missing='all-missing', and it would be nice to have some clarification. According to @jnothman:

> I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

However, @amueller and @ogrisel seem to have different opinions according to #11996 (comment) and #11996 (comment).
I'm wondering which direction I should go with.

EDIT: And if I understand correctly, imputing first with a constant value and then applying OneHotEncoder should have the same behaviour as handle_missing='category'.
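A minimal sketch of that impute-then-encode approach, assuming a scikit-learn version that provides `SimpleImputer`. The fill value "missing" is an arbitrary placeholder chosen for this illustration, and `handle_missing` itself is only the parameter proposed in this PR, not an existing OneHotEncoder argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([["female"], ["male"], [np.nan], ["female"]], dtype=object)

pipeline = make_pipeline(
    # Replace NaN with a constant placeholder so that it becomes an
    # ordinary category for the encoder.
    SimpleImputer(missing_values=np.nan, strategy="constant",
                  fill_value="missing"),
    OneHotEncoder(),
)

# The encoder returns a sparse matrix by default; densify for inspection.
encoded = pipeline.fit_transform(X).toarray()
categories = pipeline.named_steps["onehotencoder"].categories_[0]

print(categories)
print(encoded)
```

The missing row ends up with its own indicator column, which is the outcome described for handle_missing='category'.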

My view would be to have handle_missing='category' as the default. I have recently been reading Statistical Rethinking (2nd edition), which discusses encoding categories in Chapter 5.3. Say we have female, male and missing: if we encode missing as [0, 0] instead of [0, 0, 1], we are assuming there's more uncertainty among the categories female and male than missing. The difference is subtle but something to consider. Taken from the draft (it used to be online but has since been taken down; I hope it's OK to share here):

Member

jnothman commented

via email

To be clear, I was referring to only supporting NaN as a representation of missing values in the input to OneHotEncoder. @amueller and @ogrisel's comments regard the *output* of OneHotEncoder.transform, saying that handle_missing='all-missing' is not a useful option. I think we're all happy to exclude that option.

Contributor

nilichen commented

via email

That makes much more sense. Thank you for the clarification!

nilichen mentioned this pull request

[WIP] Handle NaNs in OneHotEncoder #16749

Closed

Base automatically changed from master to main

January 22, 2021 10:50

Member

thomasjpfan commented

Closing because this was fixed in #17317
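For context, a short sketch of the behaviour available after that change, assuming scikit-learn 0.24 or later, where OneHotEncoder treats missing values as a category of their own. No `handle_missing` parameter is involved here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["female"], ["male"], [np.nan]], dtype=object)

encoder = OneHotEncoder()
# Sparse output by default; densify for inspection.
encoded = encoder.fit_transform(X).toarray()
categories = encoder.categories_[0]

# NaN is listed last among the learned categories and gets its own column.
print(categories)
print(encoded)
```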

thomasjpfan closed this

woodly0 mentioned this pull request

What happend to the idea of adding a 'handle_missing' parameter to the OneHotEncoder? #26543

Open

Labels

module:preprocessing Needs work Stalled
