ENH: handle missing values in OneHotEncoder by Olamyy · Pull Request #12025 · scikit-learn/scikit-learn · GitHub
ENH: handle missing values in OneHotEncoder #12025


Closed
wants to merge 6 commits into from
12 changes: 12 additions & 0 deletions doc/modules/preprocessing.rst
@@ -540,6 +540,18 @@ columns for this feature will be all zeros
array([[1., 0., 0., 0., 0., 0.]])


Missing values in categorical features of the training data can be handled with the ``handle_missing`` parameter, which accepts one of the following values:

`all-missing`: A missing value is encoded as a row of NaNs.
Member

This kind of operational description belongs in the class docstring. Here you would focus on the benefits or use-cases of one or another approach.

`all-zero`: A missing value is encoded as a row of zeros.
`category`: A missing value is encoded in a separate one-hot column.

Note that for OneHotEncoder to handle missing values, you must tell it
which value marks an entry as missing. This is done with the
`missing_values` parameter, which can be either `NaN` or a custom
placeholder of your choice.


See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.
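
The three `handle_missing` strategies proposed in the documentation hunk above can be illustrated for a single feature without scikit-learn. This is a minimal sketch in plain Python, not the PR's implementation: `is_missing` and `one_hot_column` are hypothetical helpers, and treating both `None` and NaN as missing is an assumption made only for the illustration.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (assumed missing markers)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def one_hot_column(values, handle_missing="all-missing"):
    """One-hot encode one column; hypothetical helper, not scikit-learn API."""
    if handle_missing not in ("all-missing", "all-zero", "category"):
        raise ValueError(
            "handle_missing should be one of 'all-missing', 'all-zero' "
            "or 'category', got {!r}".format(handle_missing))

    # Categories are the sorted distinct non-missing values.
    categories = sorted({v for v in values if not is_missing(v)})
    # The "category" strategy reserves one extra trailing column.
    width = len(categories) + (1 if handle_missing == "category" else 0)

    rows = []
    for v in values:
        if not is_missing(v):
            row = [0.0] * width
            row[categories.index(v)] = 1.0
        elif handle_missing == "all-missing":
            row = [float("nan")] * width
        elif handle_missing == "all-zero":
            row = [0.0] * width
        else:  # "category": the last column flags missing entries
            row = [0.0] * width
            row[-1] = 1.0
        rows.append(row)
    return categories, rows


column = ["a", "b", None, "a"]
for strategy in ("all-missing", "all-zero", "category"):
    print(strategy, one_hot_column(column, strategy)[1])
```

As a rough reading of the trade-offs (an editorial interpretation, not a statement from the PR): a row of NaNs keeps missingness visible to a later imputation step, a row of zeros is indistinguishable from "none of the known categories", and a separate column lets a model treat missingness as its own category.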

51 changes: 42 additions & 9 deletions sklearn/preprocessing/_encoders.py
@@ -20,10 +20,8 @@
from .base import _transform_selected
from .label import _encode, _encode_check_unknown


range = six.moves.range


__all__ = [
'OneHotEncoder',
'OrdinalEncoder'
@@ -218,6 +216,21 @@ class OneHotEncoder(_BaseEncoder):
The ``n_values_`` attribute was deprecated in version
0.20 and will be removed in 0.22.

handle_missing : 'all-missing', 'all-zero' or 'category'
How missing values are encoded. Should be one of:

'all-missing':
Replace with a row of NaNs.

'all-zero':
Replace with a row of zeros.

'category':
Represent with a separate one-hot column.

missing_values : NaN or None
Member

I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

Member

> I still think we would be best off supporting only NaN initially. Simplifies the code and the reviewing... and handling NaN and None properly is tricky.

+1
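
To illustrate why the reviewers call handling NaN and None together tricky, here is a small standard-library sketch. It is not code from the PR, and `is_missing` is a hypothetical helper. It shows that NaN cannot be found with an equality test, while None cannot be passed to a NaN check. Note also that the PR's default, `missing_values="NaN"`, is a string rather than a float NaN, so it needs a special case of its own.

```python
import math

nan = float("nan")

# NaN never compares equal to itself, so an equality test cannot find it.
print(nan == nan)    # False
print(None == None)  # True (idiomatically written `x is None`)

# math.isnan only accepts numbers, so it cannot be applied to None.
try:
    math.isnan(None)
except TypeError as exc:
    print("TypeError:", exc)


def is_missing(value, missing_values=nan):
    """Hypothetical helper: match `value` against a single missing marker."""
    if missing_values is None:
        return value is None
    if isinstance(missing_values, float) and math.isnan(missing_values):
        return isinstance(value, float) and math.isnan(value)
    return value == missing_values


data = [1.0, nan, None]
print([is_missing(v) for v in data])                       # [False, True, False]
print([is_missing(v, missing_values=None) for v in data])  # [False, False, True]
```

Supporting only NaN at first, as suggested, would reduce this to the single `math.isnan` branch.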

The value that marks an entry as missing.

Examples
--------
Given a dataset with two features, we let the encoder find the unique
@@ -260,13 +273,15 @@ class OneHotEncoder(_BaseEncoder):

     def __init__(self, n_values=None, categorical_features=None,
                  categories=None, sparse=True, dtype=np.float64,
-                 handle_unknown='error'):
+                 handle_unknown='error', missing_values="NaN", handle_missing="all-missing"):
         self.categories = categories
         self.sparse = sparse
         self.dtype = dtype
         self.handle_unknown = handle_unknown
         self.n_values = n_values
         self.categorical_features = categorical_features
+        self.missing_values = missing_values
+        self.handle_missing = handle_missing

# Deprecated attributes

@@ -567,12 +582,30 @@ def transform(self, X):
         X_out : sparse matrix if sparse=True else a 2-d array
             Transformed input.
         """
-        if self._legacy_mode:
-            return _transform_selected(X, self._legacy_transform, self.dtype,
-                                       self._categorical_features,
-                                       copy=True)
+
+        if not self.handle_missing or self.handle_missing not in ["all-missing",
+                                                                   "all-zero", "category"]:
+            raise ValueError("Wrong 'handle_missing' value specified. "
+                             "'handle_missing' should be one of either "
+                             "['all-missing', 'all-zero', 'category']. "
+                             "Getting {0}".format(self.handle_missing))
+        missing_indices = np.argwhere(np.isnan(X)) if self.missing_values == "NaN" else \
+            np.argwhere(X == self.missing_values)
+        if self.handle_missing == "all-missing":
+            for i in missing_indices:
+                X[i] = np.nan
+        if self.handle_missing == "all-zero":
+            for i in missing_indices:
+                X[i] = 0
         else:
-            return self._transform_new(X)
+            # Replace with a separate one-hot column
+            pass
+
+        if self._legacy_mode:
+            return _transform_selected(X, self._legacy_transform,
+                                       self.dtype,
+                                       self._categorical_features, copy=True)
+        return self._transform_new(X)
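
The `transform` changes in this hunk collect missing positions with `np.argwhere` and then assign through `X[i]`. The NumPy sketch below is an illustration, not part of the PR. It shows what that indexing selects and how a boolean mask could address only the missing cells. The array `X` and the names `pairs`, `mask`, `X_zero`, and `rows_with_missing` are made up for the example.

```python
import numpy as np

X = np.array([[0.0, np.nan, 1.0],
              [1.0, 0.0, 3.0],
              [np.nan, 0.0, 2.0]])

# np.argwhere returns one (row, column) pair per missing entry.
pairs = np.argwhere(np.isnan(X))
print(pairs.tolist())  # [[0, 1], [2, 0]]

# Indexing with such a pair is integer-array indexing: X[pair] selects the
# whole rows numbered `row` and `column`, not the single cell, so an
# assignment through it would overwrite more than the missing value.
print(X[pairs[0]].shape)  # (2, 3)

# A boolean mask addresses exactly the missing cells; copying first leaves
# the caller's array unchanged.
mask = np.isnan(X)
X_zero = X.copy()
X_zero[mask] = 0

# Samples that contain at least one missing entry, e.g. to emit an
# all-zero or all-NaN output row for them.
rows_with_missing = mask.any(axis=1)
print(rows_with_missing.tolist())  # [True, False, True]
```

Working on a copy also avoids mutating the array passed to `transform`, which the in-place assignments in the hunk would do.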

def inverse_transform(self, X):
"""Convert the back data to the original representation.
@@ -659,7 +692,7 @@ def get_feature_names(self, input_features=None):
         cats = self.categories_
         if input_features is None:
             input_features = ['x%d' % i for i in range(len(cats))]
-        elif(len(input_features) != len(self.categories_)):
+        elif (len(input_features) != len(self.categories_)):
             raise ValueError(
                 "input_features should have length equal to number of "
                 "features ({}), got {}".format(len(self.categories_),
32 changes: 32 additions & 0 deletions sklearn/preprocessing/tests/test_encoders.py
@@ -576,3 +576,35 @@ def test_one_hot_encoder_warning():
def test_categorical_encoder_stub():
from sklearn.preprocessing import CategoricalEncoder
assert_raises(RuntimeError, CategoricalEncoder, encoding='ordinal')


def test_one_hot_encoder_invalid_handle_missing():
X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]])
y = np.array([[4, 1, 1]])
# Test that one hot encoder raises an error when an invalid
# 'handle_missing' value is passed.
oh = OneHotEncoder(handle_unknown='error', handle_missing='abcde')
oh.fit(X)
assert_raises(ValueError, oh.transform, y)
Member

please test that an appropriate error message is raised

Author

Working on the tests now.
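
One way to act on the request above is to assert on the message as well as on the exception type. This matters here because the test input `y` contains a category (4) that was not seen during `fit`, so with `handle_unknown='error'` a `ValueError` could be raised for that reason alone. The sketch below uses only the standard library. `check_handle_missing` is a hypothetical stand-in for the validation added to `transform`, and `assertRaisesRegex` from `unittest` stands in for whichever assertion helper the project's test suite prefers.

```python
import re
import unittest


def check_handle_missing(handle_missing):
    """Hypothetical stand-in for the validation added to transform."""
    allowed = ("all-missing", "all-zero", "category")
    if handle_missing not in allowed:
        raise ValueError(
            "Wrong 'handle_missing' value specified. 'handle_missing' "
            "should be one of {0}. Got {1!r}.".format(list(allowed),
                                                      handle_missing))


class TestHandleMissingValidation(unittest.TestCase):
    def test_invalid_value_reports_message(self):
        # Matching on the message confirms that this specific check fired,
        # not some unrelated ValueError.
        expected = re.escape("Wrong 'handle_missing' value specified")
        with self.assertRaisesRegex(ValueError, expected):
            check_handle_missing("abcde")

    def test_valid_values_pass(self):
        for value in ("all-missing", "all-zero", "category"):
            check_handle_missing(value)


# Run the two tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    TestHandleMissingValidation)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```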



def test_one_hot_encoder_missing_values_none_handle_missing_passed():
X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]])
y = np.array([[4, 1, 1]])
# Test that one hot encoder raises an error when an invalid
# 'handle_missing' value is passed together with missing_values=None.
oh = OneHotEncoder(handle_unknown='error', missing_values=None, handle_missing='abcde')
oh.fit(X)
assert_raises(ValueError, oh.transform, y)


def test_one_hot_encoder_handle_missing_all_zeros():
pass


def test_one_hot_encoder_handle_missing_all_missing():
pass


def test_one_hot_encoder_handle_missing_category():
pass