[MRG+1] discrete branch: add encoding option to KBinsDiscretizer by qinhanmin2014 · Pull Request #9647 · scikit-learn/scikit-learn · GitHub

Merged
merged 20 commits into from
Sep 7, 2017

Conversation

@qinhanmin2014 (Member) commented Aug 30, 2017

Reference Issue

Fixes #9336

What does this implement/fix? Explain your changes.

Add an encode option to KBinsDiscretizer:
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default is set to 'ordinal', mainly because of (3).
(2) Only the discretized features are one-hot encoded when ignored_features is set.
(Following OneHotEncoder, non-categorical features are always stacked to the right of the matrix.)
(3) It seems hard to support inverse_transform for one-hot output because OneHotEncoder doesn't support inverse_transform.

Any other comments?

@jnothman (Member)

Thanks for this. Tests are failing, though

@jnothman (Member)

Oh you mentioned that. A strange failure. Let me check what that branch is doing...

@jnothman (Member)

So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly.

@qinhanmin2014 (Member, Author)

@jnothman Thanks. Kindly inform me when you finish. :)

@jnothman (Member)

Try pushing an empty commit

@qinhanmin2014 qinhanmin2014 changed the title [WIP] discrete branch: add encoding option to KBinsDiscretizer [MRG] discrete branch: add encoding option to KBinsDiscretizer Aug 31, 2017
@qinhanmin2014 (Member, Author)

@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :)

@qinhanmin2014 (Member, Author)

@jnothman Finally, the CIs are green.

and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
and return a dense matrix.
Review comment (Member):

dense array

encode the transformed result with one-hot encoding
and return a dense matrix.
ordinal:
do not encode the transformed result.
Review comment (Member):

Return the bin identifier encoded as an integer value.

@qinhanmin2014 (Member, Author)

@ogrisel Thanks. Comments addressed.

@lesteve (Member) left a comment:

A superficial review, without looking at the heart of the code.

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
if self.encode not in valid_encode:
raise ValueError('Invalid encode value. '
Review comment (Member):

Add the value provided by the user in the error message, i.e. something like this:

"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)
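The suggested message can be sketched in isolation (the `valid_encode` tuple matches the PR; the invalid value `'one-hot'` is a hypothetical user input, chosen only to show the formatting):

```python
# Sketch of the suggested error message construction
valid_encode = ('onehot', 'onehot-dense', 'ordinal')
encode = 'one-hot'  # hypothetical invalid user input

if encode not in valid_encode:
    msg = ("Valid options for 'encode' are {}. "
           "Got 'encode={}' instead.".format(sorted(valid_encode), encode))
```

Including the offending value makes the resulting ValueError self-explanatory for the user.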

retain_order=True)

# only one-hot encode discretized features
mask = np.array([True] * X.shape[1])
Review comment (Member):

Probably more readable to use np.repeat(True, X.shape[1])

# don't support inverse_transform
if self.encode != 'ordinal':
raise ValueError("inverse_transform only support "
"encode='ordinal'.")
Review comment (Member):

Add the value of self.encode in the error message, e.g. . "Got {} instead"
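The ordinal round trip that inverse_transform does support here can be sketched with the released API (X, n_bins, and strategy below are illustrative; the inverse maps each bin identifier back to its bin center):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 4.0]])

est = KBinsDiscretizer(n_bins=2, encode='ordinal',
                       strategy='uniform').fit(X)
Xt = est.transform(X)               # integer bin identifiers, same shape as X
X_back = est.inverse_transform(Xt)  # bin centers, same shape as X
```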

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal')
Review comment (Member):

I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.

Review comment (Member, Author):

@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.

Review comment (Member):

From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections

When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:

order : {'C', 'F', 'A'}
    Description of `order`.

assert_raises(ValueError, est.inverse_transform, X)
est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
encode='onehot').fit(X)
expected3 = est.transform(X)
Review comment (Member):

Should you not check that the output is sparse in the onehot case?

Also probably check that the output is not sparse in the onehot-dense case.
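This suggestion can be sketched as assertions against the released API (X below is illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.5], [-1.0, 2.5], [0.0, 3.5], [1.0, 4.5]])

Xt_sparse = KBinsDiscretizer(n_bins=2, encode='onehot',
                             strategy='uniform').fit_transform(X)
Xt_dense = KBinsDiscretizer(n_bins=2, encode='onehot-dense',
                            strategy='uniform').fit_transform(X)
```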

@qinhanmin2014 (Member, Author)

@lesteve Comments addressed except for the first one. Thanks.

@qinhanmin2014 (Member, Author)

@lesteve Thanks. Comment addressed.

@glemaitre (Member)

I find the naming of both 'onehot' and 'onehot-dense' a bit fuzzy, since it is not explicit from the naming that 'onehot' is sparse by default. Would it make sense to rename 'onehot' -> 'onehot-sparse'?

@glemaitre (Member)

Oh I see this is the same naming in the CategoricalEncoder PR. So that might be fine.

@glemaitre (Member) left a comment:

A couple of suggestions. @jnothman I suppose this code will probably change to use the CategoricalEncoder?

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
Review comment (Member):

comma after }

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
method used to encode the transformed result.
Review comment (Member):

method -> Method

method used to encode the transformed result.

onehot:
encode the transformed result with one-hot encoding
Review comment (Member):

encode -> Encode

encode the transformed result with one-hot encoding
and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
Review comment (Member):

encode -> Encode

encode the transformed result with one-hot encoding
and return a dense array.
ordinal:
return the bin identifier encoded as an integer value.
Review comment (Member):

return -> Return

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
Review comment (Member):

you might want to use a tuple instead of a list.

if self.encode not in valid_encode:
raise ValueError("Valid options for 'encode' are {}. "
"Got 'encode = {}' instead."
.format(sorted(valid_encode), self.encode))
Review comment (Member):

This seems unnecessary to sort.

retain_order=True)

# only one-hot encode discretized features
mask = np.repeat(True, X.shape[1])
Review comment (Member):

It would be faster to use:

mask = np.ones(X.shape[1], dtype=bool)
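For what it's worth, the two mask constructions are equivalent in dtype and values; np.ones simply allocates the boolean array directly instead of repeating a scalar. A quick sketch (n_features is illustrative):

```python
import numpy as np

n_features = 5
mask_repeat = np.repeat(True, n_features)    # repeats a Python scalar
mask_ones = np.ones(n_features, dtype=bool)  # allocates the array directly
```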

@@ -174,3 +176,40 @@ def test_numeric_stability():
X = X_init / 10**i
Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
assert_array_equal(Xt_expected, Xt)


def test_encode():
Review comment (Member):

I would probably split this test into several tests
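One way to split test_encode, sketched against the released API with one small test per encode value (X and the test names are illustrative, not from the PR):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 0.0], [1.0, 4.0], [2.0, 8.0], [3.0, 12.0]])

def test_encode_ordinal():
    Xt = KBinsDiscretizer(n_bins=2, encode='ordinal',
                          strategy='uniform').fit_transform(X)
    assert Xt.shape == X.shape  # one integer bin id per input value

def test_encode_onehot_sparse():
    Xt = KBinsDiscretizer(n_bins=2, encode='onehot',
                          strategy='uniform').fit_transform(X)
    assert sparse.issparse(Xt)
    assert Xt.shape == (X.shape[0], 2 * X.shape[1])  # one column per bin

def test_encode_onehot_dense():
    Xt = KBinsDiscretizer(n_bins=2, encode='onehot-dense',
                          strategy='uniform').fit_transform(X)
    assert not sparse.issparse(Xt)
    assert Xt.shape == (X.shape[0], 2 * X.shape[1])
```

Smaller tests make a failure point directly at the encode value that broke.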

@jnothman (Member) commented Sep 3, 2017 via email

@qinhanmin2014 (Member, Author)

@glemaitre Thanks. Comments addressed.

@@ -459,6 +459,8 @@ K-bins discretization
>>> est.bin_width_
array([ 3., 1., 2.])

By default the output is one-hot encoded into a sparse matrix (See :class:`OneHotEncoder`)
Review comment (Member):

I'd rather this referred to a section of the user guide which describes what this means, rather than provide a (in this context irrelevant) tool to do it.

Review comment (Member, Author):

@jnothman The only place I can think of is http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features. Could you please provide more specific suggestion? Thanks.

Review comment (Member):

That seems the right place to point to.

@qinhanmin2014 (Member, Author)

@jnothman I changed the link to Encoding categorical features. Now it looks like this:
[screenshot of the rendered documentation]
Further suggestions welcome :)

@hlin117 (Contributor) left a comment:

Overall looks great! Nice work.