[MRG+1] discrete branch: add encoding option to KBinsDiscretizer by qinhanmin2014 · Pull Request #9647 · scikit-learn/scikit-learn · GitHub

Merged
merged 20 commits into from
Sep 7, 2017

Conversation

@qinhanmin2014 (Member) commented Aug 30, 2017

Reference Issue

Fixes #9336

What does this implement/fix? Explain your changes.

Add an encode option to KBinsDiscretizer:
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default is set to 'ordinal', mainly because of (3).
(2) Only the discretized features are one-hot encoded when ignored_features is set.
(Following OneHotEncoder, non-categorical features are always stacked to the right of the matrix.)
(3) It seems hard to support inverse_transform for one-hot output because OneHotEncoder doesn't support inverse_transform.

Any other comments?

@jnothman (Member)

Thanks for this. Tests are failing, though

@jnothman (Member)

Oh you mentioned that. A strange failure. Let me check what that branch is doing...

@jnothman (Member)

So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly.

@qinhanmin2014 (Member, Author)

@jnothman Thanks. Kindly inform me when you finish. :)

@jnothman (Member)

Try pushing an empty commit

@qinhanmin2014 qinhanmin2014 changed the title [WIP] discrete branch: add encoding option to KBinsDiscretizer [MRG] discrete branch: add encoding option to KBinsDiscretizer Aug 31, 2017
@qinhanmin2014 (Member, Author)

@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :)

@qinhanmin2014 (Member, Author)

@jnothman Finally, the CIs are green.

and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
and return a dense matrix.
Review comment (Member):

dense array

encode the transformed result with one-hot encoding
and return a dense matrix.
ordinal:
do not encode the transformed result.
Review comment (Member):

Return the bin identifier encoded as an integer value.

@qinhanmin2014 (Member, Author)

@ogrisel Thanks. Comments addressed.

@lesteve (Member) left a comment:

A superficial review, without looking at the heart of the code.

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
if self.encode not in valid_encode:
raise ValueError('Invalid encode value. '
Review comment (Member):

Add the value provided by the user in the error message, i.e. something like this:

"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)
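The suggested message can be sketched in isolation (the `valid_encode` tuple matches the PR; the invalid value `'one-hot'` is a hypothetical user input, chosen only to show the formatting):

```python
# Sketch of the suggested error message construction
valid_encode = ('onehot', 'onehot-dense', 'ordinal')
encode = 'one-hot'  # hypothetical invalid user input

if encode not in valid_encode:
    msg = ("Valid options for 'encode' are {}. "
           "Got 'encode={}' instead.".format(sorted(valid_encode), encode))
```

Including the offending value makes the resulting ValueError self-explanatory for the user.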

retain_order=True)

# only one-hot encode discretized features
mask = np.array([True] * X.shape[1])
Review comment (Member):

Probably more readable to use np.repeat(True, X.shape[1])

# don't support inverse_transform
if self.encode != 'ordinal':
raise ValueError("inverse_transform only support "
"encode='ordinal'.")
Review comment (Member):

Add the value of self.encode in the error message, e.g. . "Got {} instead"
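The ordinal round trip that inverse_transform does support here can be sketched with the released API (X, n_bins, and strategy below are illustrative; the inverse maps each bin identifier back to its bin center):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 4.0]])

est = KBinsDiscretizer(n_bins=2, encode='ordinal',
                       strategy='uniform').fit(X)
Xt = est.transform(X)               # integer bin identifiers, same shape as X
X_back = est.inverse_transform(Xt)  # bin centers, same shape as X
```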

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal')
Review comment (Member):

I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.

Review comment (Member, Author):

@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.

Review comment (Member):

From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections

When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:

order : {'C', 'F', 'A'}
    Description of `order`.

assert_raises(ValueError, est.inverse_transform, X)
est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
encode='onehot').fit(X)
expected3 = est.transform(X)
Review comment (Member):

Should you not check that the output is sparse in the onehot case?

Also probably check that the output is not sparse in the onehot-dense case.
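This suggestion can be sketched as assertions against the released API (X below is illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.5], [-1.0, 2.5], [0.0, 3.5], [1.0, 4.5]])

Xt_sparse = KBinsDiscretizer(n_bins=2, encode='onehot',
                             strategy='uniform').fit_transform(X)
Xt_dense = KBinsDiscretizer(n_bins=2, encode='onehot-dense',
                            strategy='uniform').fit_transform(X)
```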

@qinhanmin2014 (Member, Author)

@lesteve Comments addressed except for the first one. Thanks.

@qinhanmin2014 (Member, Author)

@lesteve Thanks. Comment addressed.

@glemaitre (Member)

I find the naming of both 'onehot' and 'onehot-dense' a bit fuzzy, since it is not explicit from the naming that 'onehot' is sparse by default. Would it make sense to rename 'onehot' -> 'onehot-sparse'?

@glemaitre (Member)

Oh I see this is the same naming in the CategoricalEncoder PR. So that might be fine.

@glemaitre (Member) left a comment:

A couple of suggestions. @jnothman I suppose this code will probably change to use the CategoricalEncoder?

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
Review comment (Member):

comma after }

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.

encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
method used to encode the transformed result.
Review comment (Member):

method -> Method

method used to encode the transformed result.

onehot:
encode the transformed result with one-hot encoding
Review comment (Member):

encode -> Encode

encode the transformed result with one-hot encoding
and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
Review comment (Member):

encode -> Encode

encode the transformed result with one-hot encoding
and return a dense array.
ordinal:
return the bin identifier encoded as an integer value.
Review comment (Member):

return -> Return

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')

valid_encode = ['onehot', 'onehot-dense', 'ordinal']
Review comment (Member):

you might want to use a tuple instead of a list.

if self.encode not in valid_encode:
raise ValueError("Valid options for 'encode' are {}. "
"Got 'encode = {}' instead."
.format(sorted(valid_encode), self.encode))
Review comment (Member):

This seems unnecessary to sort.

retain_order=True)

# only one-hot encode discretized features
mask = np.repeat(True, X.shape[1])
Review comment (Member):

It would be faster to use:

mask = np.ones(X.shape[1], dtype=bool)
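For what it's worth, the two mask constructions are equivalent in dtype and values; np.ones simply allocates the boolean array directly instead of repeating a scalar. A quick sketch (n_features is illustrative):

```python
import numpy as np

n_features = 5
mask_repeat = np.repeat(True, n_features)    # repeats a Python scalar
mask_ones = np.ones(n_features, dtype=bool)  # allocates the array directly
```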

@@ -174,3 +176,40 @@ def test_numeric_stability():
X = X_init / 10**i
Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
assert_array_equal(Xt_expected, Xt)


def test_encode():
Review comment (Member):

I would probably split this test into several tests
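One way to split test_encode, sketched against the released API with one small test per encode value (X and the test names are illustrative, not from the PR):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0, 0.0], [1.0, 4.0], [2.0, 8.0], [3.0, 12.0]])

def test_encode_ordinal():
    Xt = KBinsDiscretizer(n_bins=2, encode='ordinal',
                          strategy='uniform').fit_transform(X)
    assert Xt.shape == X.shape  # one integer bin id per input value

def test_encode_onehot_sparse():
    Xt = KBinsDiscretizer(n_bins=2, encode='onehot',
                          strategy='uniform').fit_transform(X)
    assert sparse.issparse(Xt)
    assert Xt.shape == (X.shape[0], 2 * X.shape[1])  # one column per bin

def test_encode_onehot_dense():
    Xt = KBinsDiscretizer(n_bins=2, encode='onehot-dense',
                          strategy='uniform').fit_transform(X)
    assert not sparse.issparse(Xt)
    assert Xt.shape == (X.shape[0], 2 * X.shape[1])
```

Smaller tests make a failure point directly at the encode value that broke.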

@jnothman (Member) commented Sep 3, 2017 via email

@qinhanmin2014 (Member, Author)

@glemaitre Thanks. Comments addressed.

@@ -459,6 +459,8 @@ K-bins discretization
>>> est.bin_width_
array([ 3., 1., 2.])

By default the output is one-hot encoded into a sparse matrix (See :class:`OneHotEncoder`)
Review comment (Member):

I'd rather this referred to a section of the user guide which describes what this means, rather than provide a (in this context irrelevant) tool to do it.

Review comment (Member, Author):

@jnothman The only place I can think of is http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features. Could you please provide more specific suggestion? Thanks.

Review comment (Member):

That seems the right place to point to.

@qinhanmin2014 (Member, Author)

@jnothman I changed the link to Encoding categorical features. Now it looks like this:
[screenshot of the rendered documentation]
Further suggestions welcome :)

@hlin117 (Contributor) left a comment:

Overall looks great! Nice work.