8000 [MRG] Refactored OneHotEncoder by vighneshbirodkar · Pull Request #6602 · scikit-learn/scikit-learn · GitHub

[MRG] Refactored OneHotEncoder #6602


Closed

Conversation

vighneshbirodkar
Contributor

New Features in OneHotEncoder

  • Support for strings, negative integers (and anything that can be put in an object array)
  • Specify discrete values using the classes parameter instead of n_values.
  • Uses a LabelEncoder instance for each column.

Changes

  • Renames _transform_selected to _apply_selected, giving it the ability to optionally not return the transformed values.
  • _apply_selected can no longer accept lists; it must be given an np.ndarray. This is because the input can be of np.object dtype, which cannot always be cast as a whole to np.int or np.float. The transformed and non-transformed parts of the array are converted to the specified type before returning.
  • Instead of raising an error only when a feature in a column exceeds the maximum value seen during fit, this change raises an error when any previously unknown value is seen.
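The per-column approach described above can be sketched in plain NumPy — np.unique with return_inverse is essentially what fitting a LabelEncoder per column does. This is an illustrative sketch, not the PR's actual code:

```python
import numpy as np

def one_hot_columns(X):
    """Sketch: one-hot encode each column of an object array by its
    unique values, analogous to fitting a LabelEncoder per column."""
    blocks = []
    for j in range(X.shape[1]):
        # `classes` plays the role of LabelEncoder.classes_ for column j
        classes, inverse = np.unique(X[:, j], return_inverse=True)
        block = np.zeros((X.shape[0], classes.shape[0]))
        block[np.arange(X.shape[0]), inverse] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

# Strings and negative integers both work, since everything is
# handled through an object array:
X = np.array([['cat', -1], ['dog', 3], ['cat', 3]], dtype=object)
print(one_hot_columns(X))
```

Because each column gets its own set of classes, the output width is the total number of distinct values across columns.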

@vighneshbirodkar
Contributor Author

@jnothman Please have a look at this.

def test_one_hot_encoder_categorical_features():
    X = np.array([[3, 2, 1], [0, 1, 1]])
-   X2 = np.array([[1, 1, 1]])
+   X2 = np.array([[3, 1, 1]])
Contributor Author

This is the only potentially breaking behavior introduced. But not knowing if a value is in range but still unknown was a drawback of the original implementation.

Member

I don't get what you're saying has changed. Please be explicit; the reviewers are not as intimately familiar with this code right now as you are.

Contributor Author

Consider the following code:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[3], [5], [7]])
enc.fit(X)
print(enc.transform([[4]]).toarray())

Output on master

[[ 0.  0.  0.]]

Output on this branch

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(enc.transform([[4]]).toarray())
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1845, in transform
    selected=self.categorical_features)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1657, in _apply_selected
    return transform(X)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1882, in _transform
    raise ValueError(msg)
ValueError: Unknown feature(s) [4] in column 0

In the current master implementation, any value of a feature less than or equal to its maximum value during fit is accepted when n_values="auto". With the above code run on the master branch, any value in [0, 7] will not throw an error, despite never having been seen during fit.

To support strings I have to make the assumption that any value of a feature supplied during transform should either have been seen during fit, or specified explicitly using the classes/values argument. This will break existing code like the example above.
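The stricter behaviour described here — any value not seen during fit raises an error — amounts to a membership check per column. A minimal sketch (the helper name is hypothetical, but the error message mirrors the traceback above):

```python
import numpy as np

def check_known(column, known_values, col_idx=0):
    """Raise if `column` contains any value never seen during fit,
    regardless of whether it falls inside the fitted value range."""
    unknown = np.setdiff1d(np.asarray(column), np.asarray(known_values))
    if unknown.size:
        raise ValueError("Unknown feature(s) %s in column %d"
                         % (unknown.tolist(), col_idx))

check_known([3, 5], known_values=[3, 5, 7])    # seen during fit: OK
# check_known([4], known_values=[3, 5, 7])     # in range, but unseen: raises
```

Note that 4 lies inside the fitted range [3, 7], so the range-based check on master would accept it; the membership check does not.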

Member

This should be documented as a bug-fix then.

That is when n_values is set to "auto", having categories in transform that are not present during training will raise an error.

Contributor Author

It wasn't really a bug. It was behaving as it was documented. But we cannot keep that behavior if we choose to support strings.

Contributor Author

Ok. It is clear to me what you expect. What do you think is the best way to go about this? Add an if/else clause and revert to the old logic when the array is int? Or keep processing the array as object type and emulate the old functionality using np.arange arrays?

Member

I'm not entirely sure what you're asking, and think you should go write some tests and implement something that works. If you want, ask for feedback after the tests are written that the tested functionality is correct, before you implement it. Once you've got an implementation we can talk about whether there's a better approach.

Member

@vighneshbirodkar Sorry, but I don't completely understand. From your original code snippet, you have explicitly set handle_unknown=error. How is not raising an error, when you have an unknown sample during transform and have already set handle_unknown=error, documented behavior?

Member

Oh yes, the previous behaviour was to look at the range and not the actual values. Hmm.

Member

What was the consensus here (#5270), I forgot. To just error?

@MechCoder MechCoder changed the title Refactored OneHotEncoder [MRG] Refactored OneHotEncoder Mar 30, 2016
@amueller
Member

btw, we should look at http://wdm0006.github.io/categorical_encoding/ some time

@vighneshbirodkar vighneshbirodkar force-pushed the ohe_fix branch 2 times, most recently from 507cc9f to d8e4781 on April 16, 2016
@vighneshbirodkar
Contributor Author

@MechCoder The following snippet illustrates why X[:, sel] and X[:, ind[sel]] are different:

>>> import numpy as np
>>> np.__version__
'1.7.1'
>>> import scipy
>>> scipy.__version__
'0.11.0'
>>> from scipy import sparse
>>> X = np.arange(16).reshape(4, 4)
>>> X = sparse.csr_matrix(X)
>>> sel = np.array([True, False, True, False], dtype=np.bool)
>>> ind = np.arange(4)
>>> ind = ind.astype(np.int)
>>> X[:, sel].toarray()
array([[ 1,  0,  1,  0],
       [ 5,  4,  5,  4],
       [ 9,  8,  9,  8],
       [13, 12, 13, 12]])
>>> X[:, ind[sel]].toarray()
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])

@vighneshbirodkar
Contributor Author

@MechCoder @jnothman I believe I have addressed all your comments

@jnothman
Member
jnothman commented May 3, 2016

I'm offline atm with a cached page and can't look at your code for what you mean by a copy; but I suspect that a copy argument isn't really appropriate here. Usually non-copying techniques allow an operation to be performed in-place, but obviously one-hot encoding is not a candidate for that. Moreover, inherent in one-hot encoding is the user's admission that the data can be duplicated without memory problems: i.e. for input sized n, minimum memory usage 2n is guaranteed. A temporary copy raising this to 3n is not worth the additional parameter (especially as it looks like X_mask may already make this 4n relative to 3n).

@jnothman
Member
jnothman commented May 3, 2016

btw, we should look at http://wdm0006.github.io/categorical_encoding/ some time

This includes an OrdinalEncoder like our LabelEncoder but for each feature. Except for that, an import of our OneHotEncoder, and a Feature Hasher implementation, all others are applications of patsy.C to each column of the input, with contrasts Diff, Poly, Sum and Helmert. I don't think it is relevant to this PR.

@jnothman
Member
jnothman commented May 3, 2016

@vighneshbirodkar, please avoid rewriting commit history for work that has been commented on and is being updated. It makes GitHub links from email unusable. I assume that's the trap I've been falling into when trying to read your updates.

@vighneshbirodkar
Contributor Author

@jnothman Sorry about that. I will keep that in mind.

@MechCoder
Member

The following snippet illustrates why X[:, sel] and X[:, ind[sel]] are different

Thanks for the illustration! As @jnothman suggests, that is why we have safe_mask, but it is of course easier (and shorter) to keep it as is.
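The idea behind safe_mask can be sketched in a few lines: old scipy releases mishandled boolean column masks on sparse matrices (as the snippet above shows), so the mask is converted to integer indices whenever the input is sparse. This is an illustrative sketch of the idea, not sklearn.utils.safe_mask itself:

```python
import numpy as np
from scipy import sparse

def safe_col_mask(X, mask):
    """Return an indexer safe to use on X: integer indices when X is
    sparse and `mask` is boolean, the mask unchanged otherwise."""
    mask = np.asarray(mask)
    if sparse.issparse(X) and mask.dtype == bool:
        return np.where(mask)[0]   # boolean mask -> integer indices
    return mask

X = sparse.csr_matrix(np.arange(16).reshape(4, 4))
sel = np.array([True, False, True, False])
# Selects columns 0 and 2 regardless of the scipy version:
print(X[:, safe_col_mask(X, sel)].toarray())
```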

@MechCoder
Member

@jnothman @vighneshbirodkar I also think deprecating n_values="auto" as the default behaviour is a good idea. Whether you do this with an if-else clause that defaults to the old behaviour, or by fitting the LabelEncoder on range(col) instead of col, is up to you. I would do the second because it seems easier, but it is personal preference (assuming there are no huge losses in speed).

@jnothman
Member
jnothman commented May 7, 2016

I also think deprecating n_values="auto" as the default behaviour is a good idea.

For what reason?

@amueller amueller added this to the 0.18 milestone Aug 31, 2016
@amueller
Member

Bumpety bump!
@vighneshbirodkar do you have any time to work on this?
I know we usually don't want to push releases back for features, but I think we really need to fix this. Currently encoding strings without pandas is a hell of a pain and it's kind of embarrassing.
@jnothman @ogrisel what do you think?

    if np.any(X < 0):
        raise ValueError("X needs to contain only non-negative integers.")

def _fit(self, X):
    "Assumes `X` contains only catergorical features."
Member

typo

@jnothman
Member

Currently encoding strings without pandas is a hell of a pain and it's kind of embarrassing

Another book issue? :P

@amueller
Member

@jnothman this one will not make it into the book. But trying to write the book and having to call get_dummies and not being able to use OneHotEncoder felt so strange.

@vighneshbirodkar
Contributor Author

@jnothman @ogrisel @amueller
Notice the current behavior:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder()
X = np.array([[10]])
print(enc.fit_transform(X).toarray()) # [[1. ]]
print(enc.transform([[7]]).toarray()) # [[0. ]]

The documentation says that we infer the range, so shouldn't the one-hot encoded output have 10 columns?

@amueller
Member

I think the behavior "should be" that we infer unique values, and not ranges. If that's what we currently do, that's good. If it's not documented properly, that's bad.

@vighneshbirodkar
Contributor Author

@amueller
The documentation says that we infer ranges, and we do (sort of).
What's actually happening is that we first infer the range, and then remove the columns (after one-hot encoding) that are all zeros. Then during subsequent transform calls, we completely ignore values that were not seen during fit, even if handle_unknown was set to "error".

Here is an example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[1], [5], [7]])
enc.fit(X)

print(enc.transform([[7]]).toarray())  # [[ 0.  0.  1.]]
print(enc.transform([[1]]).toarray())  # [[ 1.  0.  0.]]
print(enc.transform([[6]]).toarray())  # [[ 0.  0.  0.]]

@jnothman Do you think all of this behavior should be retained? I understand your point about backward compatibility, and I agree that to maintain it we have to infer the range of values by default; but if we do that, we should expect to see any value within the range during transform and encode it.
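The mixed behaviour described above — range inference followed by an active-features mask — can be reproduced for a single integer column in a few lines of NumPy. This is a hypothetical reconstruction for illustration, not the actual master code:

```python
import numpy as np

def fit_master_style(col):
    """Sketch of master's n_values='auto' logic for one integer column:
    encode over range(max+1), then keep only the columns that were
    active (non-zero) during fit."""
    n_values = col.max() + 1
    onehot = np.eye(n_values)[col]      # range-based one-hot encoding
    active = onehot.sum(axis=0) > 0     # mask for all-zero columns
    return active

def transform_master_style(col, active):
    onehot = np.eye(active.shape[0])[col]
    return onehot[:, active]            # unseen values vanish silently

active = fit_master_style(np.array([1, 5, 7]))
# 6 is in range but unseen, so its column was masked out at fit time:
print(transform_master_style(np.array([6]), active))  # all zeros, no error
```

This is why a value like 6 produces an all-zero row rather than an error: its slot existed in the range-based encoding but was dropped by the mask.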

@amueller
Member

I feel that the current behavior is a total mess (that I take all blame for) and I think we should infer unique values and if there are new ones, we should error.

@vighneshbirodkar
Contributor Author

@amueller I second your opinion (we also talked about it on campus if you remember), but @jnothman had some concerns about backward compatibility (in this thread), which are also valid. But I am not sure to what extent we can maintain backward compatibility.

@jnothman
Member

I don't think there's anything wrong with all columns left blank because a value was unseen in training. Datasets aren't going to be stratified over all categorical variables. New values should only lead to errors with an option.

I think the only bug in the current approach is handle_unknown being limited to out-of-range type errors, which @vighneshbirodkar says is inconsistent with docs. Changing this behaviour is going to create errors where there were none before, but I'm inclined to say that's okay. It would not be okay if we weren't FOSS.

I think that the design in terms of ranges followed by a mask is ugly, but I don't think we have any excuse for disregarding backwards compatibility. I think we can deprecate the attributes that are tied to this behaviour.

The design limitation to integer features is unfortunate. We should support any hashable (or do we need orderable?) object. Note also that the strings case can be achieved with a wrapper around DictVectorizer.

@amueller
Member
amueller commented Sep 1, 2016

how does the concept of ranges transfer to hashable objects?

@jnothman
Member
jnothman commented Sep 1, 2016

Without expending the time to look: I think because the ranges are now masked, we've effectively got sets of integers atm. That's a subset of hashable objects. Ranges only then would be a shorthand for specifying the set of values to create slots for... but backwards compatibility remains tricky then.

@vighneshbirodkar
Contributor Author
vighneshbirodkar commented Sep 1, 2016

The documentation does not say anything about unseen values; it just states that the encoder determines the range of values. I think the current behavior is wrong because the encoder cannot encode all values in range during transform when it claims to. Here is how I think the OHE should behave. I also think that if we choose to make it backward compatible, this is the behavior it should adopt for n_values='auto'. Here is a snippet to illustrate.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[1], [5], [7]])
enc.fit(X)

print(enc.transform([[7]]).toarray())  # [[ 0. 0. 0. 0. 0. 0. 0. 1.]]
print(enc.transform([[1]]).toarray())  # [[ 0. 1. 0. 0. 0. 0. 0. 0.]]
print(enc.transform([[6]]).toarray())  # [[ 0. 0. 0. 0. 0. 0. 1. 0.]]

@jnothman Do you think this is acceptable?

@jnothman
Member
jnothman commented Sep 1, 2016

"Range" can be understood either way, IMO. I don't know what you're asking about being acceptable.

@vighneshbirodkar
Contributor Author

@jnothman I am asking if it's okay for the OHE to behave the way I wrote in my last snippet. I think its current behavior is a weird combination of inferring ranges and remembering values.

@jnothman
Member
jnothman commented Sep 1, 2016

@vighneshbirodkar. My summary of the situation is that:

  • We need to keep backwards compatibility in parameters and attributes (up to deprecation). The messy part is handle_unknown='error' where the prior behaviour wasn't obvious if n_values='auto'. My proposed, backwards-compatible solution: define class UnknownButInRangeWarning(FutureWarning) and warn that in the future an error will be raised if a value is unknown but in range for the 'auto' setting. Further, say that a user who wants the future behaviour can use warnings.simplefilter('error', UnknownButInRangeWarning). (Alternatively offer an 'error-strict' setting that will become default.)
  • I think a more user-friendly attribute interface would have:
    • one attribute which specifies which output features encode which input features, either in the style of feature_indices_ but without the masking madness, or just as an array of indices. Either way we have to deprecate the current attributes and invent new attribute names.
    • one attribute which specifies which input values are encoded for each output feature
    • I don't think that we should make the list of LabelEncoders public. If nothing else, this further entrenches confusion about label vs feature encoding.
  • Don't worry about the copy parameter. Let's assume always copying.
  • This discussion is long and unwieldy. Perhaps we should open a new PR to clean the slate.
  • @amueller is anxious to have this in 0.18. Do you feel up to having this complete in the next couple of days?
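The warning-based transition proposed in the first bullet might look like the following. All names and messages here are hypothetical sketches of the proposal; none of this was merged:

```python
import warnings

class UnknownButInRangeWarning(FutureWarning):
    """Hypothetical warning: a value was not seen during fit but falls
    inside the range inferred under n_values='auto'."""

def check_transform_value(value, seen, n_values):
    """Sketch of the proposed transition for handle_unknown='error'."""
    if value in seen:
        return
    if 0 <= value < n_values:
        # Backwards compatible: warn now, promise an error later.
        warnings.warn("Value %r was not seen during fit; in the future "
                      "this will raise an error." % value,
                      UnknownButInRangeWarning)
    else:
        raise ValueError("Value %r is out of range." % value)

# A user who wants the future behaviour today can opt in:
# warnings.simplefilter('error', UnknownButInRangeWarning)
check_transform_value(4, seen={1, 5, 7}, n_values=8)  # warns, for now
```

With the simplefilter line uncommented, the same call raises immediately, which is exactly the opt-in the proposal describes.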

@jnothman
Member
jnothman commented Sep 1, 2016

And no, if values='auto' it should continue to only output those values seen in training.

@GaelVaroquaux
Member
GaelVaroquaux commented Sep 1, 2016 via email

@jnothman
Member
jnothman commented Sep 1, 2016

I would say: option to error or to have an "etc" class.

I think that is a separate issue altogether, and should remain out of this PR. (And please let us not call it a "class"!) Note that again the etc feature will go unused in a classic fit-predict paradigm, and a 'clip' setting would make more sense in that case...

@GaelVaroquaux
Member
I would say: option to error or to have an "etc" class.

I think that is a separate issue altogether, and should remain out of this PR.

Maybe. It wasn't obvious to me

(And please let us not call it a "class"!)

Whatever. Terminology is a lost cause anyhow, as each microfield has its own expectations.

Note that again the etc feature will go unused in a classic fit-predict
paradigm,

Why? I don't understand here.

and a 'clip' setting would make more sense in that case...

Possible, but that would change the number of samples, and that is not an option currently.

@vighneshbirodkar
Contributor Author

@jnothman I like your idea of using an UnknownButInRangeWarning. I will make a new PR for this, but I hope it's alright to use the same branch. Just to clarify: should UnknownButInRangeWarning and error-strict both be implemented, or just one of them?

@jnothman
Member
jnothman commented Sep 1, 2016

If we're implementing 'error-strict' I'd suggest a vanilla FutureWarning or DeprecationWarning.

@vighneshbirodkar
Contributor Author

@GaelVaroquaux @amueller I am in favor of @jnothman's idea of error-strict, eventually making it the default. I can submit a PR for that by tomorrow.

@jnothman
Member
jnothman commented Sep 1, 2016

Note that again the etc feature will go unused in a classic fit-predict paradigm,

Why? I don't understand here.

You're right, Gaël, if the set of values is given explicitly, 'etc' remains useful. If values='auto' then the etc features will be empty in fit_transform.

re 'clip'
Possible, but that would change the number of samples, and that not an option currently.

I don't mean to change the number of samples, just to have a feature value that it defaults to if out of range and is then encoded.

I still think that further options for handle_unknown are a separate feature.
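The "etc" slot Gaël suggests, and the 'clip'-style default value discussed just above, can both be sketched as an extra output column that absorbs any out-of-vocabulary value. This is a hypothetical illustration of the idea, never part of this PR:

```python
import numpy as np

def encode_with_etc(col, classes):
    """One-hot encode over `classes` plus one trailing 'etc' slot that
    catches any value not in `classes`, instead of erroring."""
    lookup = {c: i for i, c in enumerate(classes)}
    etc = len(classes)                     # index of the catch-all slot
    idx = np.array([lookup.get(v, etc) for v in col])
    return np.eye(len(classes) + 1)[idx]

# 'emu' was never in the fitted classes, so it lands in the etc column:
print(encode_with_etc(['cat', 'emu'], classes=['cat', 'dog']))
```

As jnothman notes, under values='auto' the etc column would always be empty in fit_transform, since by construction every fit value is in the vocabulary; it only becomes useful when the class set is given explicitly.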

@GaelVaroquaux
Member
GaelVaroquaux commented Sep 1, 2016 via email

@jnothman
Member
jnothman commented Sep 1, 2016

Yes, we're already doing a couple of things here: cleaning up some weird behaviour and interface; and supporting non-int input. Arguably those too could be done separately except that the latter depends on and motivates the former.

@vighneshbirodkar
Contributor Author

Closed in favor of #7327

