8000 [MRG] Refactored OneHotEncoder by vighneshbirodkar · Pull Request #6602 · scikit-learn/scikit-learn · GitHub

[MRG] Refactored OneHotEncoder #6602


Closed

Conversation

vighneshbirodkar
Contributor

New Features in OneHotEncoder

  • Support for strings, negative integers (and anything that can be put in an object array)
  • Specify discrete values using the classes parameter instead of n_values.
  • Uses a LabelEncoder instance for each column.

Changes

  • Renames _transform_selected to _apply_selected, giving it the ability to optionally not return the transformed values.
  • _apply_selected can no longer accept lists; it must be given an np.ndarray. This is because the input can be of np.object dtype, which cannot always be cast as a whole to np.int or np.float. The transformed and non-transformed parts of the array are converted to the specified type before returning.
  • Instead of raising an error only when a feature in a column exceeds the maximum value seen during fit, this change raises an error when any previously unknown value is seen.
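The per-column approach described above can be sketched in plain NumPy — np.unique with return_inverse is essentially what fitting a LabelEncoder per column does. This is an illustrative sketch, not the PR's actual code:

```python
import numpy as np

def one_hot_columns(X):
    """Sketch: one-hot encode each column of an object array by its
    unique values, analogous to fitting a LabelEncoder per column."""
    blocks = []
    for j in range(X.shape[1]):
        # `classes` plays the role of LabelEncoder.classes_ for column j
        classes, inverse = np.unique(X[:, j], return_inverse=True)
        block = np.zeros((X.shape[0], classes.shape[0]))
        block[np.arange(X.shape[0]), inverse] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

# Strings and negative integers both work, since everything is
# handled through an object array:
X = np.array([['cat', -1], ['dog', 3], ['cat', 3]], dtype=object)
print(one_hot_columns(X))
```

Because each column gets its own set of classes, the output width is the total number of distinct values across columns.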

@vighneshbirodkar
Contributor Author

@jnothman Please have a look at this.

def test_one_hot_encoder_categorical_features():
    X = np.array([[3, 2, 1], [0, 1, 1]])
-   X2 = np.array([[1, 1, 1]])
+   X2 = np.array([[3, 1, 1]])
Contributor Author

This is the only potentially breaking behavior introduced. But not knowing if a value is in range but still unknown was a drawback of the original implementation.

Member

I don't get what you're saying has changed. Please be explicit; the reviewers are not as intimately familiar with this code right now as you are.

Contributor Author

Consider the following code:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[3], [5], [7]])
enc.fit(X)
print(enc.transform([[4]]).toarray())

Output on master

[[ 0.  0.  0.]]

Output on this branch

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(enc.transform([[4]]).toarray())
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1845, in transform
    selected=self.categorical_features)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1657, in _apply_selected
    return transform(X)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1882, in _transform
    raise ValueError(msg)
ValueError: Unknown feature(s) [4] in column 0

In the current master implementation, any value of a feature less than or equal to its maximum value during fit is accepted when n_values="auto". With the above code run on the master branch, any value in [0, 7] will not throw an error, despite never having been seen during fit.

To support strings I have to make the assumption that any value of a feature supplied during transform should either have been seen during fit, or specified explicitly using the classes/values argument. This will break existing code like the example above.
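The stricter behaviour described here — any value not seen during fit raises an error — amounts to a membership check per column. A minimal sketch (the helper name is hypothetical, but the error message mirrors the traceback above):

```python
import numpy as np

def check_known(column, known_values, col_idx=0):
    """Raise if `column` contains any value never seen during fit,
    regardless of whether it falls inside the fitted value range."""
    unknown = np.setdiff1d(np.asarray(column), np.asarray(known_values))
    if unknown.size:
        raise ValueError("Unknown feature(s) %s in column %d"
                         % (unknown.tolist(), col_idx))

check_known([3, 5], known_values=[3, 5, 7])    # seen during fit: OK
# check_known([4], known_values=[3, 5, 7])     # in range, but unseen: raises
```

Note that 4 lies inside the fitted range [3, 7], so the range-based check on master would accept it; the membership check does not.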

Member

This should be documented as a bug-fix then.

That is when n_values is set to "auto", having categories in transform that are not present during training will raise an error.

Contributor Author

It wasn't really a bug. It was behaving as it was documented. But we cannot keep that behavior if we choose to support strings.

Contributor Author

Ok. It is clear to me what you expect. What do you think is the best way to go about this? Add an if/else clause and revert to the old logic when the array is int? Or keep processing the array as object type and emulate the old functionality using np.arange arrays?

Member

I'm not entirely sure what you're asking, and think you should go write some tests and implement something that works. If you want, ask for feedback after the tests are written that the tested functionality is correct, before you implement it. Once you've got an implementation we can talk about whether there's a better approach.

Member

@vighneshbirodkar Sorry, but I don't completely understand. From your original code snippet, you have explicitly set handle_unknown=error. How is not raising an error, when you have an unknown sample during transform and have already set handle_unknown=error, documented behavior?

Member

Oh yes, the previous behaviour was to look at the range and not the actual values. Hmm.

Member

What was the consensus here (#5270), I forgot. To just error?

@MechCoder MechCoder changed the title Refactored OneHotEncoder [MRG] Refactored OneHotEncoder Mar 30, 2016
@amueller
Member

btw, we should look at http://wdm0006.github.io/categorical_encoding/ some time

@vighneshbirodkar vighneshbirodkar force-pushed the ohe_fix branch 2 times, most recently from 507cc9f to d8e4781 on April 16, 2016
@vighneshbirodkar
Contributor Author

@MechCoder The following snippet illustrates why X[:, sel] and X[:, ind[sel]] are different:

>>> import numpy as np
>>> np.__version__
'1.7.1'
>>> import scipy
>>> scipy.__version__
'0.11.0'
>>> from scipy import sparse
>>> X = np.arange(16).reshape(4, 4)
>>> X = sparse.csr_matrix(X)
>>> sel = np.array([True, False, True, False], dtype=np.bool)
>>> ind = np.arange(4)
>>> ind = ind.astype(np.int)
>>> X[:, sel].toarray()
array([[ 1,  0,  1,  0],
       [ 5,  4,  5,  4],
       [ 9,  8,  9,  8],
       [13, 12, 13, 12]])
>>> X[:, ind[sel]].toarray()
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])

@vighneshbirodkar
Contributor Author

@MechCoder @jnothman I believe I have addressed all your comments

@jnothman
Member
jnothman commented May 3, 2016

I'm offline atm with a cached page and can't look at your code for what you mean by a copy; but I suspect that a copy argument isn't really appropriate here. Usually non-copying techniques allow an operation to be performed in-place, but obviously one-hot encoding is not a candidate for that. Moreover, inherent in one-hot encoding is the user's admission that the data can be duplicated without memory problems: i.e. for input sized n, minimum memory usage 2n is guaranteed. A temporary copy raising this to 3n is not worth the additional parameter (especially as it looks like X_mask may already make this 4n relative to 3n).

@jnothman
Member
jnothman commented May 3, 2016

btw, we should look at http://wdm0006.github.io/categorical_encoding/ some time

This includes an OrdinalEncoder like our LabelEncoder but for each feature. Except for that, an import of our OneHotEncoder, and a Feature Hasher implementation, all others are applications of patsy.C to each column of the input, with contrasts Diff, Poly, Sum and Helmert. I don't think it is relevant to this PR.

@jnothman
Member
jnothman commented May 3, 2016

@vighneshbirodkar, please avoid rewriting commit history for work that has been commented on and is being updated. It makes GitHub links from email unusable. I assume that's the trap I've been falling into when trying to read your updates.

@vighneshbirodkar
Contributor Author

@jnothman Sorry about that. I will keep that in mind.

@MechCoder
Member

The following snippet illustrates why X[:, sel] and X[:, ind[sel]] are different

Thanks for the illustration! As @jnothman suggests, that is why we have safe_mask, but it is of course easier (and shorter) to keep it as is.
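The idea behind safe_mask can be sketched in a few lines: old scipy releases mishandled boolean column masks on sparse matrices (as the snippet above shows), so the mask is converted to integer indices whenever the input is sparse. This is an illustrative sketch of the idea, not sklearn.utils.safe_mask itself:

```python
import numpy as np
from scipy import sparse

def safe_col_mask(X, mask):
    """Return an indexer safe to use on X: integer indices when X is
    sparse and `mask` is boolean, the mask unchanged otherwise."""
    mask = np.asarray(mask)
    if sparse.issparse(X) and mask.dtype == bool:
        return np.where(mask)[0]   # boolean mask -> integer indices
    return mask

X = sparse.csr_matrix(np.arange(16).reshape(4, 4))
sel = np.array([True, False, True, False])
# Selects columns 0 and 2 regardless of the scipy version:
print(X[:, safe_col_mask(X, sel)].toarray())
```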

@MechCoder
Member

@jnothman @vighneshbirodkar I also think deprecating n_values="auto" as the default behaviour is a good idea. Whether you do this with an if-else clause that defaults to the old behaviour, or by fitting the LabelEncoder on range(col) instead of col, is up to you. I would do the second because it seems easier, but it is personal preference (assuming there are no huge losses in speed).

@jnothman
Member
jnothman commented May 7, 2016

I also think deprecating n_values="auto" as the default behaviour is a good idea.

For what reason?

@amueller amueller added this to the 0.18 milestone Aug 31, 2016
@amueller
Member

Bumpety bump!
@vighneshbirodkar do you have any time to work on this?
I know we usually don't want to push releases back for features, but I think we really need to fix this. Currently encoding strings without pandas is a hell of a pain and it's kind of embarrassing.
@jnothman @ogrisel what do you think?

    if np.any(X < 0):
        raise ValueError("X needs to contain only non-negative integers.")

def _fit(self, X):
    "Assumes `X` contains only catergorical features."
Member

typo

@jnothman
Member

Currently encoding strings without pandas is a hell of a pain and it's kind of embarrassing

Another book issue? :P

@amueller
Member

@jnothman this one will not make it into the book. But trying to write the book and having to call get_dummies and not being able to use OneHotEncoder felt so strange.

@vighneshbirodkar
Contributor Author

@jnothman @ogrisel @amueller
Notice the current behavior:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder()
X = np.array([[10]])
print(enc.fit_transform(X).toarray()) # [[1. ]]
print(enc.transform([[7]]).toarray()) # [[0. ]]

The documentation says that we infer the range, so shouldn't the one-hot encoded output have 10 columns?

@amueller
Member

I think the behavior "should be" that we infer unique values, and not ranges. If that's what we currently do, that's good. If it's not documented properly, that's bad.

@vighneshbirodkar
Contributor Author

@amueller
The documentation says that we infer ranges, and we do (sort of).
What's actually happening is that we first infer the range, and then remove the columns (after one-hot encoding) that are all zeros. Then during subsequent transform calls, we completely ignore values that were not seen during fit, even if handle_unknown was set to "error".

Here is an example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[1], [5], [7]])
enc.fit(X)

print(enc.transform([[7]]).toarray())  # [[ 0.  0.  1.]]
print(enc.transform([[1]]).toarray())  # [[ 1.  0.  0.]]
print(enc.transform([[6]]).toarray())  # [[ 0.  0.  0.]]

@jnothman Do you think all of this behavior should be retained? I understand your point about backward compatibility, and I agree that to maintain it we have to infer the range of values by default; but if we do that, we should expect to see any value within the range during transform and encode it.
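The mixed behaviour described above — range inference followed by an active-features mask — can be reproduced for a single integer column in a few lines of NumPy. This is a hypothetical reconstruction for illustration, not the actual master code:

```python
import numpy as np

def fit_master_style(col):
    """Sketch of master's n_values='auto' logic for one integer column:
    encode over range(max+1), then keep only the columns that were
    active (non-zero) during fit."""
    n_values = col.max() + 1
    onehot = np.eye(n_values)[col]      # range-based one-hot encoding
    active = onehot.sum(axis=0) > 0     # mask for all-zero columns
    return active

def transform_master_style(col, active):
    onehot = np.eye(active.shape[0])[col]
    return onehot[:, active]            # unseen values vanish silently

active = fit_master_style(np.array([1, 5, 7]))
# 6 is in range but unseen, so its column was masked out at fit time:
print(transform_master_style(np.array([6]), active))  # all zeros, no error
```

This is why a value like 6 produces an all-zero row rather than an error: its slot existed in the range-based encoding but was dropped by the mask.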

@amueller
Member

I feel that the current behavior is a total mess (that I take all blame for) and I think we should infer unique values and if there are new ones, we should error.

@vighneshbirodkar
Contributor Author

@amueller I second your opinion (we also talked about it on campus if you remember), but @jnothman had some concerns about backward compatibility (in this thread), which are also valid. But I am not sure to what extent we can maintain backward compatibility.

@jnothman
Member

I don't think there's anything wrong with all columns left blank because a value was unseen in training. Datasets aren't going to be stratified over all categorical variables. New values should only lead to errors with an option.

I think the only bug in the current approach is handle_unknown being limited to out-of-range type errors, which @vighneshbirodkar says is inconsistent with docs. Changing this behaviour is going to create errors where there were none before, but I'm inclined to say that's okay. It would not be okay if we weren't FOSS.

I think that the design in terms of ranges followed by a mask is ugly, but I don't think we have any excuse for disregarding backwards compatibility. I think we can deprecate the attributes that are tied to this behaviour.

The design limitation to integer features is unfortunate. We should support any hashable (or do we need orderable?) object. Note also that the strings case can be achieved with a wrapper around DictVectorizer.

@amueller
Member
amueller commented Sep 1, 2016

how does the concept of ranges transfer to hashable objects?

@jnothman
Member
jnothman commented Sep 1, 2016

Without expending the time to look: I think because the ranges are now masked, we've effectively got sets of integers atm. That's a subset of hashable objects. Ranges only then would be a shorthand for specifying the set of values to create slots for... but backwards compatibility remains tricky then.

@vighneshbirodkar
Contributor Author
vighneshbirodkar commented Sep 1, 2016

The documentation does not say anything about unseen values; it just states that the encoder determines the range of values. I think the current behavior is wrong because the encoder cannot encode all values in range during transform when it claims to. Here is how I think the OHE should behave. I also think that if we choose to make it backward compatible, this is the behavior it should adopt for n_values='auto'. Here is a snippet to illustrate.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[1], [5], [7]])
enc.fit(X)

print(enc.transform([[7]]).toarray())  # [[ 0. 0. 0. 0. 0. 0. 0. 1.]]
print(enc.transform([[1]]).toarray())  # [[ 0. 1. 0. 0. 0. 0. 0. 0.]]
print(enc.transform([[6]]).toarray())  # [[ 0. 0. 0. 0. 0. 0. 1. 0.]]

@jnothman Do you think this is acceptable?

@jnothman
Member
jnothman commented Sep 1, 2016

"Range" can be understood either way, IMO. I don't know what you're asking about being acceptable.

@vighneshbirodkar
Contributor Author

@jnothman I am asking if it's okay for the OHE to behave the way I wrote in my last snippet. I think its current behavior is a weird combination of inferring ranges and remembering values.

@jnothman
Member
jnothman commented Sep 1, 2016

@vighneshbirodkar. My summary of the situation is that:

  • We need to keep backwards compatibility in parameters and attributes (up to deprecation). The messy part is handle_unknown='error' where the prior behaviour wasn't obvious if n_values='auto'. My proposed, backwards-compatible solution: define class UnknownButInRangeWarning(FutureWarning) and warn that in the future an error will be raised if a value is unknown but in range for the 'auto' setting. Further, say that a user who wants the future behaviour can use warnings.simplefilter('error', UnknownButInRangeWarning). (Alternatively offer an 'error-strict' setting that will become default.)
  • I think a more user-friendly attribute interface would have:
    • one attribute which specifies which output features encode which input features, either in the style of feature_indices_ but without the masking madness, or just as an array of indices. Either way we have to deprecate the current attributes and invent new attribute names.
    • one attribute which specifies which input values are encoded for each output feature
    • I don't think that we should make the list of LabelEncoders public. If nothing else, this further entrenches confusion about label vs feature encoding.
  • Don't worry about the copy parameter. Let's assume always copying.
  • This discussion is long and unwieldy. Perhaps we should open a new PR to clean the slate.
  • @amueller is anxious to have this in 0.18. Do you feel up to having this complete in the next couple of days?
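The warning-based transition proposed in the first bullet might look like the following. All names and messages here are hypothetical sketches of the proposal; none of this was merged:

```python
import warnings

class UnknownButInRangeWarning(FutureWarning):
    """Hypothetical warning: a value was not seen during fit but falls
    inside the range inferred under n_values='auto'."""

def check_transform_value(value, seen, n_values):
    """Sketch of the proposed transition for handle_unknown='error'."""
    if value in seen:
        return
    if 0 <= value < n_values:
        # Backwards compatible: warn now, promise an error later.
        warnings.warn("Value %r was not seen during fit; in the future "
                      "this will raise an error." % value,
                      UnknownButInRangeWarning)
    else:
        raise ValueError("Value %r is out of range." % value)

# A user who wants the future behaviour today can opt in:
# warnings.simplefilter('error', UnknownButInRangeWarning)
check_transform_value(4, seen={1, 5, 7}, n_values=8)  # warns, for now
```

With the simplefilter line uncommented, the same call raises immediately, which is exactly the opt-in the proposal describes.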

@jnothman
Member
jnothman commented Sep 1, 2016

And no, if values='auto' it should continue to only output those values seen in training.

@GaelVaroquaux
Member
GaelVaroquaux commented Sep 1, 2016 via email

@jnothman
Member
jnothman commented Sep 1, 2016

I would say: option to error or to have an "etc" class.

I think that is a separate issue altogether, and should remain out of this PR. (And please let us not call it a "class"!) Note that again the etc feature will go unused in a classic fit-predict paradigm, and a 'clip' setting would make more sense in that case...

@GaelVaroquaux
Member
I would say: option to error or to have an "etc" class.

I think that is a separate issue altogether, and should remain out of this PR.

Maybe. It wasn't obvious to me

(And please let us not call it a "class"!)

Whatever. Terminology is a lost cause anyhow, as each microfield has its own expectations.

Note that again the etc feature will go unused in a classic fit-predict
paradigm,

Why? I don't understand here.

and a 'clip' setting would make more sense in that case...

Possible, but that would change the number of samples, and that is not an option currently.

@vighneshbirodkar
Contributor Author

@jnothman I like your idea of using an UnknownButInRangeWarning. I will make a new PR for this, but I hope it's alright to use the same branch. Just to clarify: should UnknownButInRangeWarning and error-strict both be implemented, or just one of them?

@jnothman
Member
jnothman commented Sep 1, 2016

If we're implementing 'error-strict' I'd suggest a vanilla FutureWarning or DeprecationWarning.

@vighneshbirodkar
Contributor Author

@GaelVaroquaux @amueller I am in favor of @jnothman's idea of error-strict, eventually making it the default. I can submit a PR for that by tomorrow.

@jnothman
Member
jnothman commented Sep 1, 2016

Note that again the etc feature will go unused in a classic fit-predict paradigm,

Why? I don't understand here.

You're right, Gaël, if the set of values is given explicitly, 'etc' remains useful. If values='auto' then the etc features will be empty in fit_transform.

re 'clip'
Possible, but that would change the number of samples, and that not an option currently.

I don't mean to change the number of samples, just to have a feature value that it defaults to if out of range and is then encoded.

I still think that further options for handle_unknown are a separate feature.
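The "etc" slot Gaël suggests, and the 'clip'-style default value discussed just above, can both be sketched as an extra output column that absorbs any out-of-vocabulary value. This is a hypothetical illustration of the idea, never part of this PR:

```python
import numpy as np

def encode_with_etc(col, classes):
    """One-hot encode over `classes` plus one trailing 'etc' slot that
    catches any value not in `classes`, instead of erroring."""
    lookup = {c: i for i, c in enumerate(classes)}
    etc = len(classes)                     # index of the catch-all slot
    idx = np.array([lookup.get(v, etc) for v in col])
    return np.eye(len(classes) + 1)[idx]

# 'emu' was never in the fitted classes, so it lands in the etc column:
print(encode_with_etc(['cat', 'emu'], classes=['cat', 'dog']))
```

As jnothman notes, under values='auto' the etc column would always be empty in fit_transform, since by construction every fit value is in the vocabulary; it only becomes useful when the class set is given explicitly.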

@GaelVaroquaux
Member
GaelVaroquaux commented Sep 1, 2016 via email

@jnothman
Member
jnothman commented Sep 1, 2016

Yes, we're already doing a couple of things here: cleaning up some weird behaviour and interface; and supporting non-int input. Arguably those too could be done separately except that the latter depends on and motivates the former.

@vighneshbirodkar
Contributor Author

Closed in favor of #7327

