[MRG] Refactored OneHotEncoder #6602
Conversation
@jnothman Please have a look at this.

```diff
 def test_one_hot_encoder_categorical_features():
     X = np.array([[3, 2, 1], [0, 1, 1]])
-    X2 = np.array([[1, 1, 1]])
+    X2 = np.array([[3, 1, 1]])
```
This is the only potentially breaking behavior introduced. But not being able to tell whether an in-range value was nonetheless unseen during fit was a drawback of the original implementation.
I don't get what you're saying has changed. Please be explicit; the reviewers are not as intimately familiar with this code as you are right now.
Consider the following code:

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder(handle_unknown='error')
X = np.array([[3], [5], [7]])
enc.fit(X)
print(enc.transform([[4]]).toarray())
```
Output on master:

```
[[ 0. 0. 0.]]
```

Output on this branch:

```
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(enc.transform([[4]]).toarray())
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1845, in transform
    selected=self.categorical_features)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1657, in _apply_selected
    return transform(X)
  File "/home/vighnesh/git/scikit-learn/sklearn/preprocessing/data.py", line 1882, in _transform
    raise ValueError(msg)
ValueError: Unknown feature(s) [4] in column 0
```
In the current implementation on master, when n_values="auto", any value of a feature less than or equal to its maximum value during fit is acceptable. With the above code run on the master branch, any value in [0, 7] will not throw an error, in spite of the fact that it was never seen during fit.

To support strings I have to make the assumption that any value of a feature supplied during transform should either have been seen during fit, or specified explicitly using the classes/values argument. This will break existing code like the snippet above.
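The range semantics described above can be sketched with plain NumPy. This is a hypothetical simplification, not the actual scikit-learn code; `X_fit`, `n_values`, and `in_range` are illustrative names only:

```python
import numpy as np

# Hypothetical sketch of the range-based check on master: with
# n_values="auto", fit effectively records only each column's maximum,
# so any value in [0, max] passes validation at transform time.
X_fit = np.array([[3], [5], [7]])
n_values = X_fit.max(axis=0) + 1               # 8 slots: values 0..7

X_new = np.array([[4]])                        # never seen during fit
in_range = bool(np.all((X_new >= 0) & (X_new < n_values)))
print(in_range)  # True: 4 is in range, so master does not raise
```

This is exactly why the in-range-but-unseen value 4 slips through on master.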
This should be documented as a bug-fix then: when n_values is set to "auto", having categories in transform that are not present during training will raise an error.
It wasn't really a bug. It was behaving as it was documented. But we cannot keep that behavior if we choose to support strings.
Ok, it is clear to me what you expect. What do you think is the best way to go about this? Add an if/else clause and revert to the old logic when the array is int? Or keep processing the array as object type and emulate the old functionality by using np.arange arrays?
I'm not entirely sure what you're asking, and think you should go write some test and implement something that works. If you want, ask for feedback after the tests are written that the tested functionality is correct, before you implement it. Once you've got an implementation we can talk about whether there's a better approach.
@vighneshbirodkar Sorry, but I don't completely understand. In your original code snippet, you have explicitly set handle_unknown='error'. How is not raising an error for an unknown sample during transform, with handle_unknown='error' already set, documented behavior?
Oh yes, the previous behaviour was to look at the range and not the actual values. Hmm.
What was the consensus here (#5270)? I forgot. To just error?

btw, we should look at http://wdm0006.github.io/categorical_encoding/ some time
force-pushed from 507cc9f to d8e4781
@MechCoder The following snippet illustrates why:

```python
>>> import numpy as np
>>> np.__version__
'1.7.1'
>>> import scipy
>>> scipy.__version__
'0.11.0'
>>> from scipy import sparse
>>> X = np.arange(16).reshape(4, 4)
>>> X = sparse.csr_matrix(X)
>>> sel = np.array([True, False, True, False], dtype=np.bool)
>>> ind = np.arange(4)
>>> ind = ind.astype(np.int)
>>> X[:, sel].toarray()
array([[ 1,  0,  1,  0],
       [ 5,  4,  5,  4],
       [ 9,  8,  9,  8],
       [13, 12, 13, 12]])
>>> X[:, ind[sel]].toarray()
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])
```
@MechCoder @jnothman I believe I have addressed all your comments.

I'm offline atm with a cached page and can't look at your code for what you mean by a

This includes an

@vighneshbirodkar, please avoid rewriting commit history for work that has been commented on and is being updated. It makes GitHub links from email unusable. I assume that's the trap I've been falling into when trying to read your updates.

@jnothman Sorry about that. I will keep that in mind.

Thanks for the illustration! As @jnothman suggests, that is why have

@jnothman @vighneshbirodkar I also think deprecating

For what reason?

Bumpety bump!
```python
if np.any(X < 0):
    raise ValueError("X needs to contain only non-negative integers.")

def _fit(self, X):
    "Assumes `X` contains only catergorical features."
```
typo
Another book issue? :P

@jnothman this one will not make it into the book. But trying to write the book and having to call
@jnothman @ogrisel @amueller

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder()
X = np.array([[10]])
print(enc.fit_transform(X).toarray())  # [[1.]]
print(enc.transform([[7]]).toarray())  # [[0.]]
```

The documentation says that we infer the range, so shouldn't the one-hot encoded output have 10 columns?
I think the behavior "should be" that we infer unique values, and not ranges. If that's what we currently do, that's good. If it's not documented properly, that's bad.
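A sketch of the "infer unique values and error on new ones" behaviour being proposed. The helper functions `fit_column` and `transform_column` are hypothetical, not scikit-learn's implementation:

```python
import numpy as np

# Fit remembers the exact set of values per column; transform errors
# on anything outside that set, rather than accepting any in-range value.
def fit_column(col):
    return np.unique(col)            # sorted unique values seen during fit

def transform_column(col, classes):
    unknown = np.setdiff1d(col, classes)
    if unknown.size:
        raise ValueError("Unknown feature value(s) %s" % unknown)
    # one-hot encode via searchsorted into the learned classes
    out = np.zeros((len(col), len(classes)))
    out[np.arange(len(col)), np.searchsorted(classes, col)] = 1
    return out

classes = fit_column(np.array([3, 5, 7]))
print(transform_column(np.array([5, 3]), classes))
```

With this scheme, transforming a 4 raises even though 4 lies inside [3, 7].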
@amueller Here is an example

@jnothman Do you think all of this behavior should be retained? I understand your point about backward compatibility, and I agree that to maintain it we have to infer the range of values by default; but if we do that, we should expect to see any value within the range during transform and encode it.

I feel that the current behavior is a total mess (that I take all blame for) and I think we should infer unique values and if there are new ones, we should error.
I don't think there's anything wrong with all columns left blank because a value was unseen in training. Datasets aren't going to be stratified over all categorical variables. New values should only lead to errors with an option. I think the only bug in the current approach is

I think that the design in terms of ranges followed by a mask is ugly, but I don't think we have any excuse for disregarding backwards compatibility. I think we can deprecate the attributes that are tied to this behaviour.

The design limitation to integer features is unfortunate. We should support any hashable (or do we need orderable?) object. Note also that the strings case can be achieved with a wrapper around
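Roughly, the "ranges followed by a mask" design works like this. This is a simplified sketch of the n_values="auto" path, not the actual code in `sklearn/preprocessing/data.py`:

```python
import numpy as np

# Range-plus-mask design: allocate one slot per integer in [0, max]
# for each column, then keep only the slots actually seen during fit.
X = np.array([[3], [5], [7]])
n_values = X.max(axis=0) + 1                 # range: slots 0..7
column = X[:, 0]

full = np.zeros((len(column), n_values[0]))
full[np.arange(len(column)), column] = 1     # one-hot over the full range
active = full.sum(axis=0) > 0                # mask: slots actually used
encoded = full[:, active]                    # only columns 3, 5, 7 survive
print(encoded)
```

The mask is why the masked result ends up behaving like a set of integers, even though the bounds check upstream still only looks at the range.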
How does the concept of ranges transfer to hashable objects?

Without expending the time to look: I think because the ranges are now masked, we've effectively got sets of integers atm. That's a subset of hashable objects. Ranges only then would be a shorthand for specifying the set of values to create slots for... but backwards compatibility remains tricky then.
The documentation does not say anything about unseen values; it just states that the

@jnothman Do you think this is acceptable?

"Range" can be understood either way, IMO. I don't know what you're asking about being acceptable.

@jnothman I am asking if it's okay if the OHE behaves the way I had written in my last snippet. I think its current behavior is a weird combination of inferring range and remembering values.
@vighneshbirodkar. My summary of the situation is that:

And no, if
> I feel that the current behavior is a total mess (that I take all blame for) and I think we should infer unique values and if there are new ones, we should error.

I would say: option to error or to have an "etc" class.
I think that is a separate issue altogether, and should remain out of this PR. (And please let us not call it a "class"!) Note that again the etc feature will go unused in a classic fit-predict paradigm, and a 'clip' setting would make more sense in that case...
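To make the "etc" idea concrete, here is a hypothetical sketch; none of this exists in scikit-learn, and `encode_with_etc`, `classes`, and the catch-all column are illustrations only:

```python
import numpy as np

# Hypothetical "etc" handling: one extra slot at the end of the
# encoding collects every value not seen during fit, instead of
# raising an error.
classes = np.array([3, 5, 7])

def encode_with_etc(col):
    idx = np.searchsorted(classes, col)
    known = np.isin(col, classes)
    out = np.zeros((len(col), len(classes) + 1))
    # known values go to their own slot; unknown ones to the last slot
    out[np.arange(len(col)), np.where(known, idx, len(classes))] = 1
    return out

print(encode_with_etc(np.array([5, 4])))  # 4 falls into the "etc" slot
```

A 'clip' setting would instead map unknown values onto the nearest known slot, so no extra column is needed.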
Maybe. It wasn't obvious to me

Whatever. Terminology is a lost cause anyhow, as each microfield has its

Why? I don't understand here.

Possible, but that would change the number of samples, and that's not an
@jnothman I like your idea of using a

If we're implementing

@GaelVaroquaux @amueller I am in favor of @jnothman's
You're right, Gaël, if the set of values is given explicitly, 'etc' remains useful. If

I don't mean to change the number of samples, just to have a feature value that it defaults to if out of range and is then encoded. I still think that further options for
> I still think that further options for handle_unknown are a separate feature.

That's OK with me. Feature creep is always dangerous. I was just wondering if I was missing something.
Yes, we're already doing a couple of things here: cleaning up some weird behaviour and interface; and supporting non-int input. Arguably those two could be done separately, except that the latter depends on and motivates the former.
force-pushed from 7319923 to 398070f
Closed in favor of #7327
New Features in `OneHotEncoder`

- `classes` parameter instead of `n_values`.
- `LabelEncoder` instance for each column.

Changes

- Renamed `_transform_selected` to `_apply_selected`, giving it the ability to optionally not return transformed values.
- `_apply_selected` can no longer accept lists; it has to be given a `np.array` object. This is done because the input can be of `np.object` type and it cannot always be cast as a whole to `np.int` or `np.float` type. The transformed and non-transformed parts of the array are converted to the specified type before returning.
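The per-column `LabelEncoder` approach described above can be sketched roughly as follows. This is a simplified illustration of the idea, not the PR's actual code, and it also shows why string columns become supportable:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# One LabelEncoder per column maps each column's values (including
# strings) to integer codes; each column is then one-hot encoded from
# its own learned classes_.
X = np.array([['a', 'x'], ['b', 'y'], ['a', 'y']], dtype=object)

encoders = [LabelEncoder().fit(X[:, j]) for j in range(X.shape[1])]

blocks = []
for j, le in enumerate(encoders):
    idx = le.transform(X[:, j])                  # integer codes per column
    block = np.zeros((X.shape[0], len(le.classes_)))
    block[np.arange(X.shape[0]), idx] = 1
    blocks.append(block)

encoded = np.hstack(blocks)
print(encoded)
```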