-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[WIP] One hot encoder now errors on any unknown categorical feature. #5270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Ping @amueller , or can you cc someone else who can maybe take a look at this ? |
Sorry still home sick, and the release is a priority atm. |
@amueller You mean these changes themselves ? Or this against the alternative implementation I talked to you about ? |
I meant benchmarking this against master. Sorry for the slow turn-around. |
Times for fit and predict respectively for 10000 rows and 1000 cols with each column having 100 unique features Master
This branch
|
So the first is fit and the second predict? It's always good to post the benchmark script, here or as a gist. |
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time
ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))
%time enc.fit(data)
%time enc.transform(data) Master
This branch
|
@amueller The Wall times vary significantly, I can't seem to figure out why |
That's probably because you are running another process? Can you use |
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time
ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))
%timeit -n 5 -p 5 enc.fit(data)
%timeit -n 5 -p 5 enc.transform(data) Master
This Branch
There is really no reason for a speedup on this branch, I guess the times are really similar. |
self.feature_indices_[i], | ||
split_arrays[i]) | ||
for i in range(nf)] | ||
self.unique_samples = samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All attributes should end with _
and be documented in the class.
@amueller Is the check |
@MechCoder well because of the current semantics... |
train_classes = set(self.unique_samples_[i]) | ||
|
||
if not found_classes.issubset(train_classes): | ||
if self.handle_unknown not in ['error', 'ignore']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, no it's not, will add a test soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is handle_unknown
now only validated when n_values="auto"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because handle_unknown
isn't accessed unless n_values
is "auto"
@amueller Any more suggestions for this ? |
I'll try to review soon. Sorry, I'm a bit busy with release work at the moment. |
No problem. |
e020031
to
5fba797
Compare
@MechCoder WDYT ? |
@@ -1704,6 +1704,10 @@ class OneHotEncoder(BaseEstimator, TransformerMixin): | |||
n_values_ : array of shape (n_features,) | |||
Maximum number of values per feature. | |||
|
|||
unique_samples_ : list of length `n_features` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't use backticks in the first line. I think is messes up the rendering.
@amueller |
@@ -1750,6 +1754,10 @@ def fit(self, X, y=None): | |||
------- | |||
self | |||
""" | |||
if self.handle_unknown not in ['error', 'ignore']: | |||
template = ("handle_unknown should be either error or " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe single quotes around 'error'
and 'ignore'
?
@amueller |
LGTM |
I just realized that we kinda need to break backward compatibility if we want strings to work with this.... meh. |
Ok either merge this or deprecate OneHotEncoder and implement it more reasonably ;) |
I am in favor of a new implementation. |
@amueller @MechCoder Edit : |
Closing in favor of #6602 |
See the new test case added which fails on the current master.