[WIP] One hot encoder now errors on any unknown categorical feature.#5270
[WIP] One hot encoder now errors on any unknown categorical feature.#5270vighneshbirodkar wants to merge 14 commits intoscikit-learn:masterfrom
Conversation
|
Ping @amueller , or can you cc someone else who can maybe take a look at this ? |
|
Sorry still home sick, and the release is a priority atm. |
|
@amueller You mean these changes themselves ? Or this against the alternative implementation I talked to you about ? |
|
I meant benchmarking this against master. Sorry for the slow turn-around. |
|
Times for fit and predict respectively for 10000 rows and 1000 cols with each column having 100 unique features MasterThis branch |
|
So the first is fit and the second predict? It's always good to post the benchmark script, here or as a gist. |
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time
ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))
%time enc.fit(data)
%time enc.transform(data)MasterThis branch |
|
@amueller The Wall times vary significantly, I can't seem to figure out why |
|
That's probably because you are running another process? Can you use |
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time
ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))
%timeit -n 5 -p 5 enc.fit(data)
%timeit -n 5 -p 5 enc.transform(data)MasterThis BranchThere is really no reason for a speedup on this branch, I guess the times are really similar. |
sklearn/preprocessing/data.py
Outdated
There was a problem hiding this comment.
All attributes should end with _ and be documented in the class.
Sorry, something went wrong.
|
@amueller Is the check |
|
@MechCoder well because of the current semantics... |
sklearn/preprocessing/data.py
Outdated
There was a problem hiding this comment.
Good point, no it's not, will add a test soon
There was a problem hiding this comment.
why is handle_unknown now only validated when n_values="auto"?
There was a problem hiding this comment.
Because handle_unknown isn't accessed unless n_values is "auto"
|
@amueller Any more suggestions for this ? |
|
I'll try to review soon. Sorry, I'm a bit busy with release work at the moment. |
|
No problem. |
e020031 to
5fba797
Compare
|
@MechCoder WDYT ? |
sklearn/preprocessing/data.py
Outdated
There was a problem hiding this comment.
I wouldn't use backticks in the first line. I think is messes up the rendering.
|
@amueller |
sklearn/preprocessing/data.py
Outdated
| self | ||
| """ | ||
| if self.handle_unknown not in ['error', 'ignore']: | ||
| template = ("handle_unknown should be either error or " |
There was a problem hiding this comment.
maybe single quotes around 'error' and 'ignore'?
|
@amueller |
|
LGTM |
|
I just realized that we kinda need to break backward compatibility if we want strings to work with this.... meh. |
|
Ok either merge this or deprecate OneHotEncoder and implement it more reasonably ;) |
|
I am in favor of a new implementation. |
|
@amueller @MechCoder Edit : |
|
Closing in favor of #6602 |
See the new test case added which fails on the current master.