8000 [WIP] One hot encoder now errors on any unknown categorical feature. by vighneshbirodkar · Pull Request #5270 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[WIP] One hot encoder now errors on any unknown categorical feature. #5270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

vighneshbirodkar
Copy link
Contributor

See the new test case added which fails on the current master.

@vighneshbirodkar
Copy link
Contributor Author

Ping @amueller , or can you cc someone else who can maybe take a look at this ?

@amueller
Copy link
Member

Sorry still home sick, and the release is a priority atm.
Can you benchmark your changes please?

@vighneshbirodkar
Copy link
Contributor Author

@amueller You mean these changes themselves ? Or this against the alternative implementation I talked to you about ?

@amueller
Copy link
Member

I meant benchmarking this against master. Sorry for the slow turn-around.

@vighneshbirodkar
Copy link
Contributor Author

Times for fit and predict respectively for 10000 rows and 1000 cols with each column having 100 unique features

Master

CPU times: user 978 ms, sys: 263 ms, total: 1.24 s
Wall time: 1.24 s
CPU times: user 1.29 s, sys: 544 ms, total: 1.83 s
Wall time: 1.84 s

This branch

CPU times: user 868 ms, sys: 420 ms, total: 1.29 s
Wall time: 3.76 s
CPU times: user 1.32 s, sys: 648 ms, total: 1.97 s
Wall time: 1.97 s

@amueller
Copy link
Member

So the first is fit and the second predict? It's always good to post the benchmark script, here or as a gist.
The wall time are very different but the cpu time are not. Can you rerun?

@vighneshbirodkar
Copy link
Contributor Author
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time

ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))


%time enc.fit(data)
%time enc.transform(data)

Master

CPU times: user 890 ms, sys: 373 ms, total: 1.26 s
Wall time: 3.21 s
CPU times: user 1.33 s, sys: 392 ms, total: 1.72 s
Wall time: 1.72 s

This branch

CPU times: user 877 ms, sys: 327 ms, total: 1.2 s
Wall time: 2 s
CPU times: user 1.3 s, sys: 399 ms, total: 1.7 s
Wall time: 1.7 s

@vighneshbirodkar
Copy link
Contributor Author

@amueller The Wall times vary significantly, I can't seem to figure out why

@amueller
Copy link
Member

That's probably because you are running another process? Can you use timeit?

@vighneshbirodkar
Copy link
Contributor Author
from sklearn.preprocessing.data import OneHotEncoder
import numpy as np
import time

ROWS = 10000
COLS = 1000
enc = OneHotEncoder(handle_unknown='error')
data = np.random.randint(1,100,size=(ROWS, COLS))


%timeit -n 5 -p 5 enc.fit(data)

%timeit -n 5 -p 5 enc.transform(data)

Master

5 loops, best of 3: 1.1307 s per loop
5 loops, best of 3: 1.3384 s per loop

This Branch

5 loops, best of 3: 1.2471 s per loop
5 loops, best of 3: 1.2761 s per loop

There is really no reason for a speedup on this branch, I guess the times are really similar.

@amueller amueller changed the title One hot encoder now errors on any unknown categorical feature. [MRG] One hot encoder now errors on any unknown categorical feature. Oct 16, 2015
self.feature_indices_[i],
split_arrays[i])
for i in range(nf)]
self.unique_samples = samples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All attributes should end with _ and be documented in the class.

@MechCoder
Copy link
Member

@amueller Is the check np.any(X < 0) because of the current logic in the OneHotEncoder code? :P

@amueller
Copy link
Member

@MechCoder well because of the current semantics...

train_classes = set(self.unique_samples_[i])

if not found_classes.issubset(train_classes):
if self.handle_unknown not in ['error', 'ignore']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, no it's not, will add a test soon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is handle_unknown now only validated when n_values="auto"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because handle_unknown isn't accessed unless n_values is "auto"

@vighneshbirodkar
Copy link
Contributor Author

@amueller Any more suggestions for this ?

@amueller
Copy link
Member
amueller commented Nov 2, 2015

I'll try to review soon. Sorry, I'm a bit busy with release work at the moment.

@vighneshbirodkar
Copy link
Contributor Author

No problem.

@vighneshbirodkar
Copy link
Contributor Author

@MechCoder WDYT ?

@@ -1704,6 +1704,10 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
n_values_ : array of shape (n_features,)
Maximum number of values per feature.

unique_samples_ : list of length `n_features`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't use backticks in the first line. I think is messes up the rendering.

@vighneshbirodkar
Copy link
Contributor Author

@amueller
Because we can encounter unknown features even during fit, I have moved the validation of handle_unknown inside fit

@@ -1750,6 +1754,10 @@ def fit(self, X, y=None):
-------
self
"""
if self.handle_unknown not in ['error', 'ignore']:
template = ("handle_unknown should be either error or "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe single quotes around 'error' and 'ignore'?

@vighneshbirodkar
Copy link
Contributor Author

@amueller
There is an old saying in my native language which roughly translates to "You searched the entire village for a vessel which was under your hand the whole time". The code we needed was lying just a few lines below, I should have noticed it before.

@amueller
Copy link
Member

LGTM

@amueller amueller changed the title [MRG] One hot encoder now errors on any unknown categorical feature. [MRG + 1] One hot encoder now errors on any unknown categorical feature. Feb 22, 2016
@amueller
Copy link
Member

I just realized that we kinda need to break backward compatibility if we want strings to work with this.... meh.

@amueller
Copy link
Member

Ok either merge this or deprecate OneHotEncoder and implement it more reasonably ;)

@vighneshbirodkar
Copy link
Contributor Author

I am in favor of a new implementation.

@vighneshbirodkar vighneshbirodkar changed the title [MRG + 1] One hot encoder now errors on any unknown categorical feature. [WIP] One hot encoder now errors on any unknown categorical feature. Mar 12, 2016
@vighneshbirodkar
Copy link
Contributor Author

@amueller @MechCoder
I wanted to know your opinion about this. Instead of creating a new class I have tried to modify OneHotEncoder and make it backward compatible. I had to remove some attributes, but it supports the n_values parameter in a backward compatible manner.

Edit :
I would like to add that the attributes active_features_ and and features_indices_ are really internal variables and I doubt if any code using sklearn relies on them.

@vighneshbirodkar
Copy link
Contributor Author

Closing in favor of #6602

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0