[WIP] OrdinalEncoder functionality for unknown categories when transforming by rragundez · Pull Request #13808 · scikit-learn/scikit-learn · GitHub

[WIP] OrdinalEncoder functionality for unknown categories when transforming #13808

Closed
rragundez wants to merge 6 commits

@rragundez
Contributor
rragundez commented May 6, 2019

Reference Issues/PRs

Fixes #13488, see also #12153

What does this implement/fix? Explain your changes.

Currently, OrdinalEncoder fails if it encounters unknown (unseen) categories during transform. This PR adds two options:

  • handle_unknown = 'cast', which, as discussed in "Handle Error Policy in OrdinalEncoder" (#13488), maps unknown values to a new category supplied by the user via unknown_category (defaults to 'unknown'). The new category is also appended to the categories_ attribute where necessary, so that inverse_transform keeps working.

  • handle_unknown = 'ignore', which removes every observation (row) that contains an unknown category, keeping only the observations that are valid across all features (columns).
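
The two options above can be sketched in plain Python. This is only an illustration of the described behaviour, not the code in this PR and not scikit-learn's implementation: `fit_categories`, `encode_column`, and `transform` are hypothetical helpers, and the data is the same as in the example further down.

```python
# Illustrative sketch only: plain Python, not scikit-learn's implementation.
# fit_categories, encode_column and transform are hypothetical helper names.

def fit_categories(column):
    """Learn the sorted unique values of one column."""
    return sorted(set(column))


def encode_column(column, categories, handle_unknown="error",
                  unknown_category="unknown"):
    """Encode one column; return (codes, valid_mask, categories_used)."""
    index = {cat: code for code, cat in enumerate(categories)}
    valid_mask = [value in index for value in column]
    categories = list(categories)
    if all(valid_mask):
        return [index[v] for v in column], valid_mask, categories
    if handle_unknown == "cast":
        # Append the placeholder category only when it is actually needed.
        categories.append(unknown_category)
        unknown_code = len(categories) - 1
        codes = [index.get(v, unknown_code) for v in column]
        return codes, valid_mask, categories
    if handle_unknown == "ignore":
        # Unknown entries get None; the caller drops those rows via the mask.
        return [index.get(v) for v in column], valid_mask, categories
    unseen = sorted({v for v, ok in zip(column, valid_mask) if not ok})
    raise ValueError(f"Found unknown categories {unseen} during transform")


def transform(rows, fitted_categories, handle_unknown="error"):
    """Encode all columns; with 'ignore', drop rows invalid in any column."""
    encoded_columns, masks, categories_out = [], [], []
    for column, categories in zip(zip(*rows), fitted_categories):
        codes, mask, cats = encode_column(list(column), categories,
                                          handle_unknown)
        encoded_columns.append(codes)
        masks.append(mask)
        categories_out.append(cats)
    encoded_rows = [list(row) for row in zip(*encoded_columns)]
    if handle_unknown == "ignore":
        keep = [all(flags) for flags in zip(*masks)]
        encoded_rows = [row for row, k in zip(encoded_rows, keep) if k]
    return encoded_rows, categories_out


col_0 = ['dog', 'horse', 'cat', 'cat', 'dog', 'cat', 'horse']
train_rows = list(zip(col_0, ['A', 'A', 'C', 'B', 'Z', 'C', 'C']))
test_rows = list(zip(col_0, ['A', 'A', 'something_else', 'B', 'Z', 'C',
                             'SOMETHING']))

fitted = [fit_categories(list(col)) for col in zip(*train_rows)]
cast_rows, cast_categories = transform(test_rows, fitted, "cast")
ignore_rows, ignore_categories = transform(test_rows, fitted, "ignore")
```

In this sketch a per-column validity mask drives both options: 'cast' uses it to decide whether the extra category is needed, and 'ignore' combines the masks across columns to decide which rows to drop.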

Any other comments?

In #13488 it was proposed to introduce min_frequency or an extra base class, as mentioned in #12153, which I think is overkill, at least to start. If logic that complicated is necessary, it should be done by the user in a custom pipeline; in my opinion a user cannot expect sklearn to handle such a specific case.
This PR is only a few lines of code, and I believe it can easily be extended if min_frequency does turn out to be desirable, since the only changes would be in the base class and in the mask returned from _transform.

Example:

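import pandas as pd
from sklearn.preprocessing import OrdinalEncoder  # with this PR's changes applied

df = pd.DataFrame({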
    'col_0': ['dog', 'horse', 'cat', 'cat', 'dog', 'cat', 'horse'],
    'col_1': ['A', 'A', 'C', 'B', 'Z', 'C', 'C']
})
df_test = pd.DataFrame({
    'col_0': ['dog', 'horse', 'cat', 'cat', 'dog', 'cat', 'horse'],
    'col_1': ['A', 'A', 'something_else', 'B', 'Z', 'C', 'SOMETHING']
})

# cast 
b = OrdinalEncoder(handle_unknown='cast')
b.fit(df)
x = b.transform(df_test)
print(x)
>>>
[[1. 0.]
 [2. 0.]
 [0. 4.]
 [0. 1.]
 [1. 3.]
 [0. 2.]
 [2. 4.]]

print(b.inverse_transform(x))
>>>
[['dog' 'A']
 ['horse' 'A']
 ['cat' 'unknown']
 ['cat' 'B']
 ['dog' 'Z']
 ['cat' 'C']
 ['horse' 'unknown']]

print(b.categories_)
>>>
[array(['cat', 'dog', 'horse'], dtype=object), array(['A', 'B', 'C', 'Z', 'unknown'], dtype=object)]

# ignore 
b = OrdinalEncoder(handle_unknown='ignore')
b.fit(df)
x = b.transform(df_test)
print(x)
>>>
[[1. 0.]
 [2. 0.]
 [0. 1.]
 [1. 3.]
 [0. 2.]]

print(b.inverse_transform(x))
>>>
[['dog' 'A']
 ['horse' 'A']
 ['cat' 'B']
 ['dog' 'Z']
 ['cat' 'C']]

print(b.categories_)
>>>
[array(['cat', 'dog', 'horse'], dtype=object), array(['A', 'B', 'C', 'Z'], dtype=object)]

I know I have not complied with PEP8 or written tests yet, but I want to get your opinion on this PR first before putting in more effort.

@rragundez
Contributor Author

Could I get a quick review, @ogrisel?

@rragundez
Contributor Author
rragundez commented Jun 6, 2019

Just pinging for a quick review... I would like to move forward.

@thomasjpfan
Member

@rragundez Thank you for the PR! The API of this feature has not been finalized yet. Some concerns are:

  1. What if the unknown_category is a category already used in the dataset? Raising an error would be reasonable.
  2. The categories_ attribute of OrdinalEncoder is currently sorted (to keep its ordinalness). Appending the unknown_category to the end would break this. (This is most likely okay.)
  3. Removing rows would not work. It would break a bunch of things, like metrics.
  4. Since there is a desire to handle unknown and infrequent (and maybe missing) categories, it is best to think about the API now. (We do not want to add unknown support and then find out we need to deprecate it because of how it interacts with infrequent.) And since OrdinalEncoder and OneHotEncoder share some code, we would need to think about how to handle unknown, infrequent, and maybe missing in both contexts.
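
Concerns 1 and 3 can be made concrete with a small plain-Python sketch. It is only an illustration under assumptions: `check_unknown_category` is a hypothetical helper, not an existing scikit-learn function, and the data is made up for the example.

```python
# Illustrative sketch only; check_unknown_category is a hypothetical helper,
# not part of scikit-learn.

def check_unknown_category(fitted_categories, unknown_category="unknown"):
    """Raise if the placeholder collides with a category seen during fit."""
    for column_index, categories in enumerate(fitted_categories):
        if unknown_category in categories:
            raise ValueError(
                f"unknown_category={unknown_category!r} is already a "
                f"category of column {column_index}"
            )


# Concern 1: the training data itself contains the value 'unknown'.
fitted_categories = [["cat", "dog", "horse"], ["A", "B", "unknown"]]
try:
    check_unknown_category(fitted_categories)
    collision_detected = False
except ValueError:
    collision_detected = True

# Concern 3: dropping rows inside transform changes the number of samples
# in X, while the targets y still have one entry per original row.
y_test = [0, 1, 0, 1, 1, 0, 1]
keep = [True, True, False, True, True, True, False]  # rows without unknowns
n_rows_after_transform = sum(keep)
lengths_match = n_rows_after_transform == len(y_test)
```

Here `collision_detected` ends up `True`, and `lengths_match` ends up `False` because five transformed rows would be paired with seven targets.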

@rragundez
Contributor Author

Thanks @thomasjpfan, I understand the logic behind your points and I agree with them. I will close this PR now and check whether I can submit another one that addresses the points you raised.

@rragundez closed this on Jun 9, 2019