[WIP] OrdinalEncoder functionality for unknown categories when transforming#13808
Closed
rragundez wants to merge 6 commits intoscikit-learn:masterfrom
rragundez:unseen-ordinal-transformer
Closed
[WIP] OrdinalEncoder functionality for unknown categories when transforming#13808rragundez wants to merge 6 commits intoscikit-learn:masterfrom rragundez:unseen-ordinal-transformer
rragundez wants to merge 6 commits intoscikit-learn:masterfrom
rragundez:unseen-ordinal-transformer
Conversation
Contributor
Author
|
could I get a quick review @ogrisel ? |
Contributor
Author
|
just pinging for a quick review... I would like to move forward. |
Member
|
@rragundez Thank you for the PR! The API of this feature has not been finalized yet. Some concerns are:
|
Contributor
Author
|
Thanks @thomasjpfan, I understand the logic behind your points and I agree with them. I will close this PR now and check if i can submit another one with the points you raised. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference Issues/PRs
Fixes #13488, see also #12153
What does this implement/fix? Explain your changes.
Currently
OrdinalEncoderbreaks if encounters unknown/unseen categories during transform. This PR adds two functionalities:handle_unknown = 'cast', which as discussed in Handle Error Policy in OrdinalEncoder #13488 adds a new category given by the user viaunknown_category(defaults to'unknown'). It will also add this category to thecategories_attribute where necessary in order forinverse_transformto work accordingly.handle_unknown = 'ignore', which will remove the observation (row), keeping valid observations across the different features (columns).Any other comments?
In #13488 it was discussed to introduce
min_frequencyor an extra base class as mentioned in #12153, which I think is an overkill, at least to start. In my opinion if that much complicated logic is necessary it should be the user doing it with a customize pipeline, a user cannot expect sklearn to tackle such specific case in my opinion.This PR is only a few lines of code and I believe it can easily be extended if indeed
min_frequencyis desirable since the only change will be in the base class and in the return of the mask from_transform.Example:
I know I have not comply with PEP8 nor wrote tests yet but I want to first get your opinion on this PR before putting more effort.