[WIP] OrdinalEncoder functionality for unknown categories when transforming #13808
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Reference Issues/PRs
Fixes #13488, see also #12153
What does this implement/fix? Explain your changes.
Currently
OrdinalEncoder
breaks if encounters unknown/unseen categories during transform. This PR adds two functionalities:handle_unknown = 'cast'
, which as discussed in Handle Error Policy in OrdinalEncoder #13488 adds a new category given by the user viaunknown_category
(defaults to'unknown'
). It will also add this category to thecategories_
attribute where necessary in order forinverse_transform
to work accordingly.handle_unknown = 'ignore'
, which will remove the observation (row), keeping valid observations across the different features (columns).Any other comments?
In #13488 it was discussed to introduce
min_frequency
or an extra base class as mentioned in #12153, which I think is an overkill, at least to start. In my opinion if that much complicated logic is necessary it should be the user doing it with a customize pipeline, a user cannot expect sklearn to tackle such specific case in my opinion.This PR is only a few lines of code and I believe it can easily be extended if indeed
min_frequency
is desirable since the only change will be in the base class and in the return of the mask from_transform
.Example:
I know I have not comply with PEP8 nor wrote tests yet but I want to first get your opinion on this PR before putting more effort.