Please provide option to set unknown_values during test time to same as encoded min_frequency in OrdinalEncoder(Infrequent categories) · Issue #27629 · scikit-learn/scikit-learn · GitHub
Describe the workflow you want to enable
OneHotEncoder accepts handle_unknown='infrequent_if_exist', but OrdinalEncoder has no equivalent option. Currently, the value set through unknown_value and the code assigned to infrequent categories (when min_frequency is set) are different. A workaround is to look up the code that min_frequency assigns to infrequent categories and pass it as unknown_value, but an option similar to OneHotEncoder's handle_unknown='infrequent_if_exist' seems more intuitive, since we would usually want to treat unseen values as infrequent ones. I'm not sure whether this feature already exists and I'm missing it.
Describe your proposed solution
Add an option similar to OneHotEncoder's handle_unknown='infrequent_if_exist', so that unknown categories (values not seen during training) receive the same encoding that infrequent categories received during training.
Describe alternatives you've considered, if relevant
No response
Additional context
No response
If you eliminate the distinction between infrequent categories and the unknown category, there is also no need for the infrequent_categories_ attribute, which can require a lot of storage on a large dataset where many of the infrequent values are unique. Having an option to not store those categories at all in the encoder's state would make the encoder much smaller when pickled. This is a rather important change for being able to use OrdinalEncoder with confidence on high-cardinality columns (where some of the values may still be frequent) without fearing that it will massively inflate the size of the final artifacts. The whole point of the new max_categories parameter seems to be to enable this kind of use on high-cardinality columns. If it were used solely for low-cardinality columns, why would the number of categories need to be limited at all?