Please provide option to set unknown_values during test time to same as encoded min_frequency in OrdinalEncoder(Infrequent categories) · Issue #27629 · scikit-learn/scikit-learn · GitHub

Please provide option to set unknown_values during test time to same as encoded min_frequency in OrdinalEncoder(Infrequent categories) #27629


Open
abhishek0093 opened this issue Oct 20, 2023 · 2 comments
Labels
Hard (Hard level of difficulty), help wanted, New Feature

Comments

@abhishek0093
abhishek0093 commented Oct 20, 2023

Describe the workflow you want to enable

OneHotEncoder has a handle_unknown='infrequent_if_exist' option, but OrdinalEncoder has no equivalent. Currently, the value set through unknown_value and the code assigned to infrequent categories (via min_frequency) appear to be different. There is a workaround: figure out the code assigned to the infrequent categories and use the same value for unknown inputs. Still, an option similar to OneHotEncoder's handle_unknown='infrequent_if_exist' seems more intuitive, since we would usually want to treat unseen values as infrequent ones. I'm not sure whether this feature already exists and I'm simply missing it.

Describe your proposed solution

Add a handle_unknown option to OrdinalEncoder, similar to OneHotEncoder's handle_unknown='infrequent_if_exist', so that unknown values (those unseen during training) receive the same encoding as the infrequent categories found during training.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

abhishek0093 added the Needs Triage and New Feature labels on Oct 20, 2023
adrinjalali added the help wanted and Hard labels and removed the Needs Triage label on Dec 1, 2023
@adrinjalali
Member

Sounds reasonable, if there's a PR for it, happy to review.

@eugeneyarovoi
eugeneyarovoi commented Mar 5, 2024

If you eliminate the distinction between infrequent categories and the missing category, there is also no need for the infrequent_categories_ attribute, which can require a lot of storage on a large dataset where many of the infrequent values are unique. Having an option to not store those categories at all within the encoder's state would make it much smaller when pickled. This is a rather important change for being able to use OrdinalEncoder with confidence on high-cardinality columns (where some of the values may still be frequent) without fearing it will massively blow up the size of the final artifacts. The whole point of the new max_categories field seems to be to enable this kind of use on high-cardinality fields (if we use it solely for low-cardinality columns, why do we even need to limit the number of categories?)
