Handle Error Policy in OrdinalEncoder #13488
Would #12264 meet your use cases? What specific behaviour do you seek? What use cases?
No, #12264 is not my case, but it is desirable too. I would like `OrdinalEncoder` to offer the same error-handling policy for unknown values that `OneHotEncoder` has.
What would it do rather than throw an error?
It could ignore unknown category values, replacing them with a default value specified in the constructor.
The policy of allowing a fallback category value seems reasonable to me. @daskol could you please describe your particular use case that motivates this change?
We discussed the issue with @jorisvandenbossche and I think the sanest strategy would be to have:
Personally I find this issue really annoying. At the moment we cannot use `OrdinalEncoder` whenever the data passed to `transform` may contain categories that were not seen during `fit`.
Also see #12153 for something related. I think `min_frequency` is good, but I also want `max_levels` or something like that. Basically we could reimplement all the different pruning options we have in `CountVectorizer`...
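For later readers, a small sketch of the kind of pruning being discussed here, using the `min_frequency` option that newer scikit-learn releases (1.1+) later added to `OneHotEncoder`; this is an illustration of the idea, not an option that existed when this comment was made:

```python
from sklearn.preprocessing import OneHotEncoder

X = [['a'], ['a'], ['a'], ['b'], ['c']]

# Categories seen fewer than 2 times are grouped into a single
# "infrequent" bucket (scikit-learn >= 1.1).
enc = OneHotEncoder(min_frequency=2, handle_unknown='infrequent_if_exist').fit(X)
print(enc.get_feature_names_out())  # ['x0_a' 'x0_infrequent_sklearn']
```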
If we add a "rare" category to `OrdinalEncoder`, then ideally it can handle most of the logic that deals with unknown and infrequent categories, and `OneHotEncoder` would need to do less (I am thinking of composition, i.e. `OneHotEncoder` delegating that logic to `OrdinalEncoder`).
Hi! I wanted to add that the feature of allowing the encoder to be fit on data that contains some categories, and then applied to data that contains maybe an additional category or two, seems like a common use case for all kinds of categorical data. Even if no single category is very uncommon, if you're doing a random split for training and validation it's only a matter of time before this error comes up.

I like the `min_frequency` solution for its generality, but to (naive) me it seems too complicated. To me, the default behavior should be to send all categories not present in the original fitting to a single virtual category, or maybe there could be a `create_virtual_category=True` option. If this is amenable, I'd be happy to take a crack at making it; I'm trying to spend more time working on open source code!
Nathaniel, should the virtual category be the first or the last category? Here we presume order matters...
I didn't pick that up from the docs... My understanding was that the encoder assigned essentially arbitrary integer values to each category, but you're saying it assigns them based on frequency or something else, which means that order matters? If that's the case, I'd like to suggest an update to the docs, because I didn't quite understand that from them. If it's not, then couldn't we just stick the virtual category at the end of the original ordinal categories?
No, I just mean that if you want an ordinal encoding, rather than one-hot, it will often be because order matters. You're certainly welcome to help us improve the documentation by submitting specific changes as a pull request.
@ogrisel, IIRC you were one of the people who wanted this encoder. Can you give examples of your motivation?
I totally agree @amueller. I was using it for preprocessing data before LightGBM, which requires integer data but lets you flag certain columns as categorical, so the order does NOT matter. I am confused about what "order matters" means, because by its very nature there is no correct order of the classes. I'm not sure that an "order matters" `OrdinalEncoder` makes sense to me when no order is specified.
I think we never intended the behaviour to always require lexicographic ordering... we just haven't got around to fixing that (taking the order specified by the user, or from a pandas categorical dtype ordering). Should that be a priority?
Wait... you aren't constrained to lexicographic ordering in `OrdinalEncoder`: `OrdinalEncoder([['S', 'M', 'L', 'XL']])` works.
I think we should force the user to pass the order of the categories when they are strings. I opened #14563.
Friends, could you speed up this fix? Can you at least do what @daskol wrote above: just ignore unknown categories in `transform` and replace the unknown value with a default one, which could be specified in `OrdinalEncoder`'s constructor.
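For later readers: the behaviour requested here roughly corresponds to the `handle_unknown='use_encoded_value'` / `unknown_value` options that were eventually added to `OrdinalEncoder` in scikit-learn 0.24. A minimal sketch, assuming such a version:

```python
from sklearn.preprocessing import OrdinalEncoder

# Categories unseen during fit are mapped to unknown_value instead of
# raising an error (requires scikit-learn >= 0.24).
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['cat'], ['dog']])
print(enc.transform([['dog'], ['fish']]))  # [[ 1.] [-1.]]
```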
Does the order mean anything in your use case, or is ordinal encoding just a way of encoding string-valued features?
Order is not important.
I think this would be a really helpful change. Given the choice of a default virtual encoding for unseen values, I would rather it be the first value than the last value. I feel this is intuitive because you always know where the first value is without looking (e.g. if the default encoding were -1 or 0).

As a practical use case, I am converting categorical data to numeric data to use with `RandomForestRegressor`. Some of my categories are quite small (~5 types) while some are larger (~5000 types). I would like to use `OneHotEncoder` on the smaller ones because it makes my features more interpretable when I look at their permutation importance. I would like to use `OrdinalEncoder` on the larger ones because it makes operations like permutation importance less computationally expensive, and my model choice is robust to the `OrdinalEncoder`'s ordering of the categories. However, I will no doubt encounter examples in my test data for the larger categories that are not present in my training dataset, so if I use `OrdinalEncoder` it will throw an error. My other alternative is to do something hacky with pandas or `DictVectorizer`.
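A rough sketch of the setup described above, with hypothetical column names, using `ColumnTransformer` to one-hot encode the small categorical columns and ordinally encode the large ones; note that the `OrdinalEncoder` branch still raises on unseen categories, which is exactly the problem discussed in this issue:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

small_cols = ['color']     # hypothetical ~5-category feature
large_cols = ['zip_code']  # hypothetical ~5000-category feature

pre = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), small_cols),
    ('ordinal', OrdinalEncoder(), large_cols),  # errors on unseen categories
])
model = make_pipeline(pre, RandomForestRegressor())
```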
That's not true though. Oftentimes people use `OrdinalEncoder` because `OneHotEncoder` would expand the feature array too much, so filling in a variable with arbitrary numbers is a more space-efficient option that doesn't make too much of a difference in tree ensembles.
I encountered a similar need. I was hoping to use sklearn to encode categorical features prior to passing them to an Embedding layer in Keras. In this case, the order of the categorical features does not matter, and it would be helpful to be able to map any unknown categories encountered at transform time to some default value. My solution for myself was to make a new encoder subclass I called CardinalEncoder, whose transform handles unknown categories instead of raising an error. My solution is at the repository linked below.
Great idea!
jdraines/cardinal_encoder: Implements a Scikit-Learn CardinalEncoder which differs from OrdinalEncoder in that it handles unknowns. |
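This is not the linked implementation, but a minimal sketch of the same idea (hypothetical class name): learn the categories at fit time and map anything unseen to a reserved code 0 at transform time.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class UnknownSafeOrdinalEncoder(BaseEstimator, TransformerMixin):
    """Integer-encode each column; categories unseen during fit map to 0."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # Reserve code 0 for "unknown"; known categories get codes 1..n.
        self.mappings_ = [
            {cat: i + 1 for i, cat in enumerate(np.unique(col))}
            for col in X.T
        ]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        out = np.zeros(X.shape, dtype=np.int64)
        for j, mapping in enumerate(self.mappings_):
            out[:, j] = [mapping.get(v, 0) for v in X[:, j]]
        return out


# Example: 'c' was not seen during fit, so it is encoded as 0.
enc = UnknownSafeOrdinalEncoder().fit([['a'], ['b']])
print(enc.transform([['b'], ['c']]))  # [[2] [0]]
```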
Preprocessor class `OneHotEncoder` allows transformation if unknown values are found. It would be great to introduce the same option to `OrdinalEncoder`. It seems simple to do since `OrdinalEncoder` (as well as `OneHotEncoder`) is derived from `_BaseEncoder`, which actually implements the error-handling policy.
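For context, a short example of the asymmetry described in the issue (scikit-learn 0.20 behaviour):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X_fit = [['a'], ['b']]
X_new = [['c']]  # category unseen during fit

ohe = OneHotEncoder(handle_unknown='ignore').fit(X_fit)
print(ohe.transform(X_new).toarray())  # [[0. 0.]] -- all-zero row, no error

oe = OrdinalEncoder().fit(X_fit)
oe.transform(X_new)  # raises ValueError (unknown category found)
```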