Add "other" / min_frequency option to OneHotEncoder #12153
Actually, I once implemented something along those lines and would like to work on that. So I'd vote for a separate transformer that is simply called in OneHotEncoder if necessary |
PR welcome, Jan. Keep the options simple for now, please
|
I'll work on this and try to provide a PR as soon as possible. I just read the _encoders.py source code and it seems that the OrdinalEncoder would also profit directly. As both OneHot- and OrdinalEncoder inherit from _BaseEncoder, where we call _encode (which could use numpy's unique, which can return counts), I'd start tackling the issue from there. Would that make sense? |
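A minimal sketch of the idea above: `np.unique(..., return_counts=True)` already yields the per-category counts, from which infrequent categories can be flagged. The helper name and the `min_frequency` threshold are hypothetical here, not scikit-learn's actual `_encode`:

```python
import numpy as np

def infrequent_mask(column, min_frequency=10):
    # Hypothetical helper: flag categories whose count
    # falls below min_frequency, using np.unique's counts.
    categories, counts = np.unique(column, return_counts=True)
    return categories, counts < min_frequency

col = np.array(["a"] * 5 + ["b"] * 20 + ["c"] * 2)
cats, mask = infrequent_mask(col, min_frequency=10)
# 'a' (5) and 'c' (2) fall below the threshold; 'b' (20) does not
```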
Yes, that sounds sensible
|
Just a quick update: I'm still working on this, but I feel that a WIP pull request is still too early. I introduced a function
essentially, this function could be plugged in in front of e.g. the Also, it would be nice to set the 'other' label as desired (currently the first element of the grouped items is used as the representative element, if one such element is found). So what is the minimal quality a WIP PR should have? (I have some tests for |
I think we should have more discussion about what exactly we want as behaviour. On the PR (#12264 (comment)) I raised the question of "combining all non-frequent categories in one separate category" vs "putting all non-frequent categories into an existing (first) category", which seems a quite important difference from a usage point of view. Also, if we would consider multiple ways or multiple options, I think it might become too much to put in the OneHotEncoder. E.g., if you want the option to specify either a minimum frequency or a maximum number of categories, the option of how to name the new category (or in which existing category to put it, see first question), or the option to replace them with a missing value, etc., it might give some "keyword noise" to put all of that in the Ordinal-/OneHotEncoder, and it might warrant a separate transformer class. |
I think putting them into a new category is more common in "ML". But we could also just have a parameter that is |
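The "new category" option discussed above can be sketched as a pre-encoding step that collapses all rare values into a single label. The function name and the `other` label are illustrative assumptions, not an existing scikit-learn API:

```python
import numpy as np

def group_infrequent(column, min_frequency=10, other_label="other"):
    # Hypothetical pre-encoding step: map every value whose
    # frequency is below min_frequency onto one new category.
    categories, counts = np.unique(column, return_counts=True)
    infrequent = set(categories[counts < min_frequency])
    return np.array([other_label if v in infrequent else v for v in column])

col = np.array(["a"] * 5 + ["b"] * 20 + ["d"] * 2)
grouped = group_infrequent(col)
# 'a' and 'd' are rare, so all 7 of their rows collapse into 'other'
```

The alternative raised in the discussion, folding rare values into an existing (first) category instead, would simply replace `other_label` with that category's value.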
We could put this code in a different class but then add a convenience call into |
I'd also vote (again) for a separate class. One more argument for a separate class: one-hot and ordinal encoder are not necessarily invertible using the "frequency cut off"-implementation. Appending to an existing category is fine for me as well. |
I like modularity but I also like convenience ;) |
I also think that a separate class makes the most sense (at least for the default).
How would you see this in practice API-wise? (I think it would be nice, but I'm just not sure what it would look like) |
I would suggest to pass through the arguments. Then in |
Yes, basically as for #12294 (though that one has less options). (sorry for rambling lol) |
So for me, I'd suggest constructing the new transformer. I'd vote for something like GroupTransformer or GroupEncoder. With that name, we might even pass dictionaries in the future to group certain values together under one class. Concerning the OneHot- and OrdinalEncoder: What about simply adding a |
yeah let's just go with a separate transformer for now. |
I think I'd prefer extending OneHotEncoder and OrdinalEncoder, for
visibility and perhaps because we should have different implementations for
each. Also, note that as a post-processor, this can be done for
OneHotEncoder with univariate feature selection.
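One way to read the post-processing remark above: after one-hot encoding, a rare category produces a near-constant binary column, so a selector can drop it. The sketch below uses VarianceThreshold as a stand-in (the comment mentions univariate feature selection generally); a binary column for a category with frequency p has variance p(1 - p), so a threshold of 0.09 drops categories seen in under roughly 10% of rows:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

# Post-processing sketch: drop one-hot columns of rare categories.
# 'a' occurs in 2/40 rows -> variance 0.05 * 0.95 = 0.0475 < 0.09,
# so its column is removed; 'b' and 'c' survive.
pipe = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    VarianceThreshold(threshold=0.09),
)
X = np.array([["a"] * 2 + ["b"] * 20 + ["c"] * 18]).T
Xt = pipe.fit_transform(X)  # 2 of the 3 one-hot columns remain
```

Unlike a `min_frequency` option, this maps rare categories to all-zero rows rather than to an "other" bin, which is exactly the behavioural difference debated in this thread.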
I'm okay with allowing parameter semantics to be determined by type (int vs
float) as long as the denominator is obvious and fractions greater than or
equal to one make no sense. So for something like number of samples in
resampling, which could be less than, equal to or greater than 100%,
determination by type is bad because 1. and 1 would mean different things.
That is, `x = int(x * d) if 0 < x < 1 else x` tends to be safe (subject to
considering rounding).
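The type-based convention above can be made concrete. This is an illustrative helper, not an existing scikit-learn function; `d` from the quoted expression is taken to be the number of samples:

```python
def resolve_min_frequency(min_frequency, n_samples):
    # Sketch of the convention discussed above: a float strictly
    # between 0 and 1 is read as a fraction of n_samples; anything
    # else is taken as an absolute count. Note that 1.0 and 1 then
    # mean the same thing, which is what makes the convention safe.
    if isinstance(min_frequency, float) and 0.0 < min_frequency < 1.0:
        return int(min_frequency * n_samples)
    return int(min_frequency)

resolve_min_frequency(0.1, 200)  # fraction of samples -> 20
resolve_min_frequency(10, 200)   # absolute count -> 10
```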
|
Please excuse my inexperience, but how is the decision made on what to do next? Should I follow @amueller's suggestion of writing a new class, or @jnothman's suggestion (if I understand it correctly) of building the functionality directly into the Ordinal- and OneHotEncoder? Or @jnothman, was your comment only targeting adding (many) parameters to the classes? |
Sorry for the long delay here and not pushing the discussion further.
This approach supports With
we obtain
and
Right now, I'm just checking whether the "other" group name is already contained in the labels. Moreover, we are only appending to the other group. Nevertheless, I think it should be rather easy to find a new unique column name, of the form other + _ + column_number/other_name + _ + some_fitting_digit. So, questions we need to clarify:
Any suggestions? Further comments? |
if we support NaN in OneHotEncoder and OrdinalEncoder, I would consider
making the other group NaN by default, as its handling downstream would be
similar to missing values.
Btw your code will not be deterministic for breaking ties in identifying
the top n.
I'm not yet convinced by this separate transformer. It makes things
somewhat less convenient than just handling the usual min_freq (absolute or
relative) case in the existing transformers. Top n categories can usually
be identified reliably without worrying about frequencies under cross
validation and without the mess of arbitrary tie breaking.
|
Of course you are right about the non-determinism when breaking ties. I just wanted to show what it could look like; more work needs to be done there. One enhancement of the GroupingEncoder, as I had in mind, would be to pass something like: So is the real issue then to implement NaN support first? How will this work with integer-encoded data? |
If integer + NaN, convert to float! I think/hope it should be a relatively safe cast in this context...
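The "relatively safe cast" mentioned above, sketched with NumPy (marking category code 2 as the NaN group is just an example):

```python
import numpy as np

# Integer arrays cannot hold NaN, so marking a group as NaN forces a
# cast to float first. For category codes this is lossless: float64
# represents all integers up to 2**53 exactly, hence "relatively safe".
codes = np.array([0, 1, 2, 1], dtype=np.int64)
as_float = codes.astype(np.float64)
as_float[as_float == 2] = np.nan  # e.g. mark code 2 as the "other"/missing group
```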
A big -1 from me. What we don't need is more stateless transformers to do arbitrary preprocessing. That's what FunctionTransformer is for. We can't hope to provide an API for a vast range of data transformation needs. It's just not maintainable. I would rather something with a lightweight API here. |
Happy to have this in the OHE transformer. Sorry I've been busy. |
Basically conditioned on the fact that we can have a lightweight API ;) |
btw there was a discussion on this topic earlier this year and Sole Galli cited some things from the 2009 KDD competition http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf:
These might provide some additional pointers / references. |
I currently have a preliminary preprocessor for this. It could be worthwhile having a separate estimator for this maybe, even if we keep it private in the end, if only to share code between the OneHotEncoder and OrdinalEncoder. I also agree with @glemaitre in #14953 (comment) that it would be good to have a meta issue on the behavior we want for categories in the OHE. Right now there are multiple issues and PRs that pull it in different directions (with partially overlapping discussions). |
@rth I really do want to see all the issues surrounding encoders resolved for 0.23. I do keep a list of all the encoder issues and PRs revolving around them, although I think placing all the issues into one meta issue may be kind of overwhelming. I think the issue around missing values has the most overlapping PRs and ideas. After speaking with @amueller regarding infrequent categories, I propose this API:

Infrequent one hot encoder

ohe = OneHotEncoder(sparse=False, handle_unknown='error', min_frequency=10)
X_train = np.array([['a'] * 5 + ['b'] * 20 + ['c'] * 15 + ['d'] * 2]).T
ohe.fit(X_train)
X_test = np.array([['a', 'b', 'c', 'd']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]
X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)  # errors because there is an unknown category

Change handle_unknown to 'auto'

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
ohe.fit(X_train)
X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [0, 0, 1]]  # the unknown category is mapped to the infrequent bin

No infrequent categories in training

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
X_train = np.array([['a'] * 11 + ['b'] * 20 + ['c'] * 15 + ['d'] * 12]).T
ohe.fit(X_train)
X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[1, 0, 0, 0],
#  [0, 0, 0, 0]]  # the same as handle_unknown='ignore'

This API uses
With this ' |
Looks reasonable
|
It would definitely be good to have this addressed for the next release. I tried to write a summary in #15796, but you (and other people in this thread) have spent much more time on it, so don't hesitate to comment/edit if I missed or misunderstood something. |
I would love to see a min_frequency option in the OneHotEncoder and would use it frequently :-) thanks for all the work on this, and looking forward to seeing it released in the future. |
A hack to create a transformer that does exactly one-hot encoding with frequency filtering is to use
where min_df is the min frequency and max_df is the max frequency |
This would be a really useful feature. @thomasjpfan, by any chance are you still working on getting this feature merged? Anything you'd like a hand with? |
Currently, #16018 is basically complete and needs a second approval from a maintainer. |
The OneHotEncoder should have an option to summarize categories that are not frequent - or we should have another transformer to do that beforehand.
Generally having a maximum number of categories or having a minimum frequency per category would make sense as thresholds. This is similar to what we're doing in CountVectorizer but I think common enough for categorical variables that we should explicitly make it easy to do.