Add "other" / min_frequency option to OneHotEncoder #12153


Closed
amueller opened this issue Sep 24, 2018 · 32 comments · Fixed by #16018

@amueller
Member

The OneHotEncoder should have an option to summarize categories that are not frequent - or we should have another transformer to do that beforehand.
Generally, having a maximum number of categories or a minimum frequency per category would make sense as thresholds. This is similar to what we're doing in CountVectorizer, but I think it's common enough for categorical variables that we should explicitly make it easy to do.
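
For concreteness, a minimal sketch of the kind of summarizing being requested (hand-rolled with pandas; the 'other' label and the 5% threshold are illustrative, not a proposed API):

import pandas as pd

s = pd.Series(['a'] * 50 + ['b'] * 45 + ['c'] * 3 + ['d'] * 2)
freq = s.value_counts(normalize=True)
rare = freq[freq < 0.05].index.tolist()  # categories below a 5% threshold
collapsed = s.replace(rare, 'other')     # summarized before one-hot encoding
collapsed.value_counts()
# a        50
# b        45
# other     5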

@datajanko
Contributor

Actually, I once implemented something along those lines and would like to work on it.
I also built a transformer that was able to perform a linear regression per group. For that purpose, one-hot encoding those variables seemed like too much overhead. There are also cases where one would like to perform the one-hot encoding as late as possible in a pipeline.

So I'd vote for a separate transformer that is simply called from OneHotEncoder if necessary.

@jnothman
Member
jnothman commented Sep 26, 2018 via email

@datajanko
Contributor
datajanko commented Sep 26, 2018

I'll work on this and try to provide a PR as soon as possible.

I just read the _encoders.py source code, and it seems the OrdinalEncoder would profit directly as well. Since both OneHot- and OrdinalEncoder inherit from _BaseEncoder, where we call _encode (which can use numpy's unique, which can return counts), I'd start tackling the issue from there.

Would that make sense?
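
For reference, a minimal illustration of the counts np.unique can already return:

import numpy as np

values = np.array(['a', 'b', 'a', 'c', 'a'])
uniques, counts = np.unique(values, return_counts=True)
# uniques -> ['a' 'b' 'c'], counts -> [3 1 1]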

@jnothman
Member
jnothman commented Sep 26, 2018 via email

@datajanko
Contributor
datajanko commented Oct 2, 2018

Just a quick update:

I'm still working on this, but I feel that a WIP pull request is still too early.

I introduced a function _group_values(values, min_freq=0, group=None) which, for dtypes other than object, calls:

import numpy as np

def _group_values_numpy(values, min_freq=0, group=None):
    # Either infer the infrequent categories from min_freq or use a
    # precomputed group, but not both. (Truth-testing a numpy array
    # raises, so compare against None explicitly.)
    if min_freq and group is not None:
        raise ValueError("pass either min_freq or group, not both")
    if min_freq:
        uniques, counts = np.unique(values, return_counts=True)
        mask = counts / len(values) < min_freq
        group = uniques[mask]
    if group is not None and len(group):
        # Map every infrequent category onto the first grouped value,
        # which acts as the representative ("other") label.
        values[np.isin(values, group)] = group[0]
    return values, group
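
For example (illustrative values):

vals = np.array(['a', 'a', 'a', 'b', 'c'], dtype=object)
_group_values_numpy(vals, min_freq=0.4)
# (array(['a', 'a', 'a', 'b', 'b'], dtype=object),
#  array(['b', 'c'], dtype=object))

Note that the input array is modified in place, and the first grouped value ('b' here) serves as the representative label.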

Essentially, this function could be plugged in in front of e.g. the _encode calls in _BaseEncoder.
This leads to a question about the _fit function, which only calls _encode if categories are inferred automatically. I'd assume that we don't want to apply a frequency threshold if the categories are provided manually, right?

Also, it would be nice to be able to set the 'other' label as desired (currently the first element of the grouped items is used as the representative element, if any such element is found).

So what is the minimal quality a WIP PR should have? (I have some tests for _group_values, by the way.)
Please excuse my slowness. Feel free to provide any comments; I'm here to learn.

@jorisvandenbossche
Member

I think we should have more discussion about what we want exactly as behaviour.

On the PR (#12264 (comment)) I raised the question of "combining all non-frequent categories into one separate category" vs. "putting all non-frequent categories into an existing (first) category", which seems a quite important difference from a usage point of view.

Also, if we consider multiple ways or multiple options, I think it might become too much to put in the OneHotEncoder. E.g. the option to specify either a minimum frequency or a maximum number of categories, the option of how to name the new category (or into which existing category to put the values, see the first question), the option to replace them with a missing value, etc. It might create some "keyword noise" to put all of that in the Ordinal-/OneHotEncoder, and it might warrant a separate transformer class
(which, however, has the clear downside of not being able to share the factorization step).

@amueller
Member Author
amueller commented Oct 4, 2018

I think putting them into a new category is more common in "ML". But we could also just have a parameter like others_col='name_of_category': if that level already exists we add to it; if not, it's created?

@amueller
Member Author
amueller commented Oct 4, 2018

We could put this code in a different class but then add a convenience call into OneHotEncoder? That might also make it more discoverable.

@datajanko
Contributor

I'd also vote (again) for a separate class. One more argument for it: with the "frequency cut-off" implementation, the one-hot and ordinal encodings are not necessarily invertible.

Appending to an existing category is fine with me as well.
How do we want to treat already integer-encoded classes? Append there as well, or add to, say, min(classes) - 1? And what shall we do with float labels? (Not common, but they might appear.)

@amueller
Member Author
amueller commented Oct 4, 2018

I like modularity but I also like convenience ;)

@jorisvandenbossche
Member

I also think that a separate class makes the most sense (at least for the default).

We could put this code in a different class but then add a convenience call into OneHotEncoder? That might also make it more discoverable.

How would you see this in practice, API-wise? (I think it would be nice, I'm just not sure what it would look like.)
One keyword to trigger using this other class with its default values? Or a way to pass through arguments?

@datajanko
Contributor

I would suggest passing through the arguments. Then, as a first step in fit and transform, one alters X by applying the transformer to it, returning a copy if necessary. I think using the class with its default values will not be flexible enough. In that case, in my opinion, it would be better to pass the class object directly, but for the user this would be more cumbersome.

@amueller
Member Author
amueller commented Oct 5, 2018

One keyword to trigger using this other class with its default values? Or a way to pass through arguments?

Yes, basically as for #12294 (though that one has fewer options).
Here we have two parameters, right? A threshold and a column name?
Adding both to OneHotEncoder might not be terrible, and the threshold is a natural way to enable the feature (i.e. have a threshold different from None or something). Though: is the threshold a float or an int? Generally we try to avoid changing semantics with the type, so I guess we would have a min_frequency and a min_count parameter for the threshold, and maybe max_levels (max_categories?).
Adding all of these to OneHotEncoder might be a bit much.

(sorry for rambling lol)
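
A tiny sketch of that point (a hypothetical helper with illustrative names; two explicit parameters instead of one threshold whose meaning depends on its type):

import numpy as np

def _infrequent_mask(counts, n_samples, min_frequency=None, min_count=None):
    # Mark categories as infrequent by fraction and/or absolute count.
    counts = np.asarray(counts)
    mask = np.zeros(counts.shape, dtype=bool)
    if min_frequency is not None:
        mask |= counts / n_samples < min_frequency
    if min_count is not None:
        mask |= counts < min_count
    return mask

_infrequent_mask([5, 20, 15, 2], n_samples=42, min_count=10)
# array([ True, False, False,  True])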

@datajanko
Contributor

So I'd suggest constructing the new transformer. I'd vote for a name like GroupTransformer or GroupEncoder. With such a name, we might in the future even allow passing dictionaries to group certain values together under one class.

Concerning the OneHot- and OrdinalEncoder: what about simply adding a pre_transformer=None keyword (or similar) plus **kwargs, and passing those keyword arguments through to the pre_transformer class?

@amueller
Member Author
amueller commented Oct 5, 2018

yeah let's just go with a separate transformer for now.

@jnothman
Member
jnothman commented Oct 7, 2018 via email

@datajanko
Contributor

Please excuse my inexperience, but how is the decision made on what to do next? Do we follow @amueller's suggestion of writing a new class, or @jnothman's suggestion (if I understand it correctly) of building the functionality directly into the Ordinal- and OneHotEncoder? Or @jnothman, was your comment only targeting adding (many) parameters to the classes?

@datajanko
Contributor

Sorry for the long delay here and for not pushing the discussion further.
I just had some time to modify my approach into a single class:

from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GroupingTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, *, min_count=0, min_freq=0.0, top_n=0, group_name='other'):
        self.min_count = min_count
        self.min_freq = min_freq
        self.top_n = top_n
        self.group_name = group_name  # was hardcoded to 'other' by mistake

    def fit(self, X, y=None):
        n_samples, n_features = X.shape
        counts = []
        groups = []
        other_in_keys = []
        for i in range(n_features):
            cnts = Counter(X[:, i])
            counts.append(cnts)
            # Use a local variable so that fit does not mutate the
            # hyperparameter and top_n=0 means "no limit" for every column.
            top_n = self.top_n if self.top_n else len(cnts)
            labels_to_group = (label for rank, (label, count)
                               in enumerate(cnts.most_common())
                               if (count < self.min_count
                                   or count / n_samples < self.min_freq
                                   or rank >= top_n))
            groups.append(np.array(sorted(set(labels_to_group))))
            other_in_keys.append(self.group_name in cnts)
        self.counts_ = counts
        self.groups_ = groups
        self.other_in_keys_ = other_in_keys
        return self

    def transform(self, X):
        X_t = X.copy()
        _, n_features = X.shape
        for i in range(n_features):
            # Replace every label that was marked infrequent during fit.
            mask = np.isin(X_t[:, i], self.groups_[i])
            X_t[mask, i] = self.group_name
        return X_t

This approach supports min_count, min_freq, and top_n. It is still work in progress, and I'm not yet handling other array dtypes such as int and float.

With

val1 = np.array([['1', '1'], ['2', '3'], ['1', '2'], ['3', '2'], ['1', '5']], dtype=object)
val2 = np.array([['1', '1'], ['2', '3'], ['1', '2'], ['3', '2'], ['1', '4']])
GT = GroupingTransformer(top_n=2, min_freq=.3)

we obtain

GT.fit_transform(val1)
array([['1', 'other'],
       ['other', 'other'],
       ['1', '2'],
       ['other', '2'],
       ['1', 'other']], dtype=object)

and

GT.fit_transform(val2)
array([['1', 'o'],
       ['o', 'o'],
       ['1', '2'],
       ['o', '2'],
       ['1', 'o']], dtype='<U1')

Right now, I'm just recording whether the "other" group name is already contained in the labels.
Additionally, we see that the numpy string array casts our group name "other" to "o" (the array's dtype is '<U1', i.e. strings of length one). Typically, this is not what we want. This might necessitate casting the array to a different dtype, or at least updating the state of other_in_keys_.

Moreover, we only ever append to the "other" group. Nevertheless, I think it should be rather easy to find a new unique group name, of the form other + _ + column_number, or other_name + _ + some fitting digit.
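
A small sketch of such a collision-free name search (a hypothetical helper, not part of the class above):

def _unique_other_name(existing_labels, base='other'):
    # Append an increasing digit until the name clashes with no label.
    name, i = base, 0
    while name in existing_labels:
        i += 1
        name = f'{base}_{i}'
    return name

_unique_other_name({'a', 'b'})      # 'other'
_unique_other_name({'a', 'other'})  # 'other_1'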

So questions we need to clarify:

  1. Which dtypes do we want to handle?
  2. Do we want to allow individual parameters for every column?
  3. How do we want to generate a proper "other" name, specifically for int and float arrays?

Any suggestions? Further comments?

@jnothman
Member
jnothman commented Nov 5, 2018 via email

@datajanko
Contributor
datajanko commented Nov 5, 2018

Of course you are right about the non-determinism when breaking ties. I just wanted to show what it could look like; more work needs to be done there.

One enhancement of the GroupingEncoder I had in mind would be to pass something like: {'label_a': 'grouped_1', 'label_b': 'grouped_1', 'label_c': 'grouped_2', 'label_d': 'grouped_2'}
However, this could be done in other preprocessing steps.

So is the real issue then to implement NaN support first? How will this work with integer-encoded data?

@jnothman
Member
jnothman commented Nov 5, 2018

If integer + NaN, convert to float! I think/hope it should be a relatively safe cast in this context...

One enhancement of the GroupingEncoder I had in mind would be to pass something like: {'label_a': 'grouped_1', 'label_b': 'grouped_1', 'label_c': 'grouped_2', 'label_d': 'grouped_2'}
However, this could be done in other preprocessing steps.

A big -1 from me. What we don't need is more stateless transformers to do arbitrary preprocessing. That's what FunctionTransformer is for. We can't hope to provide an API for a vast range of data transformation needs. It's just not maintainable.

I would rather have something with a lightweight API here.
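
For illustration (a minimal sketch, not from the thread): a fixed relabelling like the dictionary above is stateless, so it can already be expressed with FunctionTransformer:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

mapping = {'label_a': 'grouped_1', 'label_b': 'grouped_1',
           'label_c': 'grouped_2', 'label_d': 'grouped_2'}

# Nothing is learned in fit; the mapping is fixed up front.
relabel = FunctionTransformer(
    np.vectorize(lambda v: mapping.get(v, v), otypes=[object]))

relabel.transform(np.array([['label_a'], ['label_c'], ['unseen']], dtype=object))
# array([['grouped_1'], ['grouped_2'], ['unseen']], dtype=object)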

@amueller
Member Author
amueller commented Nov 5, 2018

Happy to have this in the OHE transformer. Sorry I've been busy.

@amueller
Member Author
amueller commented Nov 5, 2018

Basically conditioned on the fact that we can have a lightweight API ;)

@amueller
Member Author

By the way, there was a discussion on this topic earlier this year, and Sole Galli cited some passages from the 2009 KDD competition proceedings (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf):

Page 4 of the summary and introductory article:
"For categorical variables, grouping of under-represented categories proved to be useful to avoid overfitting. The winners of the fast and the slow track used similar strategies consisting in retaining the most populated categories and coarsely grouping the others in an unsupervised way"

Page 23:
"Most of the learning algorithms we were planning to use do not handle categorical variables, so we needed to recode them. This was done in a standard way, by generating indicator variables for the different values a categorical attribute could take. The only slightly non-standard decision was to limit ourselves to encoding only the 10 most common values of each categorical attribute, rather than all the values, in order to avoid an explosion in the number of features from variables with a huge vocabulary"

Page 36:
"We consolidate the extremely low populated entries (having fewer than 200 examples) with their neighbors to smooth out the outliers. Similarly, we group some categorical variables which have a large number of entries (> 1000 distinct values) into 100 categories."

See the bullet points on page 47.

These might provide some additional pointers / references.

@thomasjpfan thomasjpfan self-assigned this Nov 22, 2019
@rth
Member
rth commented Nov 22, 2019

I currently have a preliminary preprocessor (e.g. a CategoricalPreprocessor) for infrequent categories and NaN handling that I could contribute. Though it only works with dataframes (and relies heavily on pd.Categorical), so maybe it would be more a thing for sklearn-extra.

It could be worthwhile to have a separate estimator for this, even if we keep it private in the end, if only to share code between the OneHotEncoder and OrdinalEncoder?

I also agree with @glemaitre in #14953 (comment) that it would be good to have a meta issue on the behavior we want for categories in OHE. Right now there are multiple issues and PRs that pull it in different directions (with partially overlapping discussions).

@thomasjpfan
Member

@rth I really do want to see all the issues surrounding encoders resolved for 0.23, and I keep a list of all the encoder issues and the PRs revolving around them. However, I think placing all the issues into one meta issue may be kind of overwhelming. I think the issue around missing values has the most overlapping PRs and ideas.

After speaking with @amueller regarding infrequent categories, I propose this API:

Infrequent one-hot encoder

ohe = OneHotEncoder(sparse=False, handle_unknown='error', min_frequency=10)
X_train = np.array([['a'] * 5 + ['b'] * 20 + ['c'] * 15 + ['d'] * 2]).T
ohe.fit(X_train)

X_test = np.array([['a', 'b', 'c', 'd']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)  # errors because there is an unknown

Change handle_unknown to 'auto'

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
ohe.fit(X_train)

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [0, 0, 1]]  # unknown is mapped to the infrequent bin

No infrequent categories in training

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
X_train = np.array([['a'] * 11 + ['b'] * 20 + ['c'] * 15 + ['d'] * 12]).T
ohe.fit(X_train)

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[1, 0, 0, 0],
#  [0, 0, 0, 0]]  # the same as ignore

This API uses 'auto' to mean the following:

  1. If there are infrequent categories during fit, then during transform, unknown values are mapped to the infrequent bin.
  2. If there are no infrequent categories during fit, then during transform, unknown values will be encoded as all zeros. This is the same behavior as handle_unknown='ignore'.

With this 'auto', we can deprecate 'ignore'.

@jnothman
Member
jnothman commented Dec 4, 2019 via email

@rth
Member
rth commented Dec 4, 2019

I really do want to see all the issues surrounding encoders resolve for 0.23. I do keep a list of all the encoder issues and PRs revolving them.

It would definitely be good to have this addressed for the next release. I tried to write a summary in #15796, but you (and other people in this thread) have spent much more time on this, so don't hesitate to comment/edit if I missed or misunderstood something.

@catherinenelson1

I would love to see a min_frequency option in the OneHotEncoder and would use it frequently :-) Thanks for all the work on this; I'm looking forward to seeing it released.

@zas97
zas97 commented May 19, 2021

A hack to create a transformer that does exactly one-hot encoding with frequency filtering is to use

CountVectorizer(lowercase=False, min_df=0.01, max_df=1., token_pattern=r"^.*$")

where min_df is the minimum frequency and max_df is the maximum frequency. (Each column has to be passed separately as a 1-d sequence of strings, and categories below min_df are simply dropped from the vocabulary, i.e. encoded as all zeros, rather than grouped into an "other" category.)
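
A quick sketch of that hack on a single column (the data and the 6% threshold are made up):

from sklearn.feature_extraction.text import CountVectorizer

# One categorical column, passed as a 1-d sequence of strings.
col = ['a'] * 5 + ['b'] * 90 + ['c'] * 5

vec = CountVectorizer(lowercase=False, min_df=0.06, max_df=1.0,
                      token_pattern=r"^.*$")
X = vec.fit_transform(col)
sorted(vec.vocabulary_)  # ['b'] -- 'a' and 'c' fall below min_df
X[:1].toarray()          # [[0]] -- an infrequent 'a' encodes as all zeros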

@glmcdona
glmcdona commented Nov 29, 2021

This would be a really useful feature. @thomasjpfan, by any chance are you still working on getting this feature merged? Anything you'd like a hand with?

@thomasjpfan
Member

Currently, #16018 is basically complete and needs a second approval from a maintainer.
