Add "other" / min_frequency option to OneHotEncoder #12153


Closed
amueller opened this issue Sep 24, 2018 · 32 comments · Fixed by #16018

@amueller
Member

The OneHotEncoder should have an option to summarize categories that are not frequent - or we should have another transformer to do that beforehand.
Generally, having a maximum number of categories or a minimum frequency per category would make sense as thresholds. This is similar to what we're doing in CountVectorizer, but I think it's common enough for categorical variables that we should explicitly make it easy to do.
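
For concreteness, a minimal sketch of the kind of summarizing being requested (hand-rolled with pandas; the 'other' label and the 5% threshold are illustrative, not a proposed API):

import pandas as pd

s = pd.Series(['a'] * 50 + ['b'] * 45 + ['c'] * 3 + ['d'] * 2)
freq = s.value_counts(normalize=True)
rare = freq[freq < 0.05].index.tolist()  # categories below a 5% threshold
collapsed = s.replace(rare, 'other')     # summarized before one-hot encoding
collapsed.value_counts()
# a        50
# b        45
# other     5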

@datajanko
Contributor

Actually, I once implemented something along those lines and would like to work on it.
I also built a transformer that was able to perform a linear regression per group. For that purpose, one-hot encoding those variables seemed like too much overhead. There are also cases where one would like to perform the one-hot encoding as late as possible in a pipeline.

So I'd vote for a separate transformer that is simply called from OneHotEncoder if necessary.

@jnothman
Member
jnothman commented Sep 26, 2018 via email

@datajanko
Contributor
datajanko commented Sep 26, 2018

I'll work on this and try to provide a PR as soon as possible.

I just read the _encoders.py source code, and it seems the OrdinalEncoder would profit directly as well. Since both OneHot- and OrdinalEncoder inherit from _BaseEncoder, where we call _encode (which can use numpy's unique, which can return counts), I'd start tackling the issue from there.

Would that make sense?
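
For reference, a minimal illustration of the counts np.unique can already return:

import numpy as np

values = np.array(['a', 'b', 'a', 'c', 'a'])
uniques, counts = np.unique(values, return_counts=True)
# uniques -> ['a' 'b' 'c'], counts -> [3 1 1]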

@jnothman
Member
jnothman commented Sep 26, 2018 via email

@datajanko
Contributor
datajanko commented Oct 2, 2018

Just a quick update:

I'm still working on this, but I feel that a WIP pull request is still too early.

I introduced a function _group_values(values, min_freq=0, group=None) which, for dtypes other than object, calls:

import numpy as np

def _group_values_numpy(values, min_freq=0, group=None):
    # Either infer the infrequent categories from min_freq or use a
    # precomputed group, but not both. (Truth-testing a numpy array
    # raises, so compare against None explicitly.)
    if min_freq and group is not None:
        raise ValueError("pass either min_freq or group, not both")
    if min_freq:
        uniques, counts = np.unique(values, return_counts=True)
        mask = counts / len(values) < min_freq
        group = uniques[mask]
    if group is not None and len(group):
        # Map every infrequent category onto the first grouped value,
        # which acts as the representative ("other") label.
        values[np.isin(values, group)] = group[0]
    return values, group
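
For example (illustrative values):

vals = np.array(['a', 'a', 'a', 'b', 'c'], dtype=object)
_group_values_numpy(vals, min_freq=0.4)
# (array(['a', 'a', 'a', 'b', 'b'], dtype=object),
#  array(['b', 'c'], dtype=object))

Note that the input array is modified in place, and the first grouped value ('b' here) serves as the representative label.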

Essentially, this function could be plugged in in front of e.g. the _encode calls in _BaseEncoder.
This leads to a question about the _fit function, which only calls _encode if categories are inferred automatically. I'd assume that we don't want to apply a frequency threshold if the categories are provided manually, right?

Also, it would be nice to be able to set the 'other' label as desired (currently the first element of the grouped items is used as the representative element, if any such element is found).

So what is the minimal quality a WIP PR should have? (I have some tests for _group_values, by the way.)
Please excuse my slowness. Feel free to provide any comments; I'm here to learn.

@jorisvandenbossche
Member

I think we should have more discussion about what we want exactly as behaviour.

On the PR (#12264 (comment)) I raised the question of "combining all non-frequent categories into one separate category" vs. "putting all non-frequent categories into an existing (first) category", which seems a quite important difference from a usage point of view.

Also, if we consider multiple ways or multiple options, I think it might become too much to put in the OneHotEncoder. E.g. the option to specify either a minimum frequency or a maximum number of categories, the option of how to name the new category (or into which existing category to put the values, see the first question), the option to replace them with a missing value, etc. It might create some "keyword noise" to put all of that in the Ordinal-/OneHotEncoder, and it might warrant a separate transformer class
(which, however, has the clear downside of not being able to share the factorization step).

@amueller
Member Author
amueller commented Oct 4, 2018

I think putting them into a new category is more common in "ML". But we could also just have a parameter like others_col='name_of_category': if that level already exists we add to it; if not, it's created?

@amueller
Member Author
amueller commented Oct 4, 2018

We could put this code in a different class but then add a convenience call into OneHotEncoder? That might also make it more discoverable.

@datajanko
Contributor

I'd also vote (again) for a separate class. One more argument for it: with the "frequency cut-off" implementation, the one-hot and ordinal encodings are not necessarily invertible.

Appending to an existing category is fine with me as well.
How do we want to treat already integer-encoded classes? Append there as well, or add to, say, min(classes) - 1? And what shall we do with float labels? (Not common, but they might appear.)

@amueller
Member Author
amueller commented Oct 4, 2018

I like modularity but I also like convenience ;)

@jorisvandenbossche
Member

I also think that a separate class makes the most sense (at least for the default).

We could put this code in a different class but then add a convenience call into OneHotEncoder? That might also make it more discoverable.

How would you see this in practice, API-wise? (I think it would be nice, I'm just not sure what it would look like.)
One keyword to trigger using this other class with its default values? Or a way to pass through arguments?

@datajanko
Contributor

I would suggest passing through the arguments. Then, as a first step in fit and transform, one alters X by applying the transformer to it, returning a copy if necessary. I think using the class with its default values will not be flexible enough. In that case, in my opinion, it would be better to pass the class object directly, but for the user this would be more cumbersome.

@amueller
Member Author
amueller commented Oct 5, 2018

One keyword to trigger using this other class with its default values? Or a way to pass through arguments?

Yes, basically as for #12294 (though that one has fewer options).
Here we have two parameters, right? A threshold and a column name?
Adding both to OneHotEncoder might not be terrible, and the threshold is a natural way to enable the feature (i.e. have a threshold different from None or something). Though: is the threshold a float or an int? Generally we try to avoid changing semantics with the type, so I guess we would have a min_frequency and a min_count parameter for the threshold, and maybe max_levels (max_categories?).
Adding all of these to OneHotEncoder might be a bit much.

(sorry for rambling lol)
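
A tiny sketch of that point (a hypothetical helper with illustrative names; two explicit parameters instead of one threshold whose meaning depends on its type):

import numpy as np

def _infrequent_mask(counts, n_samples, min_frequency=None, min_count=None):
    # Mark categories as infrequent by fraction and/or absolute count.
    counts = np.asarray(counts)
    mask = np.zeros(counts.shape, dtype=bool)
    if min_frequency is not None:
        mask |= counts / n_samples < min_frequency
    if min_count is not None:
        mask |= counts < min_count
    return mask

_infrequent_mask([5, 20, 15, 2], n_samples=42, min_count=10)
# array([ True, False, False,  True])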

@datajanko
Contributor

So I'd suggest constructing the new transformer. I'd vote for a name like GroupTransformer or GroupEncoder. With such a name, we might in the future even allow passing dictionaries to group certain values together under one class.

Concerning the OneHot- and OrdinalEncoder: what about simply adding a pre_transformer=None keyword (or similar) plus **kwargs, and passing those keyword arguments through to the pre_transformer class?

@amueller
Member Author
amueller commented Oct 5, 2018

yeah let's just go with a separate transformer for now.

@jnothman
Member
jnothman commented Oct 7, 2018 via email

@datajanko
Contributor

Please excuse my inexperience, but how is the decision made on what to do next? Do we follow @amueller's suggestion of writing a new class, or @jnothman's suggestion (if I understand it correctly) of building the functionality directly into the Ordinal- and OneHotEncoder? Or @jnothman, was your comment only targeting adding (many) parameters to the classes?

@datajanko
Contributor

Sorry for the long delay here and for not pushing the discussion further.
I just had some time to modify my approach into a single class:

from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class GroupingTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, *, min_count=0, min_freq=0.0, top_n=0, group_name='other'):
        self.min_count = min_count
        self.min_freq = min_freq
        self.top_n = top_n
        self.group_name = group_name  # was hardcoded to 'other' by mistake

    def fit(self, X, y=None):
        n_samples, n_features = X.shape
        counts = []
        groups = []
        other_in_keys = []
        for i in range(n_features):
            cnts = Counter(X[:, i])
            counts.append(cnts)
            # Use a local variable so that fit does not mutate the
            # hyperparameter and top_n=0 means "no limit" for every column.
            top_n = self.top_n if self.top_n else len(cnts)
            labels_to_group = (label for rank, (label, count)
                               in enumerate(cnts.most_common())
                               if (count < self.min_count
                                   or count / n_samples < self.min_freq
                                   or rank >= top_n))
            groups.append(np.array(sorted(set(labels_to_group))))
            other_in_keys.append(self.group_name in cnts)
        self.counts_ = counts
        self.groups_ = groups
        self.other_in_keys_ = other_in_keys
        return self

    def transform(self, X):
        X_t = X.copy()
        _, n_features = X.shape
        for i in range(n_features):
            # Replace every label that was marked infrequent during fit.
            mask = np.isin(X_t[:, i], self.groups_[i])
            X_t[mask, i] = self.group_name
        return X_t

This approach supports min_count, min_freq, and top_n. It is still work in progress, and I'm not yet handling other array dtypes such as int and float.

With

val1 = np.array([['1', '1'], ['2', '3'], ['1', '2'], ['3', '2'], ['1', '5']], dtype=object)
val2 = np.array([['1', '1'], ['2', '3'], ['1', '2'], ['3', '2'], ['1', '4']])
GT = GroupingTransformer(top_n=2, min_freq=.3)

we obtain

GT.fit_transform(val1)
array([['1', 'other'],
       ['other', 'other'],
       ['1', '2'],
       ['other', '2'],
       ['1', 'other']], dtype=object)

and

GT.fit_transform(val2)
array([['1', 'o'],
       ['o', 'o'],
       ['1', '2'],
       ['o', '2'],
       ['1', 'o']], dtype='<U1')

Right now, I'm just recording whether the "other" group name is already contained in the labels.
Additionally, we see that the numpy string array casts our group name "other" to "o" (the array's dtype is '<U1', i.e. strings of length one). Typically, this is not what we want. This might necessitate casting the array to a different dtype, or at least updating the state of other_in_keys_.

Moreover, we only ever append to the "other" group. Nevertheless, I think it should be rather easy to find a new unique group name, of the form other + _ + column_number, or other_name + _ + some fitting digit.
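
A small sketch of such a collision-free name search (a hypothetical helper, not part of the class above):

def _unique_other_name(existing_labels, base='other'):
    # Append an increasing digit until the name clashes with no label.
    name, i = base, 0
    while name in existing_labels:
        i += 1
        name = f'{base}_{i}'
    return name

_unique_other_name({'a', 'b'})      # 'other'
_unique_other_name({'a', 'other'})  # 'other_1'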

So questions we need to clarify:

  1. Which dtypes do we want to handle?
  2. Do we want to allow individual parameters for every column?
  3. How do we want to generate a proper "other" name, specifically for int and float arrays?

Any suggestions? Further comments?

@jnothman
Member
jnothman commented Nov 5, 2018 via email

@datajanko
Contributor
datajanko commented Nov 5, 2018

Of course you are right about the non-determinism when breaking ties. I just wanted to show what it could look like; more work needs to be done there.

One enhancement of the GroupingEncoder I had in mind would be to pass something like: {'label_a': 'grouped_1', 'label_b': 'grouped_1', 'label_c': 'grouped_2', 'label_d': 'grouped_2'}
However, this could be done in other preprocessing steps.

So is the real issue then to implement NaN support first? How will this work with integer-encoded data?

@jnothman
Member
jnothman commented Nov 5, 2018

If integer + NaN, convert to float! I think/hope it should be a relatively safe cast in this context...

One enhancement of the GroupingEncoder I had in mind would be to pass something like: {'label_a': 'grouped_1', 'label_b': 'grouped_1', 'label_c': 'grouped_2', 'label_d': 'grouped_2'}
However, this could be done in other preprocessing steps.

A big -1 from me. What we don't need is more stateless transformers to do arbitrary preprocessing. That's what FunctionTransformer is for. We can't hope to provide an API for a vast range of data transformation needs. It's just not maintainable.

I would rather have something with a lightweight API here.
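
For illustration (a minimal sketch, not from the thread): a fixed relabelling like the dictionary above is stateless, so it can already be expressed with FunctionTransformer:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

mapping = {'label_a': 'grouped_1', 'label_b': 'grouped_1',
           'label_c': 'grouped_2', 'label_d': 'grouped_2'}

# Nothing is learned in fit; the mapping is fixed up front.
relabel = FunctionTransformer(
    np.vectorize(lambda v: mapping.get(v, v), otypes=[object]))

relabel.transform(np.array([['label_a'], ['label_c'], ['unseen']], dtype=object))
# array([['grouped_1'], ['grouped_2'], ['unseen']], dtype=object)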

@amueller
Member Author
amueller commented Nov 5, 2018

Happy to have this in the OHE transformer. Sorry I've been busy.

@amueller
Member Author
amueller commented Nov 5, 2018

Basically conditioned on the fact that we can have a lightweight API ;)

@amueller
Member Author

By the way, there was a discussion on this topic earlier this year, and Sole Galli cited some passages from the 2009 KDD competition proceedings (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf):

Page 4 of the summary and introductory article:
"For categorical variables, grouping of under-represented categories proved to be useful to avoid overfitting. The winners of the fast and the slow track used similar strategies consisting in retaining the most populated categories and coarsely grouping the others in an unsupervised way"

Page 23:
"Most of the learning algorithms we were planning to use do not handle categorical variables, so we needed to recode them. This was done in a standard way, by generating indicator variables for the different values a categorical attribute could take. The only slightly non-standard decision was to limit ourselves to encoding only the 10 most common values of each categorical attribute, rather than all the values, in order to avoid an explosion in the number of features from variables with a huge vocabulary"

Page 36:
"We consolidate the extremely low populated entries (having fewer than 200 examples) with their neighbors to smooth out the outliers. Similarly, we group some categorical variables which have a large number of entries (> 1000 distinct values) into 100 categories."

See the bullet points on page 47.

These might provide some additional pointers / references.

@thomasjpfan thomasjpfan self-assigned this Nov 22, 2019
@rth
Member
rth commented Nov 22, 2019

I currently have a preliminary preprocessor (e.g. a CategoricalPreprocessor) for infrequent categories and NaN handling that I could contribute. Though it only works with dataframes (and relies heavily on pd.Categorical), so maybe it would be more a thing for sklearn-extra.

It could be worthwhile to have a separate estimator for this, even if we keep it private in the end, if only to share code between the OneHotEncoder and OrdinalEncoder?

I also agree with @glemaitre in #14953 (comment) that it would be good to have a meta issue on the behavior we want for categories in OHE. Right now there are multiple issues and PRs that pull it in different directions (with partially overlapping discussions).

@thomasjpfan
Member

@rth I really do want to see all the issues surrounding encoders resolved for 0.23, and I keep a list of all the encoder issues and the PRs revolving around them. However, I think placing all the issues into one meta issue may be kind of overwhelming. I think the issue around missing values has the most overlapping PRs and ideas.

After speaking with @amueller regarding infrequent categories, I propose this API:

Infrequent one-hot encoder

ohe = OneHotEncoder(sparse=False, handle_unknown='error', min_frequency=10)
X_train = np.array([['a'] * 5 + ['b'] * 20 + ['c'] * 15 + ['d'] * 2]).T
ohe.fit(X_train)

X_test = np.array([['a', 'b', 'c', 'd']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)  # errors because there is an unknown

Change handle_unknown to 'auto'

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
ohe.fit(X_train)

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[0, 0, 1],
#  [0, 0, 1]]  # unknown is mapped to the infrequent bin

No infrequent categories in training

ohe = OneHotEncoder(sparse=False, handle_unknown='auto', min_frequency=10)
X_train = np.array([['a'] * 11 + ['b'] * 20 + ['c'] * 15 + ['d'] * 12]).T
ohe.fit(X_train)

X_test = np.array([['a', 'e']]).T
ohe.transform(X_test)
# [[1, 0, 0, 0],
#  [0, 0, 0, 0]]  # the same as ignore

This API uses 'auto' to mean the following:

  1. If there are infrequent categories during fit, then during transform, unknown values are mapped to the infrequent bin.
  2. If there are no infrequent categories during fit, then during transform, unknown values will be encoded as all zeros. This is the same behavior as handle_unknown='ignore'.

With this 'auto', we can deprecate 'ignore'.

@jnothman
Member
jnothman commented Dec 4, 2019 via email

@rth
Member
rth commented Dec 4, 2019

I really do want to see all the issues surrounding encoders resolve for 0.23. I do keep a list of all the encoder issues and PRs revolving them.

It would definitely be good to have this addressed for the next release. I tried to write a summary in #15796, but you (and other people in this thread) have spent much more time on this, so don't hesitate to comment/edit if I missed or misunderstood something.

@catherinenelson1

I would love to see a min_frequency option in the OneHotEncoder and would use it frequently :-) Thanks for all the work on this; I'm looking forward to seeing it released.

@zas97
zas97 commented May 19, 2021

A hack to create a transformer that does exactly one-hot encoding with frequency filtering is to use

CountVectorizer(lowercase=False, min_df=0.01, max_df=1., token_pattern=r"^.*$")

where min_df is the minimum frequency and max_df is the maximum frequency. (Each column has to be passed separately as a 1-d sequence of strings, and categories below min_df are simply dropped from the vocabulary, i.e. encoded as all zeros, rather than grouped into an "other" category.)
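
A quick sketch of that hack on a single column (the data and the 6% threshold are made up):

from sklearn.feature_extraction.text import CountVectorizer

# One categorical column, passed as a 1-d sequence of strings.
col = ['a'] * 5 + ['b'] * 90 + ['c'] * 5

vec = CountVectorizer(lowercase=False, min_df=0.06, max_df=1.0,
                      token_pattern=r"^.*$")
X = vec.fit_transform(col)
sorted(vec.vocabulary_)  # ['b'] -- 'a' and 'c' fall below min_df
X[:1].toarray()          # [[0]] -- an infrequent 'a' encodes as all zeros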

@glmcdona
glmcdona commented Nov 29, 2021

This would be a really useful feature. @thomasjpfan, by any chance are you still working on getting this feature merged? Anything you'd like a hand with?

@thomasjpfan
Member

Currently, #16018 is basically complete and needs a second approval from a maintainer.
