Metadata for encoders · Issue #166 · scikit-learn-contrib/category_encoders · GitHub
Metadata for encoders #166


Open
janmotl opened this issue Jan 5, 2019 · 9 comments
Comments

@janmotl
Collaborator
janmotl commented Jan 5, 2019

It should be possible to programmatically differentiate between encoders that:

  1. Do not require any target during fitting (like OneHotEncoder).
  2. Require some target during fitting (like TargetEncoder).
  3. Require binary target during fitting (like WOE).

This information would be useful for:

  1. Parameterized tests.
  2. Users who wonder which encoders they may use.

Proposed implementation:
Like in scikit-learn.
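As a rough illustration of what "like in scikit-learn" could mean here, each encoder could expose a small tag dictionary. This is only a sketch: the method name `_get_tags` and the tag keys `requires_target` and `binomial_target_only` are hypothetical, and the classes below are stand-ins for the real encoders.

```python
# Hypothetical sketch: scikit-learn-style tags on category_encoders classes.
# The tag names and the _get_tags method are illustrative, not an existing API.

class OneHotEncoder:
    """Stand-in for an encoder that needs no target during fitting."""
    def _get_tags(self):
        return {"requires_target": False, "binomial_target_only": False}

class TargetEncoder:
    """Stand-in for an encoder that needs some target during fitting."""
    def _get_tags(self):
        return {"requires_target": True, "binomial_target_only": False}

class WOEEncoder:
    """Stand-in for an encoder restricted to two-class targets."""
    def _get_tags(self):
        return {"requires_target": True, "binomial_target_only": True}

# A parameterized test could then select encoders by tag instead of by name:
encoders = [OneHotEncoder(), TargetEncoder(), WOEEncoder()]
supervised = [e for e in encoders if e._get_tags()["requires_target"]]
```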

@amueller
Member
amueller commented Feb 5, 2019

How would you do that with scikit-learn / how do you want to implement this?

Collaborator Author
janmotl commented Feb 6, 2019

I thought there was some standardized way, but I did not find any reference. Options:

  1. Create a few abstract classes; the inheritance hierarchy would then reveal the properties of each encoder.
  2. The encoders could implement a couple of methods like 'accepts_continuous_target()' and 'accepts_binary_target()', or implement a single method like supportsCapability in RapidMiner.
  3. The encoders could have attributes like 'accepts_continuous_target' and 'accepts_binary_target'.
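Option 2 might look roughly like the following; the `supports` method and the capability strings are hypothetical names, loosely modeled on RapidMiner's supportsCapability:

```python
# Sketch of option 2: a single capability-query method on a shared base class.
# The method name and the capability strings are illustrative only.

class BaseEncoder:
    def supports(self, capability):
        return capability in self._capabilities()

    def _capabilities(self):
        # Unsupervised encoders advertise no target-related capabilities.
        return set()

class TargetEncoder(BaseEncoder):
    def _capabilities(self):
        return {"continuous_target", "binary_target"}

class WOEEncoder(BaseEncoder):
    def _capabilities(self):
        return {"binary_target"}
```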

@amueller
Member
amueller commented Feb 6, 2019

The easy / hackish thing to do would be to inspect the signature and see whether y is a required argument.
There will hopefully soon be estimator tags in sklearn which will allow you to specify this kind of information, but they are not really standardized yet.
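The signature-inspection idea can be sketched with the standard library's `inspect` module; the two fit functions below are toy stand-ins for encoder methods:

```python
import inspect

def fit_unsupervised(self, X, y=None):
    """Toy stand-in: y is optional, as in a one-hot encoder's fit."""

def fit_supervised(self, X, y):
    """Toy stand-in: y is required, as in a target encoder's fit."""

def requires_y(fit_method):
    # Treat y as "required" when its parameter has no default value.
    y_param = inspect.signature(fit_method).parameters["y"]
    return y_param.default is inspect.Parameter.empty

print(requires_y(fit_unsupervised))  # False
print(requires_y(fit_supervised))    # True
```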

@janmotl
Collaborator Author
janmotl commented Feb 6, 2019

@amueller
Member
amueller commented Feb 6, 2019

Yes, but that doesn't tell you for a transformation whether it requires y to fit (I probably wrote that section).

@janmotl
Collaborator Author
janmotl commented Feb 7, 2019

In that case it looks like you are the right person to discuss it.

I am not picky about the names or the exact mechanism by which it is implemented. The required functionality could be implemented with the following tags:

  • supports continuous target
  • supports binomial target

I used the term "target" instead of the more common "label", because "target" works well for both continuous and discrete dependent variables, while "label" is arguably appropriate only for discrete dependent variables.

While some of the encoders require the target to take values {0,1}, I believe that they should be refactored at some point in the future to support targets like {'no', 'yes'}, {'negative', 'positive'} or any other set of exactly two values. Hence, I prefer the more general and future-proof term "binomial" over "binary", which would suggest that the target takes only the values {0,1}.
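Under that reading, a "binomial" target is simply one that takes exactly two distinct values, whatever they are; a minimal check could look like this (the function name is hypothetical):

```python
def is_binomial_target(y):
    # Two distinct values qualify, regardless of whether they are
    # {0, 1}, {'no', 'yes'}, or {'negative', 'positive'}.
    return len(set(y)) == 2

print(is_binomial_target([0, 1, 0, 1]))         # True
print(is_binomial_target(["no", "yes", "no"]))  # True
print(is_binomial_target([0.2, 1.5, 0.7]))      # False
```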

But of course, the terminology taken by auto-sklearn is nice as well.

I am not sure how to handle non-target encoders. Options:

  1. Introduce 'is_target_encoder' and 'is_encoder' as analogies to 'is_classifier' and 'is_regressor'.
  2. Return false for both tags.
  3. Something else entirely.
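Option 1 could be sketched as follows; the `_encoder_type` attribute and the helper names are hypothetical, chosen by analogy to sklearn's is_classifier / is_regressor:

```python
# Hypothetical helper predicates for distinguishing encoder kinds.
# The _encoder_type attribute is illustrative, not an existing API.

def is_encoder(estimator):
    return getattr(estimator, "_encoder_type", None) in ("supervised", "unsupervised")

def is_target_encoder(estimator):
    return getattr(estimator, "_encoder_type", None) == "supervised"

class OneHotEncoder:
    _encoder_type = "unsupervised"  # no target needed during fit

class TargetEncoder:
    _encoder_type = "supervised"    # target required during fit
```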

Either way, the following tests in 'test_encoders' should no longer need to hard-code encoder names:

  1. test_impact_encoders
  2. test_tmp_column_name
  3. test_unique_column_is_not_predictive
  4. test_get_feature_names
  5. test_get_feature_names_drop_invariant

There is also one more tag that could be useful:

  • supports unknown/new target values in the test set

This tag would be nice in the following test:

  1. test_handle_unknown_error

@janmotl
Collaborator Author
janmotl commented Feb 11, 2019

Another useful piece of metadata could be:

  • supports inverse transform

This could help, for example, in test_inverse_transform.
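For instance, test_inverse_transform could filter on such a tag instead of hard-coding encoder names; the attribute name below is hypothetical, and the classes are stand-ins for the real encoders:

```python
# Sketch: select only encoders that declare a (hypothetical)
# supports_inverse_transform tag.

class BinaryEncoder:
    supports_inverse_transform = True   # stand-in: mapping is reversible

class HashingEncoder:
    supports_inverse_transform = False  # stand-in: hashing loses information

def encoders_with_inverse(encoders):
    # Default to False so encoders that lack the tag are skipped safely.
    return [e for e in encoders if getattr(e, "supports_inverse_transform", False)]
```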

@amueller
Member

Sorry for the slow reply. In sklearn we generally use the word target, and I don't think we use "binary" to be 0 and 1, we use it to mean any two-class classification problem. But you can also extend the encoders to multiclass pretty easily, right?

@janmotl
Collaborator Author
janmotl commented Mar 13, 2019

Good to know that we are aligned. The binary encoders can be extended to work on multiclass. But I would hesitate to call it easy.

I have also started to call target encoders "supervised" and non-target encoders "unsupervised". The advantage is that this terminology describes the encoders well and people are already familiar with the terms.
