Metadata for encoders · Issue #166 · scikit-learn-contrib/category_encoders · GitHub
Metadata for encoders #166


Open
janmotl opened this issue Jan 5, 2019 · 9 comments
Comments

@janmotl
Collaborator
janmotl commented Jan 5, 2019

It should be possible to programmatically differentiate between encoders that:

  1. Do not require any target during fitting (like OneHotEncoder).
  2. Require some target during fitting (like TargetEncoder).
  3. Require binary target during fitting (like WOE).

This information would be useful for:

  1. Parameterized tests.
  2. Users who wonder which encoders they may use.

Proposed implementation:
Like in scikit-learn.
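As a rough illustration of what "like in scikit-learn" could mean here, each encoder could expose a small tag dictionary. This is only a sketch: the method name `_get_tags` and the tag keys `requires_target` and `binomial_target_only` are hypothetical, and the classes below are stand-ins for the real encoders.

```python
# Hypothetical sketch: scikit-learn-style tags on category_encoders classes.
# The tag names and the _get_tags method are illustrative, not an existing API.

class OneHotEncoder:
    """Stand-in for an encoder that needs no target during fitting."""
    def _get_tags(self):
        return {"requires_target": False, "binomial_target_only": False}

class TargetEncoder:
    """Stand-in for an encoder that needs some target during fitting."""
    def _get_tags(self):
        return {"requires_target": True, "binomial_target_only": False}

class WOEEncoder:
    """Stand-in for an encoder restricted to two-class targets."""
    def _get_tags(self):
        return {"requires_target": True, "binomial_target_only": True}

# A parameterized test could then select encoders by tag instead of by name:
encoders = [OneHotEncoder(), TargetEncoder(), WOEEncoder()]
supervised = [e for e in encoders if e._get_tags()["requires_target"]]
```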

@amueller
Member
amueller commented Feb 5, 2019

How would you do that with scikit-learn / how do you want to implement this?

Collaborator Author
janmotl commented Feb 6, 2019

I thought there was some standardized way, but I did not find any reference. Options:

  1. Create a few abstract classes; the inheritance hierarchy would then reveal the properties of each encoder.
  2. The encoders could implement a couple of methods like 'accepts_continuous_target()' and 'accepts_binary_target()', or implement a single method like supportsCapability in RapidMiner.
  3. The encoders could have attributes like 'accepts_continuous_target' and 'accepts_binary_target'.
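Option 2 might look roughly like the following; the `supports` method and the capability strings are hypothetical names, loosely modeled on RapidMiner's supportsCapability:

```python
# Sketch of option 2: a single capability-query method on a shared base class.
# The method name and the capability strings are illustrative only.

class BaseEncoder:
    def supports(self, capability):
        return capability in self._capabilities()

    def _capabilities(self):
        # Unsupervised encoders advertise no target-related capabilities.
        return set()

class TargetEncoder(BaseEncoder):
    def _capabilities(self):
        return {"continuous_target", "binary_target"}

class WOEEncoder(BaseEncoder):
    def _capabilities(self):
        return {"binary_target"}
```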

@amueller
Member
amueller commented Feb 6, 2019

The easy / hackish thing to do would be to inspect the signature and see whether y is a required argument.
There will hopefully soon be estimator tags in sklearn which will allow you to specify this kind of information, but they are not really standardized yet.
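The signature-inspection idea can be sketched with the standard library's `inspect` module; the two fit functions below are toy stand-ins for encoder methods:

```python
import inspect

def fit_unsupervised(self, X, y=None):
    """Toy stand-in: y is optional, as in a one-hot encoder's fit."""

def fit_supervised(self, X, y):
    """Toy stand-in: y is required, as in a target encoder's fit."""

def requires_y(fit_method):
    # Treat y as "required" when its parameter has no default value.
    y_param = inspect.signature(fit_method).parameters["y"]
    return y_param.default is inspect.Parameter.empty

print(requires_y(fit_unsupervised))  # False
print(requires_y(fit_supervised))    # True
```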

@janmotl
Collaborator Author
janmotl commented Feb 6, 2019

@amueller
Member
amueller commented Feb 6, 2019

Yes, but that doesn't tell you for a transformation whether it requires y to fit (I probably wrote that section).

@janmotl
Collaborator Author
janmotl commented Feb 7, 2019

In that case it looks like you are the right person to discuss it.

I am not picky about the names or the exact mechanism by which it is implemented. The required functionality could be implemented with the following tags:

  • supports continuous target
  • supports binomial target

I used the term "target" instead of the more common "label", because "target" works well for both continuous and discrete dependent variables, while "label" is arguably appropriate only for discrete dependent variables.

While some of the encoders require the target to take values {0,1}, I believe that they should be refactored at some point in the future to support targets like {'no', 'yes'}, {'negative', 'positive'} or any other set of exactly two values. Hence, I prefer the more general and future-proof term "binomial" over "binary", which would suggest that the target takes only the values {0,1}.
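Under that reading, a "binomial" target is simply one that takes exactly two distinct values, whatever they are; a minimal check could look like this (the function name is hypothetical):

```python
def is_binomial_target(y):
    # Two distinct values qualify, regardless of whether they are
    # {0, 1}, {'no', 'yes'}, or {'negative', 'positive'}.
    return len(set(y)) == 2

print(is_binomial_target([0, 1, 0, 1]))         # True
print(is_binomial_target(["no", "yes", "no"]))  # True
print(is_binomial_target([0.2, 1.5, 0.7]))      # False
```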

But of course, the terminology taken by auto-sklearn is nice as well.

I am not sure how to handle non-target encoders. Options:

  1. Introduce 'is_target_encoder' and 'is_encoder' as analogies to 'is_classifier' and 'is_regressor'.
  2. Return false for both tags.
  3. Something else entirely.
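Option 1 could be sketched as follows; the `_encoder_type` attribute and the helper names are hypothetical, chosen by analogy to sklearn's is_classifier / is_regressor:

```python
# Hypothetical helper predicates for distinguishing encoder kinds.
# The _encoder_type attribute is illustrative, not an existing API.

def is_encoder(estimator):
    return getattr(estimator, "_encoder_type", None) in ("supervised", "unsupervised")

def is_target_encoder(estimator):
    return getattr(estimator, "_encoder_type", None) == "supervised"

class OneHotEncoder:
    _encoder_type = "unsupervised"  # no target needed during fit

class TargetEncoder:
    _encoder_type = "supervised"    # target required during fit
```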

Either way, the following tests in 'test_encoders' should no longer need to hard-code encoder names:

  1. test_impact_encoders
  2. test_tmp_column_name
  3. test_unique_column_is_not_predictive
  4. test_get_feature_names
  5. test_get_feature_names_drop_invariant

There is also one more tag that could be useful:

  • supports unknown/new target values in the test set

This tag would be nice in the following test:

  1. test_handle_unknown_error

@janmotl
Collaborator Author
janmotl commented Feb 11, 2019

Another useful piece of metadata could be:

  • supports inverse transform

This could help, for example, in test_inverse_transform.
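For instance, test_inverse_transform could filter on such a tag instead of hard-coding encoder names; the attribute name below is hypothetical, and the classes are stand-ins for the real encoders:

```python
# Sketch: select only encoders that declare a (hypothetical)
# supports_inverse_transform tag.

class BinaryEncoder:
    supports_inverse_transform = True   # stand-in: mapping is reversible

class HashingEncoder:
    supports_inverse_transform = False  # stand-in: hashing loses information

def encoders_with_inverse(encoders):
    # Default to False so encoders that lack the tag are skipped safely.
    return [e for e in encoders if getattr(e, "supports_inverse_transform", False)]
```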

@amueller
Member

Sorry for the slow reply. In sklearn we generally use the word target, and I don't think we use "binary" to be 0 and 1, we use it to mean any two-class classification problem. But you can also extend the encoders to multiclass pretty easily, right?

@janmotl
Collaborator Author
janmotl commented Mar 13, 2019

Good to know that we are aligned. The binary encoders can be extended to work on multiclass. But I would hesitate to call it easy.

I have also started to call target encoders "supervised" and non-target encoders "unsupervised". The advantage is that this terminology describes the encoders well and people are already familiar with the terms.
