Metadata for encoders #166
How would you do that with scikit-learn / how do you want to implement this?
I thought that there was some standardized way, but I did not find any reference. Options:
The easy / hackish thing to do would be inspecting the signature and seeing whether y is a required argument.
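For illustration, a minimal sketch of that signature-inspection heuristic; it only works if supervised encoders actually declare y without a default, which is an assumption rather than a guarantee, and the `DemoSupervisedEncoder` stub below is hypothetical:

```python
import inspect

from sklearn.preprocessing import OneHotEncoder


def requires_y(estimator_class):
    """Heuristic: does fit() declare `y` as a required (no-default) parameter?"""
    params = inspect.signature(estimator_class.fit).parameters
    y = params.get("y")
    return y is not None and y.default is inspect.Parameter.empty


class DemoSupervisedEncoder:
    # Hypothetical encoder whose fit() requires a target.
    def fit(self, X, y):
        return self


print(requires_y(OneHotEncoder))          # False: fit(self, X, y=None)
print(requires_y(DemoSupervisedEncoder))  # True: y has no default
```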
Yes, but that doesn't tell you, for a transformer, whether it actually requires y.
In that case it looks like you are the right person to discuss it with. I am not picky about the names used or the exact mechanism by which it is implemented. The required functionality could be implemented with the following tags:
I used the term "target" instead of the more common "label" because "target" works well for both continuous and discrete dependent variables, while "label" is arguably appropriate only for discrete dependent variables.

While some of the encoders require the target to take values {0, 1}, I believe they should be refactored in some distant future to support targets like {'no', 'yes'}, {'negative', 'positive'}, or any other set of exactly two values. Hence, I prefer the more general and future-proof term "binomial" over "binary", which would suggest that the target takes only the values {0, 1}. But of course, the terminology used by auto-sklearn is nice as well.

I am not sure how to handle non-target encoders. Options:
Either way, the following tests in 'test_encoders' should not need to hard-code the names of the encoders anymore:
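To make that concrete, a hedged sketch of how such tags could drive the tests; the tag names ('supervised', 'target_type'), the `_get_tags()` hook, and the stub classes are illustrative assumptions modeled on scikit-learn, not the project's actual API:

```python
# Stubs for illustration only; the real encoders live in category_encoders,
# and the tag names below are assumptions, not an agreed-upon API.

class TargetEncoderStub:
    def _get_tags(self):
        return {"supervised": True, "target_type": "binomial"}


class OrdinalEncoderStub:
    def _get_tags(self):
        return {"supervised": False, "target_type": None}


ALL_ENCODERS = [TargetEncoderStub, OrdinalEncoderStub]


def encoders_with_tag(key, value):
    """Select encoders by tag instead of hard-coding their class names."""
    return [cls for cls in ALL_ENCODERS if cls()._get_tags().get(key) == value]


# A test can now iterate over exactly the supervised encoders:
for cls in encoders_with_tag("supervised", True):
    print(cls.__name__)  # TargetEncoderStub
```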
There is also one more tag that could be useful:
This tag would be nice in the following test:
Another useful piece of metadata could be:
This could help, for example, in
Sorry for the slow reply. In sklearn we generally use the word "target", and I don't think we use "binary" to mean 0 and 1; we use it to mean any two-class classification problem. But you can also extend the encoders to multiclass pretty easily, right?
Good to know that we are aligned. The binary encoders can be extended to work on multiclass, but I would hesitate to call it easy. I have started to call target encoders "supervised" and non-target encoders "unsupervised". The advantage is that this terminology describes the encoders well and people are already familiar with the terms.
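For a sense of what that extension involves, here is a rough one-vs-rest sketch; this is an illustrative reduction, not category_encoders' implementation, and the function name is hypothetical:

```python
import numpy as np
import pandas as pd


def multiclass_target_encode(column, y):
    """Sketch: lift binary mean-encoding to multiclass via one-vs-rest.

    Produces one encoded column per class; smoothing, regularization, and
    leakage handling are deliberately omitted.
    """
    out = pd.DataFrame(index=column.index)
    for cls in np.unique(y):
        indicator = (y == cls).astype(float)        # one-vs-rest target
        class_means = indicator.groupby(column).mean()
        out[f"{column.name}_{cls}"] = column.map(class_means)
    return out


column = pd.Series(["a", "b", "a", "b"], name="city")
y = pd.Series(["red", "green", "red", "blue"])
print(multiclass_target_encode(column, y))
```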
It should be possible to programmatically differentiate between encoders that:
This information would be useful for:
Proposed implementation:
Like in scikit-learn.
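As a reference point, a minimal sketch of the scikit-learn tag mechanism this proposal would mirror: in scikit-learn versions before 1.6, estimators override `_more_tags()` and `BaseEstimator._get_tags()` merges the results. The 'requires_y' key is one of scikit-learn's real default tags, while applying it to an encoder like this is an assumption:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class SupervisedEncoder(TransformerMixin, BaseEstimator):
    """Sketch: an encoder declaring via tags that it needs the target."""

    def _more_tags(self):
        # 'requires_y' is a standard scikit-learn tag key (pre-1.6 tag system).
        return {"requires_y": True}

    def fit(self, X, y):
        return self

    def transform(self, X):
        return X


print(SupervisedEncoder()._get_tags()["requires_y"])  # True
```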