Added UnaryEncoder. #3336
Conversation
I have seen work that discretises features into buckets with this sort of scheme. Still, it's hard to explain without an example; could you put together an example comparing this and other categorical encodings? I'm also not sure about the clarity of the name "unary", or is it used in the literature?
"""Encode natural number features using a unary scheme. | ||
|
||
The input to this transformer should be a matrix of integers, denoting | ||
the values taken on by natural number features, with a meaningful ordering. |
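For concreteness, here is a minimal NumPy sketch of the kind of cumulative ("thermometer") encoding being proposed. The helper name and the `n_values` argument are purely illustrative, since the exact UnaryEncoder API is not shown in this excerpt:

```python
import numpy as np

def unary_encode(column, n_values):
    # Illustrative helper (not the PR's API): value v of an ordered integer
    # feature becomes n_values - 1 indicators, where indicator j fires
    # when v > j, i.e. the first v indicators are 1 and the rest are 0.
    column = np.asarray(column)
    thresholds = np.arange(n_values - 1)      # j = 0, 1, ..., n_values - 2
    return (column[:, None] > thresholds).astype(int)

X = np.array([0, 1, 3, 2])                    # one feature taking values in {0, 1, 2, 3}
print(unary_encode(X, n_values=4))
# [[0 0 0]
#  [1 0 0]
#  [1 1 1]
#  [1 1 0]]
```

Under such a scheme a linear model effectively learns a step function over the ordered values, since each indicator starts contributing once the value reaches its threshold.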
"the values taken on by natural number features" -- do you mean levels of categorical features?
Never mind, those are of course not ordered.
I'm not sure if there is an exact description for this in the literature. A colleague (Jon Barron) has also provided me with a more "jargony" one: "This unary encoding is reproducing a kernel hilbert space for the
This seems related to what Introduction to Statistical Learning calls "step functions" (page 268). I see the connection to the kernel space that motivates such an encoding, but I also wonder whether binning would not accomplish the same thing while leading to a more interpretable model.
I think this is orthogonal to binning, but have not looked at ESL.
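To make the comparison with binning concrete, here is a small illustrative sketch (plain NumPy; the bin assignments are made up for this example). Both encodings start from the same ordered bins: one-hot indicators correspond to disjoint step functions ("the value falls in bin b"), while the unary/cumulative encoding fires every indicator up to the observed bin ("the value reached at least bin b"):

```python
import numpy as np

bins = np.array([2, 0, 3, 1])                 # bin index per sample (4 ordered bins)

# One-hot: "the value falls in bin b" -- disjoint step-function indicators.
one_hot = (bins[:, None] == np.arange(4)).astype(int)
# Unary/cumulative: "the value reached at least bin b" for b = 1, 2, 3.
cumulative = (bins[:, None] >= np.arange(1, 4)).astype(int)

print(one_hot)
# [[0 0 1 0]
#  [1 0 0 0]
#  [0 0 0 1]
#  [0 1 0 0]]
print(cumulative)
# [[1 1 0]
#  [0 0 0]
#  [1 1 1]
#  [1 0 0]]
```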
An example is needed to merge this (as well as documentation). It will probably clarify many of the questions about the use case.
I have seen this sort of encoding used in semi-supervised approaches, where some statistics are collected on feature occurrences in an unlabelled dataset, and then a supervised system is trained with features along the lines of "in the unlabelled corpus, feature x present in this instance was seen at least k times", which are effectively quantized into bins and then unary-encoded. I think finding a plurality of citations for this technique will be hard, but one example is in Bansal and Klein (ACL 2011), which states "For all features used, we add cumulative variants where indicators are fired for all count bins b' up to query count bin b."

I have seen somewhere (I wish I had a reference) a fairly general self-training-like approach which:

- trains a model on labelled data;
- runs it over unlabelled data;
- collects frequencies of feature activation (perhaps for a larger set of features) on the unlabelled data, conditioned on the predicted label;
- stacks these binned, cumulatively-binarized (i.e. UnaryEncoded) conditional frequencies back onto the original feature space.
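As a rough sketch of the cumulative count-bin features described above (the raw counts and bin edges below are arbitrary, chosen only for illustration):

```python
import numpy as np

# How many times each feature was seen in the unlabelled corpus (made-up numbers).
counts = np.array([0, 3, 12, 250])
# Arbitrary bin edges: indicators mean "seen at least 1 / 5 / 10 / 100 times".
bin_edges = np.array([1, 5, 10, 100])

# Cumulative ("unary") indicators: every bin up to the observed count fires.
cumulative_bins = (counts[:, None] >= bin_edges).astype(int)
print(cumulative_bins)
# [[0 0 0 0]
#  [1 0 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

This is the same cumulative-indicator pattern as the proposed encoder, applied to quantized corpus counts.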
I just thought of this PR. If @adarob has no intention of completing it, it's open for someone to take over to add an example and testing. I note that we currently have no examples using
The name could indeed be improved.
ah, naming... rename