Added UnaryEncoder. by adarob · Pull Request #3336 · scikit-learn/scikit-learn · GitHub
Added UnaryEncoder. #3336

Closed

wants to merge 1 commit into from

Conversation

@adarob commented Jul 2, 2014

No description provided.

@jnothman (Member) commented Jul 2, 2014

I have seen work that discretises features into buckets with this sort of scheme. Still, it's hard to explain without an example; could you put together an example comparing this and other categorical encodings?

I'm also not sure about the clarity of the name "unary", or is it used in the literature?

"""Encode natural number features using a unary scheme.

The input to this transformer should be a matrix of integers, denoting
the values taken on by natural number features, with a meaningful ordering.
@jnothman (Member) commented on this diff:

"the values taken on by natural number features" -- do you mean levels of categorical features?

Never mind, those are of course not ordered.

@adarob (Author) commented Jul 3, 2014

I'm not sure if there is an exact description for this in the literature,
so I don't really have a better suggestion for a name. Essentially what it
does is split a natural number feature into binary features so that each
signals whether the original feature is "greater than or equal to" some
number. An example of where this is useful would be if we wanted to use a
linear classifier to predict whether or not someone is pregnant and a
feature is the woman's age. In this case the response to the signal is
non-linear: the difference between age 25 and 30 is much smaller than the
difference between 40 and 45. By splitting the age into separate binary
features, we can still deal with this nonlinearity using a linear
classifier.
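The "greater than or equal to" splitting described above can be sketched in a few lines of NumPy. `unary_encode` is a hypothetical helper for illustration, not the API proposed in this PR:

```python
import numpy as np

def unary_encode(x, n_values):
    """Encode non-negative integer features as ">= k" indicator columns.

    Each value v is mapped to n_values - 1 binary features, where
    feature k (for k = 1 .. n_values - 1) is 1 iff v >= k.
    (Hypothetical helper; names and signature are illustrative only.)
    """
    x = np.asarray(x).reshape(-1, 1)
    thresholds = np.arange(1, n_values)  # 1, 2, ..., n_values - 1
    return (x >= thresholds).astype(int)

# Example: a feature quantized into 5 levels, 0..4
print(unary_encode([0, 2, 4], n_values=5))
# [[0 0 0 0]
#  [1 1 0 0]
#  [1 1 1 1]]
```

A linear model can then assign each threshold its own weight, which is what lets it capture the non-linear response adarob describes.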

A colleague (Jon Barron) has also provided me with a more "jargony"
explanation:

"This unary encoding is a reproducing kernel Hilbert space embedding for the
intersection kernel, which is an effective kernel for "histogram"-like
features. Instead of working in kernel space, it's simpler to just expand
the feature out into this unary encoding space and use a linear classifier,
which is exactly equivalent."
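The kernel claim in the quote is easy to check numerically: for non-negative integer counts a and b, the inner product of their unary encodings is min(a, b), which is the histogram intersection kernel. A small sketch (assuming the hypothetical `unary` helper below):

```python
import numpy as np

def unary(v, n):
    # ">= k" indicators for thresholds k = 1 .. n - 1 (illustrative helper)
    return (v >= np.arange(1, n)).astype(int)

# The dot product counts thresholds k with k <= a and k <= b,
# i.e. min(a, b): the intersection kernel on raw counts.
n = 10
for a in range(n):
    for b in range(n):
        assert unary(a, n) @ unary(b, n) == min(a, b)
```

So a linear classifier on the unary-encoded features is equivalent to an intersection-kernel classifier on the original counts, as the quote states.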


@vene (Member) commented Jul 15, 2014

This seems related to what Introduction to Statistical Learning calls "step functions" (page 268). I see the connection to the kernel space that motivates such an encoding, but I also wonder whether binning would not accomplish the same thing while yielding a more interpretable model.

@jnothman (Member)

I think this is orthogonal to binning, but have not looked at ESL. It is
used to represent "I have counted this object at least this many times",
rather than binning's "in the order of this many times".
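The distinction jnothman draws can be made concrete: binning one-hot-encodes "in this bin", while the cumulative form encodes "at least this many". A sketch assuming hypothetical bin edges at 1, 5, and 10:

```python
import numpy as np

counts = np.array([0, 3, 7, 42])
edges = np.array([1, 5, 10])          # illustrative bin edges
bins = np.digitize(counts, edges)     # bin index per count

# Binning: one indicator per bin ("in the order of this many times")
one_hot = np.eye(len(edges) + 1, dtype=int)[bins]

# Cumulative / unary: one indicator per edge ("at least this many times")
cumulative = (counts[:, None] >= edges).astype(int)

print(one_hot)
print(cumulative)
```

The cumulative columns are nested (each implies the previous), which is exactly the "cumulative variants" construction quoted from Bansal and Klein below; the one-hot columns are mutually exclusive.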


@GaelVaroquaux (Member)

An example is needed to merge this (as well as documentation). It will probably clarify many of the questions about the use case.

@jnothman (Member) commented Aug 6, 2014

I have seen this sort of encoding used in semi-supervised approaches, where some statistics are collected on feature occurrences in an unlabelled dataset, and then a supervised system is trained with features along the lines of "in the unlabelled corpus, feature x present in this instance was seen at least k times", which are effectively quantized into bins and then unary-encoded.

I think finding a plurality of citations for this technique will be hard. But one example is in Bansal and Klein (ACL 2011), which states "For all features used, we add cumulative variants where indicators are fired for all count bins b' up to query count bin b."

I have seen somewhere (I wish I had a reference) a fairly general self-training-like approach which: trains a model on labelled data; runs it over unlabelled data; collects frequencies of feature activation (perhaps for a larger set of features) on the unlabelled data conditioned on the predicted label; stacks these binned, cumulatively-binarized (i.e. UnaryEncoded) conditional frequencies back onto the original feature space.

@jnothman (Member)

I just thought of this PR. If @adarob has no intention of completing it, it's open for someone to take over to add an example and tests. I note that we currently have no examples using OneHotEncoder.

@GaelVaroquaux (Member) commented Jul 29, 2015 via email

@jnothman (Member)

ah, naming... rename Binarizer to Binner or Thresholder then merge this with OneHotEncoder as Binarizer with a mode parameter? Not being serious... but naming is hard.

@qinhanmin2014 (Member)

Closed in favor of #8652, thanks @adarob

6 participants