Added UnaryEncoder. by adarob · Pull Request #3336 · scikit-learn/scikit-learn · GitHub
Added UnaryEncoder. #3336

Closed

wants to merge 1 commit into from

Conversation

@adarob commented Jul 2, 2014

No description provided.

@jnothman (Member) commented Jul 2, 2014

I have seen work that discretises features into buckets with this sort of scheme. Still, it's hard to explain without an example; could you put together an example comparing this and other categorical encodings?

I'm also not sure about the clarity of the name "unary", or is it used in the literature?

"""Encode natural number features using a unary scheme.

The input to this transformer should be a matrix of integers, denoting
the values taken on by natural number features, with a meaningful ordering.
@jnothman (Member) commented on this diff:

"the values taken on by natural number features" -- do you mean levels of categorical features?

Never mind, those are of course not ordered.

@adarob (Author) commented Jul 3, 2014

I'm not sure if there is an exact description for this in the literature,
so I don't really have a better suggestion for a name. Essentially what it
does is split a natural number feature into binary features so that each
signals whether the original feature is "greater than or equal to" some
number. An example of where this is useful would be if we wanted to use a
linear classifier to predict whether or not someone is pregnant and a
feature is the woman's age. In this case the response to the signal is
non-linear: the difference between age 25 and 30 is much smaller than the
difference between 40 and 45. By splitting the age into separate binary
features, we can still deal with this nonlinearity using a linear
classifier.
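The "greater than or equal to" splitting described above can be sketched in a few lines of NumPy. `unary_encode` is a hypothetical helper for illustration, not the API proposed in this PR:

```python
import numpy as np

def unary_encode(x, n_values):
    """Encode non-negative integer features as ">= k" indicator columns.

    Each value v is mapped to n_values - 1 binary features, where
    feature k (for k = 1 .. n_values - 1) is 1 iff v >= k.
    (Hypothetical helper; names and signature are illustrative only.)
    """
    x = np.asarray(x).reshape(-1, 1)
    thresholds = np.arange(1, n_values)  # 1, 2, ..., n_values - 1
    return (x >= thresholds).astype(int)

# Example: a feature quantized into 5 levels, 0..4
print(unary_encode([0, 2, 4], n_values=5))
# [[0 0 0 0]
#  [1 1 0 0]
#  [1 1 1 1]]
```

A linear model can then assign each threshold its own weight, which is what lets it capture the non-linear response adarob describes.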

A colleague (Jon Barron) has also provided me with a more "jargony"
explanation:

"This unary encoding is a reproducing kernel Hilbert space embedding for the
intersection kernel, which is an effective kernel for "histogram"-like
features. Instead of working in kernel space, it's simpler to just expand
the feature out into this unary encoding space and use a linear classifier,
which is exactly equivalent."
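The kernel claim in the quote is easy to check numerically: for non-negative integer counts a and b, the inner product of their unary encodings is min(a, b), which is the histogram intersection kernel. A small sketch (assuming the hypothetical `unary` helper below):

```python
import numpy as np

def unary(v, n):
    # ">= k" indicators for thresholds k = 1 .. n - 1 (illustrative helper)
    return (v >= np.arange(1, n)).astype(int)

# The dot product counts thresholds k with k <= a and k <= b,
# i.e. min(a, b): the intersection kernel on raw counts.
n = 10
for a in range(n):
    for b in range(n):
        assert unary(a, n) @ unary(b, n) == min(a, b)
```

So a linear classifier on the unary-encoded features is equivalent to an intersection-kernel classifier on the original counts, as the quote states.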


@vene (Member) commented Jul 15, 2014

This seems related to what Introduction to Statistical Learning calls "step functions" (page 268). I see the connection to the kernel space that motivates such an encoding, but I also wonder whether binning would not accomplish the same thing while yielding a more interpretable model.

@jnothman (Member)

I think this is orthogonal to binning, but have not looked at ESL. It is
used to represent "I have counted this object at least this many times",
rather than binning's "in the order of this many times".
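The distinction jnothman draws can be made concrete: binning one-hot-encodes "in this bin", while the cumulative form encodes "at least this many". A sketch assuming hypothetical bin edges at 1, 5, and 10:

```python
import numpy as np

counts = np.array([0, 3, 7, 42])
edges = np.array([1, 5, 10])          # illustrative bin edges
bins = np.digitize(counts, edges)     # bin index per count

# Binning: one indicator per bin ("in the order of this many times")
one_hot = np.eye(len(edges) + 1, dtype=int)[bins]

# Cumulative / unary: one indicator per edge ("at least this many times")
cumulative = (counts[:, None] >= edges).astype(int)

print(one_hot)
print(cumulative)
```

The cumulative columns are nested (each implies the previous), which is exactly the "cumulative variants" construction quoted from Bansal and Klein below; the one-hot columns are mutually exclusive.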


@GaelVaroquaux (Member)

An example is needed to merge this (as well as documentation). It will probably clarify many of the questions about the use case.

@jnothman (Member) commented Aug 6, 2014

I have seen this sort of encoding used in semi-supervised approaches, where some statistics are collected on feature occurrences in an unlabelled dataset, and then a supervised system is trained with features along the lines of "in the unlabelled corpus, feature x present in this instance was seen at least k times", which are effectively quantized into bins and then unary-encoded.

I think finding a plurality of citations for this technique will be hard. But one example is in Bansal and Klein (ACL 2011), which states "For all features used, we add cumulative variants where indicators are fired for all count bins b' up to query count bin b."

I have seen somewhere (I wish I had a reference) a fairly general self-training-like approach which: trains a model on labelled data; runs it over unlabelled data; collects frequencies of feature activation (perhaps for a larger set of features) on the unlabelled data conditioned on the predicted label; stacks these binned, cumulatively-binarized (i.e. UnaryEncoded) conditional frequencies back onto the original feature space.

@jnothman (Member)

I just thought of this PR. If @adarob has no intention of completing it, it's open for someone to take over to add an example and tests. I note that we currently have no examples using OneHotEncoder.

@GaelVaroquaux (Member) commented Jul 29, 2015 via email

@jnothman (Member)

ah, naming... rename Binarizer to Binner or Thresholder then merge this with OneHotEncoder as Binarizer with a mode parameter? Not being serious... but naming is hard.

@qinhanmin2014 (Member)

Closed in favor of #8652, thanks @adarob

6 participants