[MRG] Unary encoder -- continued by NicolasHug · Pull Request #12893 · scikit-learn/scikit-learn · GitHub

[MRG] Unary encoder -- continued #12893


Closed
Changes from all commits
37 commits
38537ed
Rebase and commit the OrdinalEncoder implementation
arjunjauhari Mar 28, 2017
a850082
Updating name to UnaryEncoder and adding single quote in error string
arjunjauhari Jun 24, 2017
661951c
Merged changes from #9216
ruxandraburtica Jun 25, 2017
eb8bc94
Removing active_features_ attribute from UnaryEncoder as it is not ne…
arjunjauhari Nov 29, 2017
88d5eb4
Limiting the lines in documentation to less than 80 chars
arjunjauhari Nov 29, 2017
cd21cbf
Updated documentation. Changed the default value of sparse parameter …
arjunjauhari Nov 29, 2017
81af018
Updated test cases to accommodate change in default value of sparse pa…
arjunjauhari Nov 30, 2017
f4ba310
Commit to accommodate all the requested changes
arjunjauhari Dec 1, 2017
0706c29
Fixing DocTestFailure
arjunjauhari Dec 1, 2017
b642a7e
Refactoring the code. Now fit_transform is equivalent to fit +
arjunjauhari Dec 3, 2017
367dba4
Minor change in mask calculation.
arjunjauhari Dec 4, 2017
9f3205d
Adding warn as a new option for handle_greater parameter.
arjunjauhari Dec 4, 2017
c23ec8d
Updating test case to take care of new handle_greater='warn' as default
arjunjauhari Dec 4, 2017
9d4753a
Fixing concerns.
arjunjauhari Dec 5, 2017
a43dfb5
Resolved conflicts with master
arjunjauhari Dec 19, 2017
7cbf733
Merge branch 'master' into unary_encoder_continued
NicolasHug Dec 29, 2018
e2a01bb
some doc modif, TBC
NicolasHug Dec 29, 2018
fc6a9af
removed ordinal_features
NicolasHug Dec 29, 2018
645f6a3
Removed calls to _transform_selected
NicolasHug Dec 29, 2018
4691026
Addressed Joris' comments
NicolasHug Dec 29, 2018
35cbbe4
Added inverse transform
NicolasHug Dec 30, 2018
b4609fe
Updated user guide
NicolasHug Dec 30, 2018
98c0e5a
Added whatsnew
NicolasHug Dec 30, 2018
7886fb7
Removed from __all__ in preprocessing.data
NicolasHug Dec 30, 2018
e96b438
Fixed typo
NicolasHug Dec 30, 2018
4d690e9
Added @ruxandraburtica as author
NicolasHug Dec 31, 2018
6d4cb04
renamed n_values into categories
NicolasHug Jan 4, 2019
d5ec004
Added example for combining with OrdinalEncoder in user guide
NicolasHug Jan 4, 2019
b56ef10
Merge branch 'master' into unary_encoder_continued
NicolasHug Jan 4, 2019
e1f2f3f
Removed six usage
NicolasHug Jan 4, 2019
54861d2
Addressed most comments from Joel
NicolasHug Jan 7, 2019
192e042
typos
NicolasHug Jan 7, 2019
57538f0
inverse transform now accepts non-binary input
NicolasHug Jan 8, 2019
7af673b
newline EOF
NicolasHug Jan 8, 2019
e2c90dd
Added section in OrdinalEncoder user guide for specifying categories
NicolasHug Jan 10, 2019
5d69bf9
changed categories param to max_value
NicolasHug Jan 14, 2019
5350ba8
Addressed comments
NicolasHug Jan 15, 2019
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1254,6 +1254,7 @@ Model validation
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.UnaryEncoder
preprocessing.PolynomialFeatures
preprocessing.PowerTransformer
preprocessing.QuantileTransformer
83 changes: 81 additions & 2 deletions doc/modules/preprocessing.rst
@@ -451,6 +451,10 @@ The normalizer instance can then be used on sample vectors as any transformer::

Encoding categorical features
=============================

Ordinal encoding
----------------

Often features are not given as continuous values but categorical.
For example a person could have features ``["male", "female"]``,
``["from Europe", "from US", "from Asia"]``,
@@ -471,11 +475,22 @@ new feature of integers (0 to n_categories - 1)::
>>> enc.transform([['female', 'from US', 'uses Safari']])
array([[0., 1., 1.]])

You can specify the order of the categories by passing the ``categories``
parameter::

>>> enc = preprocessing.OrdinalEncoder(categories=[['big', 'small'],
... ['short', 'tall']])
>>> X = [['big', 'tall']]
>>> enc.fit_transform(X) # doctest: +ELLIPSIS
array([[0., 1.]])

Such integer representation can, however, not be used directly with all
scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

One-hot encoding
----------------

Another possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K, also known as one-hot or
dummy encoding.
@@ -539,9 +554,73 @@ columns for this feature will be all zeros
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])

See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

.. _unary_encoding:

Unary encoding
--------------

For some ordinal features, it does not necessarily make sense to use
:class:`OrdinalEncoder` if the difference between the ordered categories is
uneven, for example with a feature that takes values in "very short",
"short", "big".

Member Author (review comment on the paragraph above):

I am not sure about how to motivate this encoder properly. Please feel free to suggest anything better.

Member:

I don't like "if the difference between ordered categories is uneven".

How about

Ordinal variables can also be represented in unary, where one feature is created for each possible value, as in one-hot encoding, but several features are active at a time (1, 3 and 5 would be encoded as row vectors [1, 0, 0, 0, 0], [1, 1, 1, 0, 0] and [1, 1, 1, 1, 1] respectively, for instance).

For generalized linear models, a one-hot encoding requires learning a weight for each value of an ordinal variable; an ordinal encoding learns a single weight for the variable, assuming a smooth (e.g. linear) relationship between the variable and target. Unary encoding is not strictly more expressive than one-hot, but allows the model to learn the non-linear effect of increasing an ordinal variable with fewer large or non-zero parameters. For similar reasons, unary may be a worthwhile encoding for linear modelling of discretized variables.

Member:

Maybe an example of an ordinal variable would then be helpful.

For such features, it is possible to use a unary encoding, which is
implemented in :class:`UnaryEncoder`. This encoder transforms each ordinal
feature with ``m`` possible values into ``m - 1`` binary features, where the
i-th feature is active if ``x > i``. For example::

>>> enc = preprocessing.UnaryEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
max_value='auto', sparse=False)
>>> enc.transform([[0, 1, 3]])
array([[0., 1., 0., 1., 1., 1.]])

Here the first feature with 2 possible values is transformed into 1 column,
the second feature with 3 values into 2 columns, and the third feature with 4
values into 3 columns.
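
The same rule can be reproduced with plain NumPy broadcasting, independently
of the estimator; ``unary_column`` below is only an illustrative helper, not
part of this PR::

>>> import numpy as np
>>> def unary_column(x, n_values):
...     # one feature: n_values - 1 binary columns, column i is 1 when x > i
...     return (np.asarray(x).reshape(-1, 1)
...             > np.arange(n_values - 1)).astype(float)
>>> unary_column([0, 1, 3], n_values=4)  # a feature with 4 possible values
array([[0., 0., 0.],
       [1., 0., 0.],
       [1., 1., 1.]])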

By default, the number of values a feature can take is inferred automatically
from the dataset by looking at the maximum value. It is possible to specify
it explicitly with the parameter ``max_value``. In particular, if the training
data might not contain all the possible values of a feature, ``max_value`` has
to be set explicitly. For example::

>>> enc = preprocessing.UnaryEncoder(max_value=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
max_value=[2, 3, 4], sparse=False)
>>> enc.transform([[1, 1, 2]])
array([[1., 1., 0., 1., 1., 0.]])
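
The same output can be computed by hand for this example, which makes the role
of ``max_value`` explicit (a feature with ``max_value`` equal to ``k`` gets
``k - 1`` binary columns); again plain NumPy, shown only for illustration::

>>> import numpy as np
>>> max_value = [2, 3, 4]
>>> row = [1, 1, 2]
>>> np.concatenate([(v > np.arange(k - 1)).astype(float)
...                 for v, k in zip(row, max_value)])
array([1., 1., 0., 1., 1., 0.])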

.. note::

This encoding is likely to help when used with linear models and
kernel-based models like SVMs with the standard kernels. On the other
hand, this transformation is unlikely to help when using tree-based
models, since those already work on the basis of a particular feature
value being smaller or larger than a threshold.

If the input feature is not already represented as a number from 0 to
``max_value``, it is possible to combine :class:`UnaryEncoder` and
:class:`OrdinalEncoder` into a :class:`Pipeline <sklearn.pipeline.Pipeline>`
like so::

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import OrdinalEncoder, UnaryEncoder
>>> categories = [['small', 'medium', 'huge']]
>>> pipeline = make_pipeline(OrdinalEncoder(categories), UnaryEncoder())
>>> X = [['small'], ['medium'], ['huge']]
>>> pipeline.fit_transform(X)
array([[0., 0.],
[1., 0.],
[1., 1.]])
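
To see what this pipeline does step by step, the intermediate ordinal codes
can be inspected with :class:`OrdinalEncoder` alone; the broadcasting line
below stands in for the proposed :class:`UnaryEncoder` and is only a sketch::

>>> import numpy as np
>>> codes = OrdinalEncoder(categories=categories).fit_transform(X)
>>> codes
array([[0.],
       [1.],
       [2.]])
>>> (codes > np.arange(2)).astype(float)  # 3 ordered values -> 2 binary columns
array([[0., 0.],
       [1., 0.],
       [1., 1.]])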

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

.. _preprocessing_discretization:

5 changes: 5 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -183,6 +183,11 @@ Support for Python 3.4 and below has been officially dropped.
in the dense case. Also added a new parameter ``order`` which controls output
order for further speed performances. :issue:`12251` by `Tom Dupre la Tour`_.

- |Feature| Added a new encoder :class:`preprocessing.UnaryEncoder`, useful
for ordinal features with uneven differences between categories.
:issue:`12893` by :user:`Ruxandra Burtica <ruxandraburtica>`, :user:`Arjun
Jauhari <arjunjauhari>` and :user:`Nicolas Hug <NicolasHug>`.

:mod:`sklearn.tree`
...................
- |Feature| Decision Trees can now be plotted with matplotlib using
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -27,6 +27,7 @@

from ._encoders import OneHotEncoder
from ._encoders import OrdinalEncoder
from ._encoders import UnaryEncoder

from .label import label_binarize
from .label import LabelBinarizer
@@ -53,6 +54,7 @@
'Normalizer',
'OneHotEncoder',
'OrdinalEncoder',
'UnaryEncoder',
'PowerTransformer',
'RobustScaler',
'StandardScaler',