[MRG] Unary encoder -- continued by NicolasHug · Pull Request #12893 · scikit-learn/scikit-learn · GitHub

[MRG] Unary encoder -- continued #12893


Closed
Changes from all commits
37 commits
38537ed
Rebase and commit the OrdinalEncoder implementation
arjunjauhari Mar 28, 2017
a850082
Updating name to UnaryEncoder and adding single quote in error string
arjunjauhari Jun 24, 2017
661951c
Merged changes from #9216
ruxandraburtica Jun 25, 2017
eb8bc94
Removing active_features_ attribute from UnaryEncoder as it is not ne…
arjunjauhari Nov 29, 2017
88d5eb4
Limiting the lines in documentation to less than 80 chars
arjunjauhari Nov 29, 2017
cd21cbf
Updated documentation. Changed the default value of sparse parameter …
arjunjauhari Nov 29, 2017
81af018
Updated test cases to accommodate change in default value of sparse pa…
arjunjauhari Nov 30, 2017
f4ba310
Commit to accommodate all the requested changes
arjunjauhari Dec 1, 2017
0706c29
Fixing DocTestFailure
arjunjauhari Dec 1, 2017
b642a7e
Refactoring the code. Now fit_transform is equivalent to fit +
arjunjauhari Dec 3, 2017
367dba4
Minor change in mask calculation.
arjunjauhari Dec 4, 2017
9f3205d
Adding warn as a new option for handle_greater parameter.
arjunjauhari Dec 4, 2017
c23ec8d
Updating test case to take care of new handle_greater='warn' as default
arjunjauhari Dec 4, 2017
9d4753a
Fixing concerns.
arjunjauhari Dec 5, 2017
a43dfb5
Resolved conflicts with master
arjunjauhari Dec 19, 2017
7cbf733
Merge branch 'master' into unary_encoder_continued
NicolasHug Dec 29, 2018
e2a01bb
some doc modif, TBC
NicolasHug Dec 29, 2018
fc6a9af
removed ordinal_features
NicolasHug Dec 29, 2018
645f6a3
Removed calls to _transform_selected
NicolasHug Dec 29, 2018
4691026
Addressed Joris' comments
NicolasHug Dec 29, 2018
35cbbe4
Added inverse transform
NicolasHug Dec 30, 2018
b4609fe
Updated user guide
NicolasHug Dec 30, 2018
98c0e5a
Added whatsnew
NicolasHug Dec 30, 2018
7886fb7
Removed from __all__ in preprocessing.data
NicolasHug Dec 30, 2018
e96b438
Fixed typo
NicolasHug Dec 30, 2018
4d690e9
Added @ruxandraburtica as author
NicolasHug Dec 31, 2018
6d4cb04
renamed n_values into categories
NicolasHug Jan 4, 2019
d5ec004
Added example for combining with OrdinalEncoder in user guide
NicolasHug Jan 4, 2019
b56ef10
Merge branch 'master' into unary_encoder_continued
NicolasHug Jan 4, 2019
e1f2f3f
Removed six usage
NicolasHug Jan 4, 2019
54861d2
Addressed most comments from Joel
NicolasHug Jan 7, 2019
192e042
typos
NicolasHug Jan 7, 2019
57538f0
inverse transform now accepts non-binary input
NicolasHug Jan 8, 2019
7af673b
newline EOF
NicolasHug Jan 8, 2019
e2c90dd
Added section in OrdinalEncoder user guide for specifying categories
NicolasHug Jan 10, 2019
5d69bf9
changed categories param to max_value
NicolasHug Jan 14, 2019
5350ba8
Addressed comments
NicolasHug Jan 15, 2019
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1254,6 +1254,7 @@ Model validation
preprocessing.Normalizer
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.UnaryEncoder
preprocessing.PolynomialFeatures
preprocessing.PowerTransformer
preprocessing.QuantileTransformer
83 changes: 81 additions & 2 deletions doc/modules/preprocessing.rst
@@ -451,6 +451,10 @@ The normalizer instance can then be used on sample vectors as any transformer::

Encoding categorical features
=============================

Ordinal encoding
----------------

Often features are not given as continuous values but categorical.
For example a person could have features ``["male", "female"]``,
``["from Europe", "from US", "from Asia"]``,
@@ -471,11 +475,22 @@ new feature of integers (0 to n_categories - 1)::
>>> enc.transform([['female', 'from US', 'uses Safari']])
array([[0., 1., 1.]])

You can specify the order of the categories by passing the ``categories``
parameter::

>>> enc = preprocessing.OrdinalEncoder(categories=[['big', 'small'],
... ['short', 'tall']])
>>> X = [['big', 'tall']]
>>> enc.fit_transform(X) # doctest: +ELLIPSIS
array([[0., 1.]])

Such integer representation can, however, not be used directly with all
scikit-learn estimators, as these expect continuous input, and would interpret
the categories as being ordered, which is often not desired (i.e. the set of
browsers was ordered arbitrarily).

One-hot encoding
----------------

Another possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K, also known as one-hot or
dummy encoding.
@@ -539,9 +554,73 @@ columns for this feature will be all zeros
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])

See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

.. _unary_encoding:

Unary encoding
--------------

For some ordinal features, it does not necessarily make sense to use
:class:`OrdinalEncoder` if the difference between the ordered categories is
uneven, for example with a feature that takes values in "very short",
"short", "big".

Member Author (review comment on the paragraph above):

I am not sure about how to motivate this encoder properly. Please feel free to suggest anything better.

Member:

I don't like "if the difference between ordered categories is uneven".

How about

Ordinal variables can also be represented in unary, where one feature is created for each possible value, as in one-hot encoding, but several features are active at a time (1, 3 and 5 would be encoded as row vectors [1, 0, 0, 0, 0], [1, 1, 1, 0, 0] and [1, 1, 1, 1, 1] respectively, for instance).

For generalized linear models, a one-hot encoding requires learning a weight for each value of an ordinal variable; an ordinal encoding learns a single weight for the variable, assuming a smooth (e.g. linear) relationship between the variable and target. Unary encoding is not strictly more expressive than one-hot, but allows the model to learn the non-linear effect of increasing an ordinal variable with fewer large or non-zero parameters. For similar reasons, unary may be a worthwhile encoding for linear modelling of discretized variables.

Member:

Maybe an example of an ordinal variable would then be helpful.

For such features, it is possible to use a unary encoding, which is
implemented in :class:`UnaryEncoder`. This encoder transforms each ordinal
feature with ``m`` possible values into ``m - 1`` binary features, where the
i-th feature is active if ``x > i``. For example::

>>> enc = preprocessing.UnaryEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
max_value='auto', sparse=False)
>>> enc.transform([[0, 1, 3]])
array([[0., 1., 0., 1., 1., 1.]])

Here the first feature with 2 possible values is transformed into 1 column,
the second feature with 3 values into 2 columns, and the third feature with 4
values into 3 columns.
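
The same rule can be reproduced with plain NumPy broadcasting, independently
of the estimator; ``unary_column`` below is only an illustrative helper, not
part of this PR::

>>> import numpy as np
>>> def unary_column(x, n_values):
...     # one feature: n_values - 1 binary columns, column i is 1 when x > i
...     return (np.asarray(x).reshape(-1, 1)
...             > np.arange(n_values - 1)).astype(float)
>>> unary_column([0, 1, 3], n_values=4)  # a feature with 4 possible values
array([[0., 0., 0.],
       [1., 0., 0.],
       [1., 1., 1.]])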

By default, the number of values a feature can take is inferred automatically
from the dataset by looking at the maximum value. It is possible to specify
it explicitly with the parameter ``max_value``. In particular, if the training
data might not contain all the possible values of a feature, ``max_value`` has
to be set explicitly. For example::

>>> enc = preprocessing.UnaryEncoder(max_value=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
UnaryEncoder(dtype=<... 'numpy.float64'>, handle_greater='warn',
max_value=[2, 3, 4], sparse=False)
>>> enc.transform([[1, 1, 2]])
array([[1., 1., 0., 1., 1., 0.]])
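
The same output can be computed by hand for this example, which makes the role
of ``max_value`` explicit (a feature with ``max_value`` equal to ``k`` gets
``k - 1`` binary columns); again plain NumPy, shown only for illustration::

>>> import numpy as np
>>> max_value = [2, 3, 4]
>>> row = [1, 1, 2]
>>> np.concatenate([(v > np.arange(k - 1)).astype(float)
...                 for v, k in zip(row, max_value)])
array([1., 1., 0., 1., 1., 0.])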

.. note::

This encoding is likely to help when used with linear models and
kernel-based models like SVMs with the standard kernels. On the other
hand, this transformation is unlikely to help when using tree-based
models, since those already work on the basis of a particular feature
value being smaller or larger than a threshold.

If the input feature is not already represented as a number from 0 to
``max_value``, it is possible to combine :class:`UnaryEncoder` and
:class:`OrdinalEncoder` into a :class:`Pipeline <sklearn.pipeline.Pipeline>`
like so::

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import OrdinalEncoder, UnaryEncoder
>>> categories = [['small', 'medium', 'huge']]
>>> pipeline = make_pipeline(OrdinalEncoder(categories), UnaryEncoder())
>>> X = [['small'], ['medium'], ['huge']]
>>> pipeline.fit_transform(X)
array([[0., 0.],
[1., 0.],
[1., 1.]])
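
To see what this pipeline does step by step, the intermediate ordinal codes
can be inspected with :class:`OrdinalEncoder` alone; the broadcasting line
below stands in for the proposed :class:`UnaryEncoder` and is only a sketch::

>>> import numpy as np
>>> codes = OrdinalEncoder(categories=categories).fit_transform(X)
>>> codes
array([[0.],
       [1.],
       [2.]])
>>> (codes > np.arange(2)).astype(float)  # 3 ordered values -> 2 binary columns
array([[0., 0.],
       [1., 0.],
       [1., 1.]])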

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

.. _preprocessing_discretization:

5 changes: 5 additions & 0 deletions doc/whats_new/v0.21.rst
@@ -183,6 +183,11 @@ Support for Python 3.4 and below has been officially dropped.
in the dense case. Also added a new parameter ``order`` which controls output
order for further speed performances. :issue:`12251` by `Tom Dupre la Tour`_.

- |Feature| Added a new encoder :class:`preprocessing.UnaryEncoder`, useful
for ordinal features with uneven differences between categories.
:issue:`12893` by :user:`Ruxandra Burtica <ruxandraburtica>`, :user:`Arjun
Jauhari <arjunjauhari>` and :user:`Nicolas Hug <NicolasHug>`.

:mod:`sklearn.tree`
...................
- |Feature| Decision Trees can now be plotted with matplotlib using
2 changes: 2 additions & 0 deletions sklearn/preprocessing/__init__.py
@@ -27,6 +27,7 @@

from ._encoders import OneHotEncoder
from ._encoders import OrdinalEncoder
from ._encoders import UnaryEncoder

from .label import label_binarize
from .label import LabelBinarizer
@@ -53,6 +54,7 @@
'Normalizer',
'OneHotEncoder',
'OrdinalEncoder',
'UnaryEncoder',
'PowerTransformer',
'RobustScaler',
'StandardScaler',