ENH: new CategoricalEncoder class (#9151) · scikit-learn/scikit-learn@a2ebb8c · GitHub

Commit a2ebb8c

jorisvandenbossche authored and jnothman committed
ENH: new CategoricalEncoder class (#9151)
1 parent bbdcd70 commit a2ebb8c

File tree

12 files changed: +744 -46 lines


doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -1198,6 +1198,7 @@ Model validation
    preprocessing.MinMaxScaler
    preprocessing.Normalizer
    preprocessing.OneHotEncoder
+   preprocessing.CategoricalEncoder
    preprocessing.PolynomialFeatures
    preprocessing.QuantileTransformer
    preprocessing.RobustScaler

doc/modules/preprocessing.rst

Lines changed: 76 additions & 36 deletions
@@ -455,47 +455,87 @@ Such features can be efficiently coded as integers, for instance
 ``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
 ``[1, 2, 1]``.

-Such integer representation can not be used directly with scikit-learn estimators, as these
-expect continuous input, and would interpret the categories as being ordered, which is often
-not desired (i.e. the set of browsers was ordered arbitrarily).
-
-One possibility to convert categorical features to features that can be used
-with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
-implemented in :class:`OneHotEncoder`. This estimator transforms each
-categorical feature with ``m`` possible values into ``m`` binary features, with
-only one active.
+To convert categorical features to such integer codes, we can use the
+:class:`CategoricalEncoder`. When specifying that we want to perform an
+ordinal encoding, the estimator transforms each categorical feature to one
+new feature of integers (0 to n_categories - 1)::
+
+    >>> enc = preprocessing.CategoricalEncoder(encoding='ordinal')
+    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+    >>> enc.fit(X)  # doctest: +ELLIPSIS
+    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+              encoding='ordinal', handle_unknown='error')
+    >>> enc.transform([['female', 'from US', 'uses Safari']])
+    array([[ 0.,  1.,  1.]])
+
+Such integer representation can, however, not be used directly with all
+scikit-learn estimators, as these expect continuous input, and would interpret
+the categories as being ordered, which is often not desired (i.e. the set of
+browsers was ordered arbitrarily).
+
+Another possibility to convert categorical features to features that can be used
+with scikit-learn estimators is to use a one-of-K, also known as one-hot or
+dummy encoding.
+This type of encoding is the default behaviour of the :class:`CategoricalEncoder`.
+The :class:`CategoricalEncoder` then transforms each categorical feature with
+``n_categories`` possible values into ``n_categories`` binary features, with
+one of them 1, and all others 0.

 Continuing the example above::

-    >>> enc = preprocessing.OneHotEncoder()
-    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # doctest: +ELLIPSIS
-    OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-           handle_unknown='error', n_values='auto', sparse=True)
-    >>> enc.transform([[0, 1, 3]]).toarray()
-    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
-
-By default, how many values each feature can take is inferred automatically from the dataset.
-It is possible to specify this explicitly using the parameter ``n_values``.
-There are two genders, three possible continents and four web browsers in our
-dataset.
-Then we fit the estimator, and transform a data point.
-In the result, the first two numbers encode the gender, the next set of three
-numbers the continent and the last four the web browser.
-
-Note that, if there is a possibility that the training data might have missing categorical
-features, one has to explicitly set ``n_values``. For example,
-
-    >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
-    >>> # Note that there are missing categorical values for the 2nd and 3rd
-    >>> # features
-    >>> enc.fit([[1, 2, 3], [0, 2, 0]])  # doctest: +ELLIPSIS
-    OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-           handle_unknown='error', n_values=[2, 3, 4], sparse=True)
-    >>> enc.transform([[1, 0, 0]]).toarray()
-    array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
+    >>> enc = preprocessing.CategoricalEncoder()
+    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+    >>> enc.fit(X)  # doctest: +ELLIPSIS
+    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+              encoding='onehot', handle_unknown='error')
+    >>> enc.transform([['female', 'from US', 'uses Safari'],
+    ...                ['male', 'from Europe', 'uses Safari']]).toarray()
+    array([[ 1.,  0.,  0.,  1.,  0.,  1.],
+           [ 0.,  1.,  1.,  0.,  0.,  1.]])
+
+By default, the values each feature can take are inferred automatically
+from the dataset and can be found in the ``categories_`` attribute::
+
+    >>> enc.categories_
+    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+
+It is possible to specify this explicitly using the parameter ``categories``.
+There are two genders, four possible continents and four web browsers in our
+dataset::
+
+    >>> genders = ['female', 'male']
+    >>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
+    >>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
+    >>> enc = preprocessing.CategoricalEncoder(categories=[genders, locations, browsers])
+    >>> # Note that there are missing categorical values for the 2nd and 3rd
+    >>> # feature
+    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+    >>> enc.fit(X)  # doctest: +ELLIPSIS
+    CategoricalEncoder(categories=[...],
+              dtype=<... 'numpy.float64'>, encoding='onehot',
+              handle_unknown='error')
+    >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
+    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
+
+If there is a possibility that the training data might have missing categorical
+features, it can often be better to specify ``handle_unknown='ignore'`` instead
+of setting the ``categories`` manually as above. When
+``handle_unknown='ignore'`` is specified and unknown categories are encountered
+during transform, no error will be raised but the resulting one-hot encoded
+columns for this feature will be all zeros
+(``handle_unknown='ignore'`` is only supported for one-hot encoding)::
+
+    >>> enc = preprocessing.CategoricalEncoder(handle_unknown='ignore')
+    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+    >>> enc.fit(X)  # doctest: +ELLIPSIS
+    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+              encoding='onehot', handle_unknown='ignore')
+    >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
+    array([[ 1.,  0.,  0.,  0.,  0.,  0.]])

 See :ref:`dict_feature_extraction` for categorical features that are represented
-as a dict, not as integers.
+as a dict, not as scalars.

 .. _imputation:
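As a quick usage sketch complementing the doctests above (assuming a scikit-learn build that contains this commit, so that CategoricalEncoder is importable), the two encodings can be contrasted on the same toy data:

    # Minimal sketch; assumes preprocessing.CategoricalEncoder is available.
    from sklearn.preprocessing import CategoricalEncoder

    X = [['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']]

    # Ordinal encoding: one integer-coded column per input feature -> shape (2, 3).
    ordinal = CategoricalEncoder(encoding='ordinal').fit(X)
    print(ordinal.transform(X))

    # One-hot encoding (the default): one binary column per category -> shape (2, 6),
    # returned as a sparse matrix, hence toarray() for display.
    onehot = CategoricalEncoder().fit(X)
    print(onehot.transform(X).toarray())
    print(onehot.categories_)   # categories learned per feature during fit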

doc/whats_new/_contributors.rst

Lines changed: 2 additions & 0 deletions
@@ -149,3 +149,5 @@
 .. _Neeraj Gangwar: http://neerajgangwar.in

 .. _Arthur Mensch: https://amensch.fr
+
+.. _Joris Van den Bossche: https://github.com/jorisvandenbossche

doc/whats_new/v0.20.rst

Lines changed: 11 additions & 0 deletions
@@ -45,6 +45,17 @@ Classifiers and regressors
   Naive Bayes classifier described in Rennie et al. (2003).
   By :user:`Michael A. Alcorn <airalcorn2>`.

+Preprocessing
+
+- Added :class:`preprocessing.CategoricalEncoder`, which allows encoding
+  categorical features as a numeric array, either using a one-hot (or
+  dummy) encoding scheme or by converting to ordinal integers.
+  Compared to the existing :class:`OneHotEncoder`, this new class handles
+  encoding of all feature types (also handles string-valued features) and
+  derives the categories based on the unique values in the features instead of
+  the maximum value in the features.
+  By :user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.
+
 Model evaluation

 - Added the :func:`metrics.balanced_accuracy_score` metric and a corresponding
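To illustrate the difference called out in this entry (the toy data below is hypothetical, and a development build containing this commit is assumed): unlike the legacy integer-only OneHotEncoder, CategoricalEncoder works directly on string-valued columns and derives each feature's category set from the unique values seen during fit:

    # Hypothetical toy data; assumes a build where CategoricalEncoder exists.
    import numpy as np
    from sklearn.preprocessing import CategoricalEncoder

    X = np.array([['cat', 'low'],
                  ['dog', 'high'],
                  ['cat', 'high']], dtype=object)

    enc = CategoricalEncoder(encoding='onehot').fit(X)
    print(enc.categories_)              # unique values per column, learned from the data
    print(enc.transform(X).toarray())   # shape (3, 4): two binary columns per feature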

examples/ensemble/plot_feature_transformation.py

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@
 from sklearn.linear_model import LogisticRegression
 from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                               GradientBoostingClassifier)
-from sklearn.preprocessing import OneHotEncoder
+from sklearn.preprocessing import CategoricalEncoder
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import roc_curve
 from sklearn.pipeline import make_pipeline
@@ -62,7 +62,7 @@

 # Supervised transformation based on random forests
 rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
-rf_enc = OneHotEncoder()
+rf_enc = CategoricalEncoder()
 rf_lm = LogisticRegression()
 rf.fit(X_train, y_train)
 rf_enc.fit(rf.apply(X_train))
@@ -72,7 +72,7 @@
 fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

 grd = GradientBoostingClassifier(n_estimators=n_estimator)
-grd_enc = OneHotEncoder()
+grd_enc = CategoricalEncoder()
 grd_lm = LogisticRegression()
 grd.fit(X_train, y_train)
 grd_enc.fit(grd.apply(X_train)[:, :, 0])
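Condensed, the pattern this example relies on after the swap looks roughly like the sketch below (the dataset and parameters here are illustrative, not taken from the example): leaf indices returned by rf.apply() are categorical integers, which CategoricalEncoder one-hot encodes before the linear model.

    # Rough sketch of the leaf-embedding pattern; dataset and parameters are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import CategoricalEncoder

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(max_depth=3, n_estimators=10).fit(X_train, y_train)
    enc = CategoricalEncoder().fit(rf.apply(X_train))     # leaf index per tree -> one-hot
    lm = LogisticRegression().fit(enc.transform(rf.apply(X_train)), y_train)
    print(lm.score(enc.transform(rf.apply(X_test)), y_test))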

sklearn/feature_extraction/dict_vectorizer.py

Lines changed: 4 additions & 3 deletions
@@ -39,7 +39,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
     However, note that this transformer will only do a binary one-hot encoding
     when feature values are of type string. If categorical features are
     represented as numeric values such as int, the DictVectorizer can be
-    followed by OneHotEncoder to complete binary one-hot encoding.
+    followed by :class:`sklearn.preprocessing.CategoricalEncoder` to complete
+    binary one-hot encoding.

     Features that do not occur in a sample (mapping) will have a zero value
     in the resulting array/matrix.
@@ -88,8 +89,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
     See also
     --------
     FeatureHasher : performs vectorization using only a hash function.
-    sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
-        encoded as columns of integers.
+    sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
+        features encoded as columns of arbitrary data types.
     """

     def __init__(self, dtype=np.float64, separator="=", sparse=True,
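A minimal sketch of the workflow the updated docstring describes (the dictionaries and feature names below are made up for illustration): DictVectorizer passes integer-coded categorical values through as plain numbers, and CategoricalEncoder then completes the one-hot step for that column.

    # Toy dicts with a hypothetical integer-coded categorical feature ('city_code').
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import CategoricalEncoder

    D = [{'city_code': 0, 'temperature': 21.0},
         {'city_code': 2, 'temperature': 17.5}]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(D)                                   # numeric values pass through unchanged
    city = X[:, [vec.feature_names_.index('city_code')]]       # keep the column 2-D for the encoder
    print(CategoricalEncoder().fit_transform(city).toarray())  # one column per distinct city code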

sklearn/preprocessing/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@
 from .data import minmax_scale
 from .data import quantile_transform
 from .data import OneHotEncoder
+from .data import CategoricalEncoder

 from .data import PolynomialFeatures

@@ -46,6 +47,7 @@
     'QuantileTransformer',
     'Normalizer',
     'OneHotEncoder',
+    'CategoricalEncoder',
     'RobustScaler',
     'StandardScaler',
     'add_dummy_feature',
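With this re-export in place, the class is importable straight from the sklearn.preprocessing namespace; a quick sanity check (assuming a build containing this commit) might look like:

    # Quick import check; assumes a scikit-learn build that includes this commit.
    from sklearn.preprocessing import CategoricalEncoder
    print(CategoricalEncoder())   # repr shows the defaults (encoding='onehot', categories='auto', ...)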

0 commit comments
