FEA Refactor CategoricalEncoder into OneHotEncoder and OrdinalEncoder… · scikit-learn/scikit-learn@007aa71 · GitHub
Commit 007aa71

jorisvandenbossche authored and jnothman committed
FEA Refactor CategoricalEncoder into OneHotEncoder and OrdinalEncoder (#10523)
Deprecated some OneHotEncoder behaviour
1 parent bb5110b commit 007aa71

File tree

17 files changed: +1454 additions, -1075 deletions

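In code terms, the refactor splits one class into two. A minimal before/after sketch of the new API (toy data; assumes the 0.20 development tree this commit landed in):

    # Before: a single CategoricalEncoder selected the scheme via a parameter,
    # e.g. CategoricalEncoder(encoding='onehot') or CategoricalEncoder(encoding='ordinal').
    # After: one estimator per encoding scheme.
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = [['male', 'from US'], ['female', 'from Europe']]

    onehot = OneHotEncoder().fit(X)      # one binary column per category
    ordinal = OrdinalEncoder().fit(X)    # one integer column per feature

    print(onehot.transform(X).toarray())  # sparse output by default, hence .toarray()
    print(ordinal.transform(X))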

doc/datasets/index.rst

Lines changed: 2 additions & 3 deletions
@@ -456,9 +456,8 @@ refer to:
   for reading WAV files into a numpy array

 Categorical (or nominal) features stored as strings (common in pandas DataFrames)
-will need converting to integers, and integer categorical variables may be best
-exploited when encoded as one-hot variables
-(:class:`sklearn.preprocessing.OneHotEncoder`) or similar.
+will need converting to numerical features using :class:`sklearn.preprocessing.OneHotEncoder`
+or :class:`sklearn.preprocessing.OrdinalEncoder` or similar.
 See :ref:`preprocessing`.

 Note: if you manage your own numerical data it is recommended to use an

doc/glossary.rst

Lines changed: 4 additions & 2 deletions
@@ -165,8 +165,10 @@ General Concepts
     tree-based models such as random forests and gradient boosting
     models that often work better and faster with integer-coded
     categorical variables.
-    :class:`~sklearn.preprocessing.CategoricalEncoder` helps
-    encoding string-valued categorical features.
+    :class:`~sklearn.preprocessing.OrdinalEncoder` helps encoding
+    string-valued categorical features as ordinal integers, and
+    :class:`~sklearn.preprocessing.OneHotEncoder` can be used to
+    one-hot encode categorical features.
     See also :ref:`preprocessing_categorical_features` and the
     `http://contrib.scikit-learn.org/categorical-encoding
     <category_encoders>`_ package for tools related to encoding

doc/modules/classes.rst

Lines changed: 1 addition & 1 deletion
@@ -1248,7 +1248,7 @@ Model validation
    preprocessing.MinMaxScaler
    preprocessing.Normalizer
    preprocessing.OneHotEncoder
-   preprocessing.CategoricalEncoder
+   preprocessing.OrdinalEncoder
    preprocessing.PolynomialFeatures
    preprocessing.PowerTransformer
    preprocessing.QuantileTransformer

doc/modules/preprocessing.rst

Lines changed: 18 additions & 17 deletions
@@ -508,15 +508,13 @@ Such features can be efficiently coded as integers, for instance
 ``[1, 2, 1]``.

 To convert categorical features to such integer codes, we can use the
-:class:`CategoricalEncoder`. When specifying that we want to perform an
-ordinal encoding, the estimator transforms each categorical feature to one
+:class:`OrdinalEncoder`. This estimator transforms each categorical feature to one
 new feature of integers (0 to n_categories - 1)::

-    >>> enc = preprocessing.CategoricalEncoder(encoding='ordinal')
+    >>> enc = preprocessing.OrdinalEncoder()
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)  # doctest: +ELLIPSIS
-    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
-              encoding='ordinal', handle_unknown='error')
+    OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)
     >>> enc.transform([['female', 'from US', 'uses Safari']])
     array([[0., 1., 1.]])

@@ -528,18 +526,19 @@ browsers was ordered arbitrarily).
 Another possibility to convert categorical features to features that can be used
 with scikit-learn estimators is to use a one-of-K, also known as one-hot or
 dummy encoding.
-This type of encoding is the default behaviour of the :class:`CategoricalEncoder`.
-The :class:`CategoricalEncoder` then transforms each categorical feature with
+This type of encoding can be obtained with the :class:`OneHotEncoder`,
+which transforms each categorical feature with
 ``n_categories`` possible values into ``n_categories`` binary features, with
 one of them 1, and all others 0.

 Continuing the example above::

-    >>> enc = preprocessing.CategoricalEncoder()
+    >>> enc = preprocessing.OneHotEncoder()
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)  # doctest: +ELLIPSIS
-    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
-              encoding='onehot', handle_unknown='error')
+    OneHotEncoder(categorical_features=None, categories=None,
+           dtype=<... 'numpy.float64'>, handle_unknown='error',
+           n_values=None, sparse=True)
     >>> enc.transform([['female', 'from US', 'uses Safari'],
     ...                ['male', 'from Europe', 'uses Safari']]).toarray()
     array([[1., 0., 0., 1., 0., 1.],

@@ -558,14 +557,15 @@ dataset::
     >>> genders = ['female', 'male']
     >>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
     >>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
-    >>> enc = preprocessing.CategoricalEncoder(categories=[genders, locations, browsers])
+    >>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
     >>> # Note that for there are missing categorical values for the 2nd and 3rd
     >>> # feature
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)  # doctest: +ELLIPSIS
-    CategoricalEncoder(categories=[...],
-              dtype=<... 'numpy.float64'>, encoding='onehot',
-              handle_unknown='error')
+    OneHotEncoder(categorical_features=None,
+           categories=[...],
+           dtype=<... 'numpy.float64'>, handle_unknown='error',
+           n_values=None, sparse=True)
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
     array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

@@ -577,11 +577,12 @@ during transform, no error will be raised but the resulting one-hot encoded
 columns for this feature will be all zeros
 (``handle_unknown='ignore'`` is only supported for one-hot encoding)::

-    >>> enc = preprocessing.CategoricalEncoder(handle_unknown='ignore')
+    >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)  # doctest: +ELLIPSIS
-    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
-              encoding='onehot', handle_unknown='ignore')
+    OneHotEncoder(categorical_features=None, categories=None,
+           dtype=<... 'numpy.float64'>, handle_unknown='ignore',
+           n_values=None, sparse=True)
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
     array([[1., 0., 0., 0., 0., 0.]])
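Beyond the doctests above, the fitted encoders expose the categories they derived through the new ``categories_`` attribute added in this commit. A short sketch with the same toy data (assuming the 0.20 API):

    from sklearn.preprocessing import OneHotEncoder

    X = [['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']]
    enc = OneHotEncoder(handle_unknown='ignore').fit(X)

    print(enc.categories_)  # one sorted array of category values per column
    # A category unseen during fit encodes as all zeros for that feature:
    print(enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray())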

doc/whats_new/v0.20.rst

Lines changed: 17 additions & 6 deletions
@@ -70,14 +70,14 @@ Classifiers and regressors

 Preprocessing

-- Added :class:`preprocessing.CategoricalEncoder`, which allows to encode
-  categorical features as a numeric array, either using a one-hot (or dummy)
-  encoding scheme or by converting to ordinal integers. Compared to the
-  existing :class:`~preprocessing.OneHotEncoder`, this new class handles
+- Expanded :class:`preprocessing.OneHotEncoder` to allow to encode
+  categorical string features as a numeric array using a one-hot (or dummy)
+  encoding scheme, and added :class:`preprocessing.OrdinalEncoder` to
+  convert to ordinal integers. Those two classes now handle
   encoding of all feature types (also handles string-valued features) and
   derives the categories based on the unique values in the features instead of
-  the maximum value in the features. :issue:`9151` by :user:`Vighnesh Birodkar
-  <vighneshbirodkar>` and `Joris Van den Bossche`_.
+  the maximum value in the features. :issue:`9151` and :issue:`10521` by
+  :user:`Vighnesh Birodkar <vighneshbirodkar>` and `Joris Van den Bossche`_.

 - Added :class:`compose.ColumnTransformer`, which allows to apply
   different transformers to different columns of arrays or pandas

@@ -584,6 +584,17 @@ Linear, kernelized and related models
   :class:`linear_model.LogisticRegression` when ``verbose`` is set to 0.
   :issue:`10881` by :user:`Alexandre Sevin <AlexandreSev>`.

+Preprocessing
+
+- Deprecate ``n_values`` and ``categorical_features`` parameters and
+  ``active_features_``, ``feature_indices_`` and ``n_values_`` attributes
+  of :class:`preprocessing.OneHotEncoder`. The ``n_values`` parameter can be
+  replaced with the new ``categories`` parameter, and the attributes with the
+  new ``categories_`` attribute. Selecting the categorical features with
+  the ``categorical_features`` parameter is now better supported using the
+  :class:`compose.ColumnTransformer`.
+  :issue:`10521` by `Joris Van den Bossche`_.
+
 Decomposition, manifold learning and clustering

 - Deprecate ``precomputed`` parameter in function
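To make that deprecation note concrete, a hedged migration sketch (hypothetical data; the old call is shown only in a comment since it now triggers a deprecation warning):

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Old style, now deprecated:
    #     OneHotEncoder(categorical_features=[0], n_values=[3])
    # New style: select the categorical column with ColumnTransformer and
    # spell out the categories instead of giving a maximum value.
    ct = ColumnTransformer(
        [('onehot', OneHotEncoder(categories=[[0, 1, 2]], sparse=False), [0])],
        remainder='passthrough')

    X = [[0, 1.5], [2, -0.5], [1, 3.0]]  # column 0 categorical, column 1 numeric
    print(ct.fit_transform(X))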

examples/ensemble/plot_feature_transformation.py

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@
 from sklearn.linear_model import LogisticRegression
 from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                               GradientBoostingClassifier)
-from sklearn.preprocessing import CategoricalEncoder
+from sklearn.preprocessing import OneHotEncoder
 from sklearn.model_selection import train_test_split
 from sklearn.metrics import roc_curve
 from sklearn.pipeline import make_pipeline

@@ -62,7 +62,7 @@

 # Supervised transformation based on random forests
 rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
-rf_enc = CategoricalEncoder()
+rf_enc = OneHotEncoder()
 rf_lm = LogisticRegression()
 rf.fit(X_train, y_train)
 rf_enc.fit(rf.apply(X_train))

@@ -72,7 +72,7 @@
 fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

 grd = GradientBoostingClassifier(n_estimators=n_estimator)
-grd_enc = CategoricalEncoder()
+grd_enc = OneHotEncoder()
 grd_lm = LogisticRegression()
 grd.fit(X_train, y_train)
 grd_enc.fit(grd.apply(X_train)[:, :, 0])

sklearn/compose/_column_transformer.py

Lines changed: 4 additions & 4 deletions
@@ -636,19 +636,19 @@ def make_column_transformer(*transformers, **kwargs):

     Examples
     --------
-    >>> from sklearn.preprocessing import StandardScaler, CategoricalEncoder
+    >>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
     >>> from sklearn.compose import make_column_transformer
     >>> make_column_transformer(
     ...     (['numerical_column'], StandardScaler()),
-    ...     (['categorical_column'], CategoricalEncoder()))
+    ...     (['categorical_column'], OneHotEncoder()))
     ...     # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
     ColumnTransformer(n_jobs=1, remainder='passthrough',
              transformer_weights=None,
              transformers=[('standardscaler',
                             StandardScaler(...),
                             ['numerical_column']),
-                           ('categoricalencoder',
-                            CategoricalEncoder(...),
+                           ('onehotencoder',
+                            OneHotEncoder(...),
                             ['categorical_column'])])

     """

sklearn/ensemble/forest.py

Lines changed: 2 additions & 1 deletion
@@ -1932,7 +1932,8 @@ def fit_transform(self, X, y=None, sample_weight=None):
         super(RandomTreesEmbedding, self).fit(X, y,
                                               sample_weight=sample_weight)

-        self.one_hot_encoder_ = OneHotEncoder(sparse=self.sparse_output)
+        self.one_hot_encoder_ = OneHotEncoder(sparse=self.sparse_output,
+                                              categories='auto')
         return self.one_hot_encoder_.fit_transform(self.apply(X))

     def transform(self, X):
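The explicit ``categories='auto'`` here is presumably what keeps :class:`RandomTreesEmbedding` on the new code path: it feeds the encoder integer leaf indices from ``apply``, and the new behaviour derives the categories from the unique values in each column rather than from the maximum value (see the whats_new entry above). A toy sketch (values illustrative):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    leaves = np.array([[3], [7], [3]])  # e.g. leaf indices returned by apply(X)

    enc = OneHotEncoder(categories='auto').fit(leaves)
    print(enc.categories_)                  # [array([3, 7])]: unique values only
    print(enc.transform(leaves).toarray())  # one column per observed leaf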

sklearn/feature_extraction/dict_vectorizer.py

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
     However, note that this transformer will only do a binary one-hot encoding
     when feature values are of type string. If categorical features are
     represented as numeric values such as int, the DictVectorizer can be
-    followed by :class:`sklearn.preprocessing.CategoricalEncoder` to complete
+    followed by :class:`sklearn.preprocessing.OneHotEncoder` to complete
     binary one-hot encoding.

     Features that do not occur in a sample (mapping) will have a zero value

@@ -89,7 +89,7 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
     See also
     --------
     FeatureHasher : performs vectorization using only a hash function.
-    sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
+    sklearn.preprocessing.OrdinalEncoder : handles nominal/categorical
       features encoded as columns of arbitrary data types.
     """
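A sketch of the workflow this docstring describes, under the 0.20 API (the data and the column index are illustrative):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import OneHotEncoder

    measurements = [{'city': 'Dubai', 'temperature_band': 2},
                    {'city': 'London', 'temperature_band': 0}]
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(measurements)
    print(vec.get_feature_names())
    # ['city=Dubai', 'city=London', 'temperature_band']
    # The string feature was one-hot encoded; the int column stayed numeric,
    # so OneHotEncoder completes the binary encoding for it:
    enc = OneHotEncoder(categories='auto', sparse=False)
    print(enc.fit_transform(X[:, [2]]))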

sklearn/feature_extraction/hashing.py

Lines changed: 1 addition & 2 deletions
@@ -81,8 +81,7 @@ class FeatureHasher(BaseEstimator, TransformerMixin):
     See also
     --------
     DictVectorizer : vectorizes string-valued features using a hash table.
-    sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
-      encoded as columns of integers.
+    sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features.
     """

     def __init__(self, n_features=(2 ** 20), input_type="dict",
