ENH Adds TargetEncoder (#25334) · scikit-learn/scikit-learn@392fdee · GitHub


Commit 392fdee

thomasjpfan, amueller, ogrisel, jovan-stojanovic, and glemaitre authored
ENH Adds TargetEncoder (#25334)
Co-authored-by: Andreas Mueller <t3kcit@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jovan Stojanovic <62058944+jovan-stojanovic@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
1 parent d7f62b0 commit 392fdee

File tree

11 files changed: +1394 -1 lines changed

doc/images/target_encoder_cross_validation.svg

+3

doc/modules/classes.rst

+1
@@ -1438,6 +1438,7 @@ details.
   preprocessing.RobustScaler
   preprocessing.SplineTransformer
   preprocessing.StandardScaler
   preprocessing.TargetEncoder

.. autosummary::
   :toctree: generated/

doc/modules/preprocessing.rst

+86
@@ -830,6 +830,92 @@ lexicon order.
    >>> enc.infrequent_categories_
    [array(['b', 'c'], dtype=object)]

.. _target_encoder:

Target Encoder
--------------

.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` uses the target mean conditioned on the categorical
feature for encoding unordered categories, i.e. nominal categories [PAR]_
[MIC]_. This encoding scheme is useful with categorical features of high
cardinality, where one-hot encoding would inflate the feature space, making it
more expensive for a downstream model to process. A classical example of high
cardinality categories are location-based ones such as zip code or region. For a
binary classification target, the target encoding is given by:

.. math::
    S_i = \lambda_i\frac{n_{iY}}{n_i} + (1 - \lambda_i)\frac{n_Y}{n}

where :math:`S_i` is the encoding for category :math:`i`, :math:`n_{iY}` is the
number of observations with :math:`Y=1` and category :math:`i`, :math:`n_i` is
the number of observations with category :math:`i`, :math:`n_Y` is the number of
observations with :math:`Y=1`, :math:`n` is the number of observations, and
:math:`\lambda_i` is a shrinkage factor. The shrinkage factor is given by:

.. math::
    \lambda_i = \frac{n_i}{m + n_i}

where :math:`m` is a smoothing factor, which is controlled with the `smooth`
parameter in :class:`TargetEncoder`. Large smoothing factors will put more
weight on the global mean. When `smooth="auto"`, the smoothing factor is
computed as an empirical Bayes estimate: :math:`m=\sigma_i^2/\tau^2`, where
:math:`\sigma_i^2` is the variance of `y` within category :math:`i` and
:math:`\tau^2` is the global variance of `y`.
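As a rough, hand-computed illustration of this shrinkage (an addition to this write-up, not part of the documented diff; the toy arrays and the smoothing value are made up)::

    import numpy as np

    X = np.array(["a", "a", "a", "b", "b"])
    y = np.array([1, 1, 0, 0, 1])

    m = 2.0                     # assumed smoothing factor
    target_mean = y.mean()      # global mean n_Y / n

    for category in np.unique(X):
        mask = X == category
        n_i = mask.sum()
        category_mean = y[mask].mean()   # n_iY / n_i
        lam = n_i / (m + n_i)            # shrinkage factor lambda_i
        encoding = lam * category_mean + (1 - lam) * target_mean
        print(category, encoding)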
For continuous targets, the formulation is similar to binary classification:

.. math::
    S_i = \lambda_i\frac{\sum_{k\in L_i}y_k}{n_i} + (1 - \lambda_i)\frac{\sum_{k=1}^{n}y_k}{n}

where :math:`L_i` is the set of observations for which :math:`X=X_i` and
:math:`n_i` is the cardinality of :math:`L_i`.

:meth:`~TargetEncoder.fit_transform` internally relies on a cross validation
scheme to prevent information from the target from leaking into the train-time
representation, especially for non-informative high-cardinality categorical
variables, and to help prevent the downstream model from overfitting spurious
correlations. Note that, as a result, `fit(X, y).transform(X)` does not equal
`fit_transform(X, y)`. In :meth:`~TargetEncoder.fit_transform`, the training
data is split into multiple folds and each fold is encoded using the encodings
learned from the other folds. After cross validation is complete,
:meth:`~TargetEncoder.fit_transform` learns one final encoding on the whole
training set. This final encoding is used to encode categories in
:meth:`~TargetEncoder.transform`. The following diagram shows the cross
validation scheme in :meth:`~TargetEncoder.fit_transform` with the default
`cv=5`:

.. image:: ../images/target_encoder_cross_validation.svg
   :width: 600
   :align: center

The :meth:`~TargetEncoder.fit` method does **not** use any cross validation
scheme and learns one encoding on the entire training set, which is used to
encode categories in :meth:`~TargetEncoder.transform`. This encoding is the
same as the final encoding learned in :meth:`~TargetEncoder.fit_transform`.
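A minimal sketch of the difference between the two code paths (added here for illustration and not part of the changeset; the random toy data is made up)::

    import numpy as np
    from sklearn.preprocessing import TargetEncoder

    rng = np.random.default_rng(0)
    X = rng.choice(["a", "b", "c"], size=(100, 1))
    y = rng.normal(size=100)

    enc = TargetEncoder(target_type="continuous", cv=5, random_state=0)
    X_cv = enc.fit_transform(X, y)   # per-fold encodings learned on the other folds
    X_full = enc.transform(X)        # final encoding learned on the full training set

    print(np.allclose(X_cv, X_full))  # False in general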
.. note::
  :class:`TargetEncoder` considers missing values, such as `np.nan` or `None`,
  as another category and encodes them like any other category. Categories
  that are not seen during `fit` are encoded with the target mean, i.e.
  `target_mean_`.
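A small sketch of this behaviour (illustrative only, not from the changeset; the toy data is made up)::

    import numpy as np
    from sklearn.preprocessing import TargetEncoder

    X = np.array([["dog"], ["dog"], ["cat"], [None], [None]], dtype=object)
    y = np.array([90.0, 85.0, 70.0, 72.0, 88.0])

    enc = TargetEncoder(target_type="continuous").fit(X, y)
    print(enc.transform(np.array([[None]], dtype=object)))     # None is encoded like any other category
    print(enc.transform(np.array([["snake"]], dtype=object)))  # unseen category -> enc.target_mean_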
.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py`

.. topic:: References

  .. [MIC] :doi:`Micci-Barreca, Daniele. "A preprocessing scheme for high-cardinality
     categorical attributes in classification and prediction problems"
     SIGKDD Explor. Newsl. 3, 1 (July 2001), 27-32. <10.1145/507533.507538>`

  .. [PAR] :doi:`Pargent, F., Pfisterer, F., Thomas, J. et al. "Regularized target
     encoding outperforms traditional methods in supervised machine learning with
     high cardinality features" Comput Stat 37, 2671-2692 (2022)
     <10.1007/s00180-022-01207-6>`

.. _preprocessing_discretization:

Discretization

doc/whats_new/v1.3.rst

+4
@@ -322,6 +322,10 @@ Changelog

:mod:`sklearn.preprocessing`
............................

- |MajorFeature| Introduces :class:`preprocessing.TargetEncoder` which is a
  categorical encoding based on the target mean conditioned on the value of the
  category. :pr:`25334` by `Thomas Fan`_.

- |Enhancement| Adds a `feature_name_combiner` parameter to
  :class:`preprocessing.OneHotEncoder`. This specifies a custom callable to create
  feature names to be returned by :meth:`get_feature_names_out`.
examples/preprocessing/plot_target_encoder.py

+227

@@ -0,0 +1,227 @@
"""
============================================
Comparing Target Encoder with Other Encoders
============================================

.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` uses the value of the target to encode each
categorical feature. In this example, we will compare four different approaches
for handling categorical features: :class:`TargetEncoder`,
:class:`OrdinalEncoder`, :class:`OneHotEncoder` and dropping the category.

.. note::
    `fit(X, y).transform(X)` does not equal `fit_transform(X, y)` because a
    cross-validation scheme is used in `fit_transform` for encoding. See the
    :ref:`User Guide <target_encoder>` for details.
"""

# %%
# Loading Data from OpenML
# ========================
# First, we load the wine reviews dataset, where the target is the points given
# by a reviewer:
from sklearn.datasets import fetch_openml

wine_reviews = fetch_openml(data_id=42074, as_frame=True, parser="pandas")

df = wine_reviews.frame
df.head()

# %%
# For this example, we use the following subset of numerical and categorical
# features in the data. The target consists of continuous values from 80 to 100:
numerical_features = ["price"]
categorical_features = [
    "country",
    "province",
    "region_1",
    "region_2",
    "variety",
    "winery",
]
target_name = "points"

X = df[numerical_features + categorical_features]
y = df[target_name]

_ = y.hist()
# %%
# Training and Evaluating Pipelines with Different Encoders
# =========================================================
# In this section, we evaluate pipelines using
# :class:`~sklearn.ensemble.HistGradientBoostingRegressor` with different encoding
# strategies. First, we list out the encoders we will be using to preprocess
# the categorical features:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import TargetEncoder

categorical_preprocessors = [
    ("drop", "drop"),
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    (
        "one_hot",
        OneHotEncoder(handle_unknown="ignore", max_categories=20, sparse_output=False),
    ),
    ("target", TargetEncoder(target_type="continuous")),
]

# %%
# Next, we evaluate the models using cross validation and record the results:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.ensemble import HistGradientBoostingRegressor

n_cv_folds = 3
max_iter = 20
results = []


def evaluate_model_and_store(name, pipe):
    result = cross_validate(
        pipe,
        X,
        y,
        scoring="neg_root_mean_squared_error",
        cv=n_cv_folds,
        return_train_score=True,
    )
    rmse_test_score = -result["test_score"]
    rmse_train_score = -result["train_score"]
    results.append(
        {
            "preprocessor": name,
            "rmse_test_mean": rmse_test_score.mean(),
            "rmse_test_std": rmse_test_score.std(),
            "rmse_train_mean": rmse_train_score.mean(),
            "rmse_train_std": rmse_train_score.std(),
        }
    )


for name, categorical_preprocessor in categorical_preprocessors:
    preprocessor = ColumnTransformer(
        [
            ("numerical", "passthrough", numerical_features),
            ("categorical", categorical_preprocessor, categorical_features),
        ]
    )
    pipe = make_pipeline(
        preprocessor, HistGradientBoostingRegressor(random_state=0, max_iter=max_iter)
    )
    evaluate_model_and_store(name, pipe)
# %%
# Native Categorical Feature Support
# ==================================
# In this section, we build and evaluate a pipeline that uses native categorical
# feature support in :class:`~sklearn.ensemble.HistGradientBoostingRegressor`,
# which only supports up to 255 unique categories. In our dataset, most of
# the categorical features have more than 255 unique categories:
n_unique_categories = df[categorical_features].nunique().sort_values(ascending=False)
n_unique_categories

# %%
# To work around the limitation above, we group the categorical features into
# low cardinality and high cardinality features. The high cardinality features
# will be target encoded and the low cardinality features will use the native
# categorical feature support in gradient boosting.
high_cardinality_features = n_unique_categories[n_unique_categories > 255].index
low_cardinality_features = n_unique_categories[n_unique_categories <= 255].index
mixed_encoded_preprocessor = ColumnTransformer(
    [
        ("numerical", "passthrough", numerical_features),
        (
            "high_cardinality",
            TargetEncoder(target_type="continuous"),
            high_cardinality_features,
        ),
        (
            "low_cardinality",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            low_cardinality_features,
        ),
    ],
    verbose_feature_names_out=False,
)

# The output of the preprocessor must be set to pandas so the
# gradient boosting model can detect the low cardinality features.
mixed_encoded_preprocessor.set_output(transform="pandas")
mixed_pipe = make_pipeline(
    mixed_encoded_preprocessor,
    HistGradientBoostingRegressor(
        random_state=0, max_iter=max_iter, categorical_features=low_cardinality_features
    ),
)
mixed_pipe

# %%
# Finally, we evaluate the pipeline using cross validation and record the results:
evaluate_model_and_store("mixed_target", mixed_pipe)
# %%
# Plotting the Results
# ====================
# In this section, we display the results by plotting the test and train scores:
import matplotlib.pyplot as plt
import pandas as pd

results_df = (
    pd.DataFrame(results).set_index("preprocessor").sort_values("rmse_test_mean")
)

fig, (ax1, ax2) = plt.subplots(
    1, 2, figsize=(12, 8), sharey=True, constrained_layout=True
)
xticks = range(len(results_df))
name_to_color = dict(
    zip((r["preprocessor"] for r in results), ["C0", "C1", "C2", "C3", "C4"])
)

for subset, ax in zip(["test", "train"], [ax1, ax2]):
    mean, std = f"rmse_{subset}_mean", f"rmse_{subset}_std"
    data = results_df[[mean, std]].sort_values(mean)
    ax.bar(
        x=xticks,
        height=data[mean],
        yerr=data[std],
        width=0.9,
        color=[name_to_color[name] for name in data.index],
    )
    ax.set(
        title=f"RMSE ({subset.title()})",
        xlabel="Encoding Scheme",
        xticks=xticks,
        xticklabels=data.index,
    )
# %%
# When evaluating the predictive performance on the test set, dropping the
# categories performs the worst and the target encoder performs the best. This
# can be explained as follows:
#
# - Dropping the categorical features makes the pipeline less expressive and
#   leads to underfitting as a result;
# - Due to the high cardinality and to reduce the training time, the one-hot
#   encoding scheme uses `max_categories=20`, which prevents the features from
#   expanding too much but can also result in underfitting;
# - If we had not set `max_categories=20`, the one-hot encoding scheme would have
#   likely made the pipeline overfit as the number of features explodes with rare
#   category occurrences that are correlated with the target by chance (on the training
#   set only);
# - The ordinal encoding imposes an arbitrary order on the features, which are then
#   treated as numerical values by the
#   :class:`~sklearn.ensemble.HistGradientBoostingRegressor`. Since this
#   model groups numerical features in 256 bins per feature, many unrelated categories
#   can be grouped together and as a result the overall pipeline can underfit;
# - When using the target encoder, the same binning happens, but since the encoded
#   values are statistically ordered by marginal association with the target variable
#   (see the short check below), the binning used by the
#   :class:`~sklearn.ensemble.HistGradientBoostingRegressor` makes sense and leads to
#   good results: the combination of smoothed target encoding and binning works as a
#   good regularizing strategy against overfitting while not limiting the
#   expressiveness of the pipeline too much.
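
# %%
# As a rough, illustrative check of the last point (added here and not part of
# the original example), we can fit a standalone target encoder on a single
# high-cardinality column and look at the correlation between the cross-fitted
# encoded values and the target:
winery_encoded = TargetEncoder(target_type="continuous").fit_transform(
    X[["winery"]], y
)
pd.Series(winery_encoded.ravel(), index=y.index).corr(y)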

setup.py

+6
@@ -295,6 +295,12 @@ def check_package_status(package, min_version):
        ],
        "preprocessing": [
            {"sources": ["_csr_polynomial_expansion.pyx"], "include_np": True},
            {
                "sources": ["_target_encoder_fast.pyx"],
                "include_np": True,
                "language": "c++",
                "extra_compile_args": ["-std=c++11"],
            },
        ],
        "neighbors": [
            {"sources": ["_ball_tree.pyx"], "include_np": True},

sklearn/preprocessing/__init__.py

+2
@@ -26,6 +26,7 @@

from ._encoders import OneHotEncoder
from ._encoders import OrdinalEncoder
from ._target_encoder import TargetEncoder

from ._label import label_binarize
from ._label import LabelBinarizer

@@ -56,6 +57,7 @@
    "RobustScaler",
    "SplineTransformer",
    "StandardScaler",
    "TargetEncoder",
    "add_dummy_feature",
    "PolynomialFeatures",
    "binarize",

sklearn/preprocessing/_encoders.py

+7 -1

@@ -415,6 +415,7 @@ class OneHotEncoder(_BaseEncoder):
    --------
    OrdinalEncoder : Performs an ordinal (integer)
      encoding of the categorical features.
    TargetEncoder : Encodes categorical features using the target.
    sklearn.feature_extraction.DictVectorizer : Performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : Performs an approximate one-hot

@@ -1229,7 +1230,12 @@ class OrdinalEncoder(OneToOneFeatureMixin, _BaseEncoder):

    See Also
    --------
    OneHotEncoder : Performs a one-hot encoding of categorical features. This encoding
        is suitable for low to medium cardinality categorical variables, both in
        supervised and unsupervised settings.
    TargetEncoder : Encodes categorical features using supervised signal
        in a classification or regression pipeline. This encoding is typically
        suitable for high cardinality categorical variables.
    LabelEncoder : Encodes target labels with values between 0 and
        ``n_classes-1``.
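
Taken together, the updated ``See Also`` guidance amounts to picking the encoder per column by cardinality. A minimal sketch of that pattern (illustrative only, not from the changeset; the column names and the split into "low" and "high" cardinality columns are made up)::

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, TargetEncoder

    preprocessor = ColumnTransformer(
        [
            # few categories: one-hot encoding keeps the feature space small
            ("low_card", OneHotEncoder(handle_unknown="ignore"), ["color"]),
            # many categories: target encoding avoids exploding the feature space
            ("high_card", TargetEncoder(), ["zip_code"]),
        ]
    )
    # preprocessor is meant to be used inside a supervised pipeline, e.g.
    # make_pipeline(preprocessor, some_regressor).fit(X, y)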
