ENH Adds Target Regression Encoder (Impact Encoder) #17323
Conversation
I'm not sure what you're illustrating in 2). That in this case there's very little smoothing? That's good, right? This seems like the encoding I'd want?
I think this is great, though haven't reviewed in detail yet.
doc/modules/preprocessing.rst
Outdated
the target conditioned on the categorical feature. The target encoding scheme
takes a weighted average of the overall target mean and the target mean
conditioned on categories. A multilevel linear model, where the levels are the
categories, is used to construct the weighted average is estimated in Chapter
They call it pooled average, right? Maybe say that? Also maybe provide the other references, even though not entirely applicable?
Also maybe say that this is the "GLM" version of target encoding and reference Max Kuhn or something and also the thesis I posted? I feel like it's hard to have too many references ;)
Isn't it just "multilevel linear model", i.e. not "generalized"? Or did I miss something essential?
Just as info: a multilevel linear model with only a random intercept, i.e. the approach chosen here, is also known as the (Bühlmann-Straub) "credibility estimator". Maybe too much information for the UG.
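For reference, a sketch of that credibility form (my notation, simplified to the plain Bühlmann case): the estimator for category $c$ is

$$\hat{\mu}_c = Z_c\,\bar{y}_c + (1 - Z_c)\,\bar{y}, \qquad Z_c = \frac{n_c}{n_c + k}, \qquad k = \frac{\sigma^2}{\tau^2}$$

where $\bar{y}_c$ is the category mean, $\bar{y}$ the overall mean, $\sigma^2$ the within-category (process) variance and $\tau^2$ the variance of the category means. Expanding $Z_c$ gives the same weighted-average (partial pooling) form described above.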
haven't checked the test yet, otherwise looks good.
sklearn/preprocessing/_encoders.py
Outdated
@@ -19,6 +21,25 @@
]


def _get_counts(values, uniques):
    """Get the number of times each of the values comes up `values`
"""Get the number of times each of the values comes up `values` | |
"""Get the number of times each of the values comes up in `values` |
This is just bincounts for integers and value_counts if we had pandas, right?
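A minimal sketch of that equivalence for integer-encoded values (my own illustration, not part of the PR):

```python
import numpy as np

values = np.array([0, 2, 2, 1, 0, 2])   # hypothetical integer-encoded column
uniques = np.array([0, 1, 2])

# np.bincount returns the count of each integer code in one pass
counts = np.bincount(values, minlength=len(uniques))
print(counts)   # [2 1 3]
```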
sklearn.preprocessing.OrdinalEncoder : Performs an ordinal (integer)
    encoding of the categorical features.
sklearn.preprocessing.OneHotEncoder : Performs a one-hot encoding of
    categorical features.
Add references here as well? We're not super consistent with that, are we?
I think we should avoid references in the docstrings because they lack context, and we sometimes end up with big lists of refs that are impractical because some of them are about super specific details.
It's better to have them in the UG so that we can explicitly refer to them as in "this is detailed in blahblah" or "this was introduced in ...".
In my recent UG-cleaning PRs, I've removed refs from docstrings.
Attributes
----------
cat_encodings_ : list of ndarray
Maybe worth giving the shape of this somehow? Took me a minute to understand the short description.
"For each feature, provide the encoding corresponding to each category, in the order of `categories_`"?
cat_var_ratio = np.ones(n_cats, dtype=float)

for encoding in range(n_cats):
    np.equal(X_int[:, i], encoding, out=tmp_mask)
Is that the same as `tmp_mask[:] = X_int[:, i] == encoding`, and do you find this one more readable?
I used `np.equal` so we do not need to allocate more memory in the inner for loop. From my understanding, the right hand side of `tmp_mask[:] = X_int[:, i] == encoding` will allocate an ndarray and then copy it over to `tmp_mask`.
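A minimal sketch of the two forms being compared (illustrative only; `tmp_mask` is a preallocated boolean array as in the PR):

```python
import numpy as np

X_int = np.array([[0, 1, 1, 2, 0]]).T            # hypothetical integer-encoded data
i, encoding = 0, 1
tmp_mask = np.empty(X_int.shape[0], dtype=bool)  # allocated once, outside the loop

# writes the comparison result directly into tmp_mask (no temporary array)
np.equal(X_int[:, i], encoding, out=tmp_mask)

# equivalent result, but the right hand side first allocates a temporary
# boolean array which is then copied into tmp_mask
tmp_mask[:] = X_int[:, i] == encoding
```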
cat_means = np.zeros(n_cats, dtype=float)
cat_var_ratio = np.ones(n_cats, dtype=float)

for encoding in range(n_cats):
I'm slightly afraid of this being slow but I'm not sure how to do it quickly without pandas or cython. I guess if it's integer encoded you could do some `csr_matrix` trick? Is it worth it?
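One possible loop-free sketch using `np.bincount` instead of the `csr_matrix` trick (my own illustration, not part of the PR):

```python
import numpy as np

# hypothetical integer-encoded column and regression targets
codes = np.array([0, 1, 1, 2, 0, 1])          # categories encoded as 0..n_cats-1
y = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0])
n_cats = 3

counts = np.bincount(codes, minlength=n_cats)
sums = np.bincount(codes, weights=y, minlength=n_cats)
# per-category target means without a Python-level loop over categories
cat_means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
```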
doc/modules/preprocessing.rst
Outdated
.. [GEL] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression
   and Multilevel/Hierarchical Models. Cambridge University Press, 2007
This is a $50+ textbook, can we have a free reference?
There's a pdf online, right? But yes to more references like the paper that no-one cites and the masters' thesis ;)
I do not know if it is okay to share links to the book.
@@ -593,6 +593,51 @@ the 2 features with respectively 3 categories each.
See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

.. _target_regressor_encoder:

Target Regressor Encoder
This should be a subsection of the "Encoding Categorical Features" section above, not a separate one.
Maybe you'll need to create subsections for OE and OHE too.
doc/modules/preprocessing.rst
Outdated
conditioned on categories. A multilevel linear model, where the levels are the
categories, is used to construct the weighted average is estimated in Chapter
12 of [GEL]_:
This sentence is grammatically incorrect
encoding for `'cat'` is pulled toward the overall mean of `53` when compared to
`'dog'` because the `'cat'` category appears less frequently::
This isn't obvious at first since the encoding for dog is still much closer to the global mean than the encoding for cat.
It might be clearer if the mean for dog was significantly different from 53
Since there are so many dogs, its mean will be fairly close to the global mean.
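A made-up numeric illustration of that frequency effect (the weights follow the partial-pooling form used by the encoder; `lam` is a hypothetical smoothing weight):

```python
y_mean = 53.0   # overall target mean from the example
lam = 5.0       # hypothetical smoothing weight (variance ratio)

def encode(n, cat_mean):
    # weighted average of the category mean and the overall mean
    return (n * cat_mean + lam * y_mean) / (n + lam)

print(encode(n=3, cat_mean=40.0))    # rare category: ~48.1, pulled toward 53
print(encode(n=50, cat_mean=40.0))   # frequent category: ~41.2, stays near 40
```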
sklearn/preprocessing/_encoders.py
Outdated
@@ -19,6 +21,25 @@
]


def _get_counts(values, uniques):
    """Get the number of times each of the values comes up `values`
this sentence is not correct?
# unknown
X_trans = enc.transform([unknown_X])
assert_allclose(X_trans, [[y.mean()]])
is this relevant to test that here?
assert_allclose(X_trans[-1], [y.mean()])

assert len(enc.cat_encodings_) == 1
# unknown category seen during fitting is mapped to the mean
Isn't this instead a known category unseen during fitting?
i.e. 2 and 'cow'?
y_mean = np.mean(y)

# manually compute multilevel partial pooling
feat_0_cat_0_encoding = ((4 * 4.0 / 5.0 + 4.0 / 9.5) /
maybe avoid using the `.0` notation which adds noise
# check known test data
X_test = np.array([[2, 0, 1]], dtype=int).T
X_input = categories[X_test]
X_trans = enc.transform(X_input)
expected_encoding = cat_encoded[X_test]
assert_allclose(expected_encoding, X_trans)
This part does not seem to test anything more than the one just above with the training data? Could it be removed?
doc/modules/preprocessing.rst
Outdated
Target Regressor Encoder
========================
The :class:`~sklearn.preprocessing.TargetRegressorEncoder` uses statistics of
We should start by describing in which cases this Encoder could or should be used. AFAIK it's useful when there are lots of unordered categories, so the OHE would be too expensive? Zipcodes sounds like the go-to illustration use-case?
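A hypothetical sketch of the zip-code use case (the commented-out `TargetRegressorEncoder` call is the API proposed in this PR and may change):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
zip_codes = rng.randint(10000, 15000, size=(1000, 1)).astype(str)  # high-cardinality feature
y = rng.normal(loc=200_000, scale=50_000, size=1000)

# one-hot encoding: one column per distinct zip code
ohe = OneHotEncoder(handle_unknown="ignore")
print(ohe.fit_transform(zip_codes).shape)        # (1000, n_distinct_zip_codes)

# target encoding: a single numeric column per feature
# enc = TargetRegressorEncoder()                 # class proposed in this PR
# print(enc.fit_transform(zip_codes, y).shape)   # (1000, 1)
```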
@thomasjpfan this looks really useful, and I am excited you are adding it to sklearn! Out of curiosity, is this similar to the encoding scheme used by [category_encoders.target_encoder.TargetEncoder](http://contrib.scikit-learn.org/category_encoders/targetencoder.html)?
@jnothman If the house price does depend on population density, then the marginal mean of house price per zip code does also depend on population density. If this marginal mean is then used to encode zip code (e.g. instead of OHE), for linear models or certain interpretability/explainability tools, the effect of (residual, i.e. without pop density) zip code is blurred by the effect of population density.
Yes, this is certainly not a tool for improving interpretability.
As for naming, I would want the word "Encoder" somewhere in the name. In the end, I am thinking of having two encoders, one for categorical targets and another for regression targets. Maybe:
Initially, I had it named
Interesting, #18012 just linked the package feature-engine in the docs, which calls it
I'd rather CategoryMeanEncoder in terms of language...
Related to this PR, here is a very interesting report on the TargetEncoder implemented in cuML that was used by the RAPIDS.ai team to win the RecSys 2020 challenge. In particular, it's very interesting that they identified that the internal CV is really needed to avoid introducing a distribution shift between train and test for the downstream classifier. There are also notes about scalability and parallelism to tackle very large datasets efficiently, because apparently this pre-processing was a significant part of the computational load of their winning pipeline for this large scale problem (because of the internal CV). That being said, the
There's another benchmark on using different categorical encoders (medium)(github) which might be of interest for this PR as well. The gist is that using CV is crucial for any supervised categorical encoding, as was found by the NVIDIA team winning the RecSys challenge. In addition to CV, there's also the possibility to add a 2nd layer of CV (as I understand, the implementation from cuML only does one layer of CV) advertised to reduce overfitting even further, which is also benchmarked in this medium blogpost.
Thanks very much. For those annoyed by the medium paywall, just open the URL in a private browsing session.
This was the primary reason why I did not use an internal CV for this PR. When I was looking at sources on impact/target encoders, I commonly saw that a CV was important. The current implementation is trying to automatically weight the "group mean" and the "overall mean" by multilevel estimates of the mean for each group.
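As a sketch of that weighting (my notation; it matches the `cat_encoded` computation quoted further down):

$$\hat{\mu}_c = \frac{n_c\,\bar{y}_c + \lambda_c\,\bar{y}}{n_c + \lambda_c}$$

where $n_c$ is the number of samples in category $c$, $\bar{y}_c$ the within-category target mean, $\bar{y}$ the overall mean, and $\lambda_c$ the variance ratio used as the smoothing weight (`cat_var_ratio` in the code). Small $n_c$ shrinks the encoding toward $\bar{y}$.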
That's the James-Stein encoder, that you're implementing, right?
Oooo it is! It is nice to know this encoding scheme has a name.
cat_encoded = cat_counts * cat_means + cat_var_ratio * y_mean
cat_encoded /= cat_counts + cat_var_ratio
cat_encodings.append(cat_encoded)
I am not sure that this matches the James Stein estimator as described in this post:
https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8
We need to check how this implementation is related to https://github.com/scikit-learn-contrib/category_encoders/blob/master/category_encoders/james_stein.py
According to https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8 , even when using the James Stein encoder you need to do some form of nested cross-validation, otherwise the generalization performance of the overall pipeline is degraded.
To summarize some offline discussions: as an alternative to the strategy "cross_val_predict to fit_transform the training set for the downstream classifier, then retrain encoder on full train data to be used to transform test data" that leads to the .fit_transform != .fit.transform discrepancy, we could also consider the following bagging strategy: "train k encoders on k sub/resampled training sets and then use the k encoders' averaged predictions to transform both the training and testing data for the downstream classifier". This strategy would not induce the .fit_transform != .fit.transform discrepancy and, while not exactly nested CV, it should allow us to mitigate most of the train / test distribution shift induced by using supervised encoders as pre-processors. This bagging strategy could also be useful for general (multivariate) stacking ensembles. But AFAIK, this strategy has not been well studied in the literature, so it could be considered problematic to implement in scikit-learn as it kind of breaks our rule of only implementing standard, "not invented here" methods.
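A rough sketch of that bagging idea (purely illustrative, not an existing scikit-learn API; `encoder` stands for any supervised categorical encoder with a fit/transform interface):

```python
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def bagged_transform(encoder, X_train, y_train, X, n_bags=10, random_state=0):
    """Average the transforms of n_bags encoders fit on resampled training sets."""
    rng = np.random.RandomState(random_state)
    transforms = []
    for _ in range(n_bags):
        X_res, y_res = resample(X_train, y_train, random_state=rng)
        enc = clone(encoder).fit(X_res, y_res)
        transforms.append(enc.transform(X))
    return np.mean(transforms, axis=0)

# the same bagged encoders would be used to transform both the training and
# the test data, so fit_transform and fit().transform() stay consistent
```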
The strategy above would be very similar to
There might be yet another good option to avoid overfitting on the training set.
Another, statistically well founded option is to use one and the same value for all
In addition, I'd use the homogeneous estimator for
Reference:
@lorentzenchr That looks interesting. Do you have a reference for classification targets? Going through the code and diagrams of the blog post mentioned in the comments, I think all of them include a sense of bagging. The "single" and "double" validation both generate models on different folds. During prediction time, the predictions are combined as shown here. I extended this PR with benchmarks here: https://github.com/thomasjpfan/sk_encoder_cv where the README shows the results. All the datasets are from OpenML.
TLDR: For the most part, the CV version of the target encoder does better than or on par with the non-CV version. For the telco or amazon_access datasets, the CV version does quite a bit better.
I recently added more benchmarks with more datasets with cv=10 here: https://github.com/thomasjpfan/sk_encoder_cv. Note I do not show datasets where all the encoders perform the same. The "bagging" approach used 10 bootstrap samples to train 10 encoders and averaged their outputs.
Thoughts:
Reference Issues/PRs
Partially Addresses #5853
Closes #9614
What does this implement/fix? Explain your changes.
There are some interesting points about this scheme: as `n_j` increases, the mean conditioned on a category plays a bigger role.
Any other comments?
There are ways to run the encoding with CV during train time, but that would break our API contract of having `fit().transform()` == `fit_transform()`.
This is missing an example that shows where it is more useful when compared to `OneHotEncoder`.
The classifier version of this encoder will follow if we decide this encoder should be included in scikit-learn.