-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
ENH Allows multiclass target in TargetEncoder
#26674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the ordering:
feat0_class0, feat0_class1, feat0_class2, feat1_class0
At this point, it'll good to prioritize writing tests to make sure multiclass gives reasonable results.
n_classes = self._label_binarizer_.classes_.shape[0] | ||
X_ordinal, X_valid = [ | ||
np.repeat(X, n_classes, axis=1) for X in (X_ordinal, X_valid) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here regarding not needing to repeat X_ordinal
and X_valid
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. This LGTM besides the following suggestions:
In the multiclass case, `X_ordinal` and `X_unknown_mask` have column | ||
(axis=1) size `n_features`, while `encodings` has length of size | ||
`n_features * n_classes`. `feat_idx` deals with this by repeating | ||
feature indicies by `n_classes` E.g., for 3 features, 2 classes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this suggestion was not applied.
@ogrisel thank you for the review, changes made. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good. Thanks very much for the PR @lucyleeow! I pushed a few more small improvements / fixes and I will merge if CI is green.
Thanks @ogrisel and @thomasjpfan ! |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
closes #26613
What does this implement/fix? Explain your changes.
Allows multiclass target type in
TargetEncoder
, following section 3.3 of Micci-Barreca et al.Uses
LabelBinarizer
to perform on vs rest ony
and for each feature and calculates one vs rest target mean for each class, thus expanding number of features ton_features * n_classes
.Any other comments?
First attempt, needs more thought on some aspects.
I am conflicted on the best order of the output features. Currently the order of features is:
I think grouping features may make more sense:
which should not be too computationally expensive, should just require an additional re-ordering of
encodings_
(list of ndarray), which can be done via list comprehension using list of reordering indices.Any suggestions welcome.
EDIT: have now amended such that same features are grouped together.
TODO:
get_feature_names_out
for new features names that include classescc @thomasjpfan