[MRG] DOC clarify LabelEncoder's docstring #14642
Conversation
I originally emphatically agreed with this. But now we're using this to implement OneHotEncoder, right? So I'm not so sure any more.
@amueller we don't directly use LabelEncoder in OneHotEncoder, just some of the utils defined in label.py.
Nit but LGTM
@NicolasHug huh, that must be a refactoring that I missed (unsurprisingly).
Glad for the increased reference to targets
sklearn/preprocessing/label.py (Outdated)

    using an ordinal encoding scheme.

    sklearn.preprocessing.OneHotEncoder : Encode categorical integer features
We support non-integers, though the OneHotEncoder docstring still emphasises integers.
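For illustration, a minimal sketch (toy data, not part of this PR's diff) showing that OneHotEncoder accepts string-valued, non-integer categories directly:

```python
from sklearn.preprocessing import OneHotEncoder

# String-valued categorical feature; no integer pre-encoding is needed.
X = [["red"], ["green"], ["red"]]
enc = OneHotEncoder()
print(enc.fit_transform(X).toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
```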
I just copy-pasted the docstring from OneHotEncoder; I'll fix both then.
sklearn/preprocessing/label.py (Outdated)

@@ -208,8 +211,11 @@ class LabelEncoder(BaseEstimator, TransformerMixin):

     See also
     --------
-    sklearn.preprocessing.OrdinalEncoder : encode categorical features
+    sklearn.preprocessing.OrdinalEncoder : Encode categorical input features
Not sure why "input features" is clearer
In some people's minds, "features" may not necessarily be strongly associated with "input". Once you start thinking of features as inputs, it's very clear; before that, you just have a matrix with a bunch of columns/features, and one of those columns happens to be your output. You use the rest of the matrix to predict that single column. From that perspective, a feature can be an input or an output. I have had use cases where we'd take a matrix and basically do a for loop over columns, trying to predict each one using the other ones!
Hmmmm... In my experience the use of "features" means things that can be observed (or imputed) at test time. "Variable" has the mixed meaning of independent observed predictor and dependent target response.
For me "input feature" contrasts with the feature representation as output by transform
I'd agree with you in general, but my point in this PR was "don't use this if you're dealing with input"; I thought putting emphasis on that would help. You could similarly argue that "target label" is self-explanatory and clearly refers to the output, yet people still don't realize that and end up trying to use this for input features. I removed it anyway.
Thanks. I don't think "target" is universal terminology either...
This PR is an effort to emphasize that LabelEncoder should not be used to encode input features. I was reminded of this when I went back to check #12086. I remember I made the same mistake back in the day.
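To illustrate the intended split (a minimal sketch with made-up data, not part of the PR diff): LabelEncoder is meant for a 1-d target array, while OrdinalEncoder handles a 2-d feature matrix:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Encode the target y with LabelEncoder (expects a 1-d array).
y = ["cat", "dog", "cat", "bird"]
le = LabelEncoder()
print(le.fit_transform(y))   # [1 2 1 0]
print(le.classes_)           # ['bird' 'cat' 'dog']

# Encode input features X with OrdinalEncoder (expects a 2-d array).
X = [["paris", "small"], ["tokyo", "large"], ["paris", "large"]]
oe = OrdinalEncoder()
print(oe.fit_transform(X))
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]
```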