[MRG] DOC clarify LabelEncoder's docstring by adrinjalali · Pull Request #14642 · scikit-learn/scikit-learn

[MRG] DOC clarify LabelEncoder's docstring #14642


Merged: 4 commits merged into scikit-learn:master on Aug 14, 2019

Conversation

adrinjalali (Member)

This PR is an effort to emphasize that LabelEncoder should not be used to encode input features. I was reminded of this when I went back to check #12086 . I remember I made the same mistake back in the day.
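[Editor's note] A minimal sketch of the distinction the docstring change is about, with made-up data and column values: LabelEncoder is meant for a 1-d target vector, while OrdinalEncoder (or OneHotEncoder) handles a 2-d matrix of input features.

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    # Made-up data: two categorical input features and a string-valued target.
    X = np.array([["red", "S"],
                  ["blue", "M"],
                  ["red", "L"]])
    y = np.array(["cat", "dog", "cat"])

    # LabelEncoder is for the target y (1-d) only.
    y_enc = LabelEncoder().fit_transform(y)      # -> array([0, 1, 0])

    # For input features, use OrdinalEncoder (or OneHotEncoder),
    # which encodes a 2-d X column by column.
    X_enc = OrdinalEncoder().fit_transform(X)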

@adrinjalali adrinjalali changed the title clarify LabelEncoder's docstring DOC clarify LabelEncoder's docstring Aug 13, 2019
@adrinjalali adrinjalali changed the title DOC clarify LabelEncoder's docstring [MRG] DOC clarify LabelEncoder's docstring Aug 13, 2019
@amueller (Member)

I originally emphatically agreed with this. But now we're using this to implement OneHotEncoder, right? So I'm not so sure any more.

@NicolasHug (Member) left a comment


@amueller we don't directly use LabelEncoder in OneHotEncoder, just some of the utils defined in label.py

Nit but LGTM

@amueller (Member)

@NicolasHug huh, that must be a refactoring that I missed (unsurprisingly)

@jnothman (Member) left a comment


Glad for the increased reference to targets

using an ordinal encoding scheme.

sklearn.preprocessing.OneHotEncoder : Encode categorical integer features
Member

We support non integers, though the ohe docstring still emphasises integers
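[Editor's note] For context, an aside that is not part of the PR diff: OneHotEncoder accepts non-integer categories such as string-valued columns; the data below is invented for illustration.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # String (non-integer) categorical features are encoded directly.
    X = np.array([["red", "S"],
                  ["blue", "M"],
                  ["red", "L"]])
    enc = OneHotEncoder()                        # returns a sparse matrix by default
    print(enc.fit_transform(X).toarray())
    print(enc.categories_)                       # per-column categories, e.g. ['blue', 'red'] and ['L', 'M', 'S']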

Member Author

I just copy-pasted the docstring from ohe, I'll fix both then.

@@ -208,8 +211,11 @@ class LabelEncoder(BaseEstimator, TransformerMixin):

See also
--------
sklearn.preprocessing.OrdinalEncoder : encode categorical features
sklearn.preprocessing.OrdinalEncoder : Encode categorical input features
Member

Not sure why "input features" is clearer

Member Author

In some people's minds, "features" may not necessarily be strongly associated with "input". Once you start thinking of features as inputs, it's very clear; before that, you just have a matrix with a bunch of columns/features, and one of those columns happens to be your output. You use the rest of the matrix to predict that single column. From that perspective, a feature can be an input or an output. I have had use cases where we'd take a matrix and basically loop over the columns, trying to predict each one from the others!
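[Editor's note] A toy sketch of the "loop over columns" pattern mentioned above, with invented data and an arbitrary model, just to make the point concrete that any column can play the role of target:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    M = rng.normal(size=(100, 4))    # just a matrix of columns; none is "the" target yet

    # Each column takes a turn as the output, predicted from the remaining columns.
    for j in range(M.shape[1]):
        X = np.delete(M, j, axis=1)
        y = M[:, j]
        r2 = LinearRegression().fit(X, y).score(X, y)
        print(f"column {j} predicted from the others: R^2 = {r2:.3f}")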

Member

Hmmmm... In my experience the use of "features" means things that can be observed (or imputed) at test time. "Variable" has the mixed meaning of independent observed predictor and dependent target response.

For me "input feature" contrasts with the feature representation as output by transform

Member Author

I'd agree with you in general, but my point in this PR was "don't use this if you're dealing with input", and I thought putting emphasis on that would help. In the same way, you could argue that "target label" is also self-explanatory and clearly the output, yet people still don't realize that and end up trying to use this for input features. I removed it anyway.

@jnothman jnothman merged commit f749bed into scikit-learn:master Aug 14, 2019
@jnothman (Member)

Thanks. I don't think "target" is universal terminology either...

@adrinjalali adrinjalali deleted the doc/labelencoder branch August 15, 2019 07:39