[MRG] DOC clarify LabelEncoder's docstring by adrinjalali · Pull Request #14642 · scikit-learn/scikit-learn

[MRG] DOC clarify LabelEncoder's docstring #14642


Merged: 4 commits merged into scikit-learn:master on Aug 14, 2019

Conversation

adrinjalali (Member)

This PR is an effort to emphasize that LabelEncoder should not be used to encode input features. I was reminded of this when I went back to check #12086 . I remember I made the same mistake back in the day.
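[Editor's note] A minimal sketch of the distinction the docstring change is about, with made-up data and column values: LabelEncoder is meant for a 1-d target vector, while OrdinalEncoder (or OneHotEncoder) handles a 2-d matrix of input features.

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    # Made-up data: two categorical input features and a string-valued target.
    X = np.array([["red", "S"],
                  ["blue", "M"],
                  ["red", "L"]])
    y = np.array(["cat", "dog", "cat"])

    # LabelEncoder is for the target y (1-d) only.
    y_enc = LabelEncoder().fit_transform(y)      # -> array([0, 1, 0])

    # For input features, use OrdinalEncoder (or OneHotEncoder),
    # which encodes a 2-d X column by column.
    X_enc = OrdinalEncoder().fit_transform(X)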

@adrinjalali adrinjalali changed the title clarify LabelEncoder's docstring DOC clarify LabelEncoder's docstring Aug 13, 2019
@adrinjalali adrinjalali changed the title DOC clarify LabelEncoder's docstring [MRG] DOC clarify LabelEncoder's docstring Aug 13, 2019
@amueller (Member)

I originally emphatically agreed with this. But now we're using this to implement OneHotEncoder, right? So I'm not so sure any more.

@NicolasHug (Member) left a comment


@amueller we don't directly use LabelEncoder in OneHotEncoder, just some of the utils defined in label.py

Nit but LGTM

@amueller (Member)

@NicolasHug huh, that must be a refactoring that I missed (unsurprisingly)

@jnothman (Member) left a comment


Glad for the increased reference to targets

using an ordinal encoding scheme.

sklearn.preprocessing.OneHotEncoder : Encode categorical integer features
Member

We support non integers, though the ohe docstring still emphasises integers
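[Editor's note] For context, an aside that is not part of the PR diff: OneHotEncoder accepts non-integer categories such as string-valued columns; the data below is invented for illustration.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # String (non-integer) categorical features are encoded directly.
    X = np.array([["red", "S"],
                  ["blue", "M"],
                  ["red", "L"]])
    enc = OneHotEncoder()                        # returns a sparse matrix by default
    print(enc.fit_transform(X).toarray())
    print(enc.categories_)                       # per-column categories, e.g. ['blue', 'red'] and ['L', 'M', 'S']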

Member Author

I just copy-pasted the docstring from ohe, I'll fix both then.

@@ -208,8 +211,11 @@ class LabelEncoder(BaseEstimator, TransformerMixin):

See also
--------
sklearn.preprocessing.OrdinalEncoder : encode categorical features
sklearn.preprocessing.OrdinalEncoder : Encode categorical input features
Member

Not sure why "input features" is clearer

Member Author

In some people's minds, "features" may not necessarily be strongly associated with "input". Once you start thinking of features as inputs, it's very clear; before that, you just have a matrix with a bunch of columns/features, and one of those columns happens to be your output. You use the rest of the matrix to predict that single column. From that perspective, a feature can be an input or an output. I have had use cases where we'd take a matrix and basically loop over the columns, trying to predict each one from the others!
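[Editor's note] A toy sketch of the "loop over columns" pattern mentioned above, with invented data and an arbitrary model, just to make the point concrete that any column can play the role of target:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    M = rng.normal(size=(100, 4))    # just a matrix of columns; none is "the" target yet

    # Each column takes a turn as the output, predicted from the remaining columns.
    for j in range(M.shape[1]):
        X = np.delete(M, j, axis=1)
        y = M[:, j]
        r2 = LinearRegression().fit(X, y).score(X, y)
        print(f"column {j} predicted from the others: R^2 = {r2:.3f}")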

Member

Hmmmm... In my experience the use of "features" means things that can be observed (or imputed) at test time. "Variable" has the mixed meaning of independent observed predictor and dependent target response.

For me "input feature" contrasts with the feature representation as output by transform

Member Author

I'd agree with you in general, but my point in this PR was "don't use this if you're dealing with input", and I thought putting emphasis on that would help. In the same way, you could argue that "target label" is also self-explanatory and clearly the output, yet people still don't realize that and end up trying to use this for input features. I removed it anyway.

@jnothman jnothman merged commit f749bed into scikit-learn:master Aug 14, 2019
@jnothman (Member)

Thanks. I don't think "target" is universal terminology either...

@adrinjalali adrinjalali deleted the doc/labelencoder branch August 15, 2019 07:39