[WIP] Make LabelEncoder more friendly to new labels #3243

mjbommar · 2014-06-04T17:42:49Z

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

"raise": current behavior, i.e., raise ValueError; to remain default behavior
"nan": return np.nan for unseen/new labels
"update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign
"label": set newly seen labels to have fixed class new_label_class=-1
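
For readers skimming the thread, the four modes can be sketched outside scikit-learn roughly as follows. This is a minimal illustration, not the PR's code: `encode_labels` is a hypothetical helper, and assigning the new IDs in sorted order under "update" is an assumption, since the description only gives the ID range.

```python
import numpy as np


def encode_labels(classes, y, new_labels="raise", new_label_class=-1):
    """Hypothetical stand-in for the transform modes described above.

    `classes` must be a sorted 1-D sequence of the labels seen during fit.
    Returns (codes, classes); `classes` only grows under "update".
    """
    classes = np.asarray(classes)
    y = np.asarray(y)

    # Position of each label in the sorted classes; clip so indexing is safe.
    idx = np.searchsorted(classes, y)
    clipped = np.minimum(idx, len(classes) - 1)
    unseen = classes[clipped] != y

    if not unseen.any():
        return idx, classes
    if new_labels == "raise":
        raise ValueError("y contains new labels: %s" % np.unique(y[unseen]))
    if new_labels == "nan":
        # NaN needs a float array, so the integer codes are upcast.
        codes = idx.astype(np.float64)
        codes[unseen] = np.nan
        return codes, classes
    if new_labels == "label":
        codes = idx.copy()
        codes[unseen] = new_label_class
        return codes, classes
    if new_labels == "update":
        # Append the m new labels after the N fitted ones: IDs N..N+m-1.
        # Assumption: new labels are numbered in sorted order.
        new = np.unique(y[unseen])
        codes = idx.copy()
        codes[unseen] = len(classes) + np.searchsorted(new, y[unseen])
        # The grown array is not guaranteed to be sorted any more, so this
        # sketch should not be re-applied to the returned classes.
        return codes, np.concatenate([classes, new])
    raise ValueError("unknown new_labels mode: %r" % new_labels)
```

With classes `["a", "b", "c"]` and input `["a", "d", "c", "e", "d"]`, the "update" mode in this sketch yields codes `[0, 3, 2, 4, 3]` and the "label" mode yields `[0, -1, 2, -1, -1]`.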

Tests and documentation updates included.

(edit: adding "label" to list for quick summary)

jnothman · 2014-06-05T12:48:18Z

What about selecting a fixed "default" label? I assume this would be at
least as common as wanting nan as a label.

mjbommar · 2014-06-05T12:52:41Z

@jnothman, good idea. I would normally hit it with pd.fillna after, but that would be even friendlier.

mjbommar · 2014-06-08T14:04:55Z

@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching.

jnothman · 2014-06-08T14:14:21Z

Yes, I think an example would be helpful.

mjbommar · 2014-06-08T14:57:51Z

@jnothman, another subtle point about np.nan here: if we allow np.nan encodings, we need to return the array as float*, not int*.

Which do we want?

All encodings returned as np.float64 (handles np.nan)
new_labels="nan" returned as np.float64, else np.int64
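
The dtype constraint behind this question can be reproduced with plain NumPy. The snippet below is only an illustration of the second option; `result_dtype` is a hypothetical helper, not something from the PR.

```python
import numpy as np

codes = np.array([0, 2, 1], dtype=np.int64)  # integer codes for seen labels
unseen = np.array([False, True, False])      # mask of unseen labels

# An integer array cannot store NaN: the assignment raises ValueError.
try:
    codes[unseen] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

# So the "nan" mode has to upcast before marking unseen labels.
as_float = codes.astype(np.float64)
as_float[unseen] = np.nan


def result_dtype(new_labels):
    """Second option: float only when NaN is requested, int otherwise."""
    return np.float64 if new_labels == "nan" else np.int64
```

The trade-off is that the output dtype then depends on a constructor argument, whereas the first option keeps one dtype at the cost of always returning floats.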

coveralls · 2014-06-08T20:05:54Z

Coverage decreased (-0.0%) when pulling ab788f7 on mjbommar:label-encoder-unseen into d298a37 on scikit-learn:master.

jnothman · 2014-06-08T22:21:56Z

As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed?

But if it's there, yes, I'd say upcast to a float type (could use find_common_type).

mjbommar · 2014-06-08T22:33:14Z

Well, this is the same issue that pandas addresses more generally, and Wes seemed to find no better solution than upcasting for np.nan:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless upcast, so do we need to use find_common_type?
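
As a side note on the promotion itself, NumPy's rules can be inspected directly. The sketch below uses `np.promote_types` rather than `np.find_common_type`, because the latter has been deprecated and removed in recent NumPy releases. It shows which float dtype each integer dtype is promoted to when combined with the smallest float.

```python
import numpy as np

# Float dtype NumPy promotes to when an integer dtype meets float16.
promoted = {
    np.dtype(int_dtype).name: np.promote_types(int_dtype, np.float16).name
    for int_dtype in (np.int8, np.int16, np.int32, np.int64)
}

for int_name, float_name in promoted.items():
    print(int_name, "->", float_name)
```

Since the encoded indices come back as a platform integer (typically int64), the promotion lands on float64 in practice, which is consistent with simply hard-coding the upcast.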

jnothman · 2014-06-08T22:44:52Z

Oh perhaps not. I was thinking of the case where a smaller float is
suitable.

jnothman · 2014-06-08T22:45:17Z

But I'm not sure find_common_type helps there anyway

mjbommar · 2014-06-08T22:59:42Z

OK, the version I have currently pushed has the proposed float/int logic.

mjbommar · 2014-06-10T14:56:07Z

@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure.

jnothman · 2014-06-10T15:51:32Z

I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm.

mjbommar · 2014-07-24T12:28:42Z

Cleanly rebased final PR pending.

mjbommar · 2014-07-24T19:07:35Z

Closing for PR #3483.

mjbommar added 9 commits June 3, 2014 22:31

Adding new_labels argument to LabelEncoder

4d79789

Adding tests for new_labels argument.

2bc5686

Changing classes_ update strategy

8c1fafe

Adding nan behavior, renaming to

1ffb24a

Updating tests to include nan case and update name

99f65a9

Fixing docstring for test-doc pass

af8c6a9

Fixing docstring for test-doc pass (for real)

8ffc839

Updating doctests

e6fbc47

Updating constructor documentation

46118d9

mjbommar added 6 commits June 5, 2014 11:09

Adding specific "label" option to new_labels

8d21ec1

Adding test for "label" option to new_labels

343c726

Updating docstring for new_labels="label"

be97c14

pep8

cdd7147

Autodoc fix

170d00c

Fixing rst docs

2d87e88

mjbommar added 2 commits June 8, 2014 14:04

Changing dtypes for new_labels

bb8d9a6

Adding example for new_labels argument

ab788f7

mjbommar added 25 commits July 24, 2014 08:03

Adding tests for new_labels argument.

d990207

Changing classes_ update strategy

a69840b

Adding nan behavior, renaming to

fce9fb5

Updating tests to include nan case and update name

76921e5

Fixing docstring for test-doc pass

0e39a2a

Fixing docstring for test-doc pass (for real)

1da2880

Updating doctests

926b166

Updating constructor documentation

5ef9b85

Adding specific "label" option to new_labels

4dfb4cb

Adding test for "label" option to new_labels

392e54b

Updating docstring for new_labels="label"

e053635

pep8

122a98f

Autodoc fix

de18372

Fixing rst docs

d735ca2

Changing dtypes for new_labels

d276565

Adding example for new_labels argument

a01f8b0

Adding new_labels handling to fit/fit_transform

495347c

Improving difficulty of test cases with non-increasing unseen labels

dee4ae0

Moving ValueError check to fit

c297017

Improving difficult for new_labels='update' test to include multiple …

f29800b

…transform with new labels

Fixing negative indexing, renamed z->out, failing approach for new_la…

74b7589

…bels=update w/ searchsorted

PEP8

3e1be5d

Removing nan option and corresponding test

abf01cc

Handling repeated transform calls with new_class_mapping_, refactorin…

f26a902

…g, cleaning after removing np.nan.

Rebase

0725d4c

mjbommar mentioned this pull request Jul 24, 2014

[MRG-0] Make LabelEncoder more friendly to new labels #3483

Closed

mjbommar closed this Jul 24, 2014

mjbommar mentioned this pull request Aug 27, 2014

[WIP] Sparse and Multioutput LabelEncoder #3592

Closed

7 tasks
