8000 [WIP] Make LabelEncoder more friendly to new labels by mjbommar · Pull Request #3243 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[WIP] Make LabelEncoder more friendly to new labels #3243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 59 commits into from

Conversation

mjbommar
Copy link
Contributor
@mjbommar mjbommar commented Jun 4, 2014

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

  • "raise": current behavior, i.e., raise ValueError; to remain default behavior
  • "nan": return np.nan for unseen/new labels
  • "update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign
  • "label": set newly seen labels to have fixed class new_label_class=-1

Tests and documentation updates included.

(edit: adding "label" to list for quick summary)

@jnothman
Copy link
Member
jnothman commented Jun 5, 2014

What about selecting a fixed "default" label? I assume this would be at
least as common as wanting nan as a label.

On 5 June 2014 03:42, Michael Bommarito notifications@github.com wrote:

This PR intends to make preprocessing.LabelEncoder more friendly for
production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform,
LabelEncoder may be initialized with new_labels as:

  • "raise": current behavior, i.e., raise ValueError; to remain
    default behavior
  • "nan": return np.nan for unseen/new labels
  • "update": update classes_ with new IDs [N, ..., N+m-1] for m new
    labels and assign

Tests and documentation updates included.

You can merge this Pull Request by running

git pull https://github.com/mjbommar/scikit-learn label-encoder-unseen

Or view, comment on, or merge it at:

#3243
Commit Summary

  • Adding new_labels argument to LabelEncoder
  • Adding tests for new_labels argument.
  • Changing classes_ update strategy
  • Adding nan behavior, renaming to
  • Updating tests to include nan case and update name
  • Fixing docstring for test-doc pass
  • Fixing docstring for test-doc pass (for real)
  • Updating doctests
  • Updating constructor documentation

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#3243.

@mjbommar
Copy link
Contributor Author
mjbommar commented Jun 5, 2014

@jnothman, good idea. I would normally hit it with pd.fillna after, but that would be even friendlier.

@mjbommar
Copy link
Contributor Author
mjbommar commented Jun 8, 2014

@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching.

@jnothman
Copy link
Member
jnothman commented Jun 8, 2014

Yes, I think an example would be helpful.

@mjbommar
Copy link
Contributor Author
mjbommar commented Jun 8, 2014

@jnothman, another subtle point about np.nan here - if we allow for np.nan encodings, we need to make our returned array as float*, not int*.

Which do we want?

  • All encodings returned as np.float64 (handles np.nan)
  • new_labels="nan" returned as np.float64, else np.int64

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0%) when pulling ab788f7 on mjbommar:label-encoder-unseen into d298a37 on scikit-learn:master.

@jnothman
Copy link
Member
jnothman commented Jun 8, 2014

As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed?

But if it's there, yes, I'd say upcast to a float type (could use find_common_type).

@mjbommar
Copy link
Contributor Author
mjbommar commented Jun 8, 2014

Well, this is the same issues that pandas addresses more generally, and Wes seemed to find no better solution than upcasting for np.nan:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless upcast, so do we need to use find_common_type?

@jnothman
Copy link
Member
jnothman commented Jun 8, 2014

Oh perhaps not. I was thinking of the case where a smaller float is
suitable.

On 9 June 2014 08:33, Michael Bommarito notifications@github.com wrote:

Well, this is the same issues that pandas addresses more generally, and
Wes seemed to find no better solution than upcasting for np.nan:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless
upcast, so do we need to use find_common_type?


Reply to this email directly or view it on GitHub
#3243 (comment)
.

@jnothman
Copy link
Member
jnothman commented Jun 8, 2014

But I'm not sure find_common_type helps there anyway

On 9 June 2014 08:44, Joel Nothman jnothman@student.usyd.edu.au wrote:

Oh perhaps not. I was thinking of the case where a smaller float is
suitable.

On 9 June 2014 08:33, Michael Bommarito notifications@github.com wrote:

Well, this is the same issues that pandas addresses more generally, and
Wes seemed to find no better solution than upcasting for np.nan:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless
upcast, so do we need to use find_common_type?


Reply to this email directly or view it on GitHub
#3243 (comment)
.

@mjbommar
Copy link
Contributor Author
mjbommar commented Jun 8, 2014

OK, the version I have currently pushed has the proposed float/int logic.

@mjbommar
Copy link
Contributor Author

@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure.

@jnothman
Copy link
Member

I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm.

@mjbommar
Copy link
Contributor Author

Cleanly rebased final PR pending.

@mjbommar
Copy link
Contributor Author

Closing for PR #3483.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0