[WIP] Make LabelEncoder more friendly to new labels #3243
What about selecting a fixed "default" label? I assume this would be at On 5 June 2014 03:42, Michael Bommarito notifications@github.com wrote:
|
@jnothman, good idea. I would normally hit it with |
@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching. |
Yes, I think an example would be helpful. |
@jnothman, another subtle point about Which do we want?
|
As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed? But if it's there, yes, I'd say upcast to a float type (could use find_common_type). |
Well, this is the same issues that I think we know deterministically that the indices will be integer unless upcast, so do we need to use |
Oh perhaps not. I was thinking of the case where a smaller float is On 9 June 2014 08:33, Michael Bommarito notifications@github.com wrote:
|
But I'm not sure find_common_type helps there anyway On 9 June 2014 08:44, Joel Nothman jnothman@student.usyd.edu.au wrote:
|
OK, the version I have currently pushed has the proposed float/int logic. |
@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure. |
I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm. |
…transform with new labels
…bels=update w/ searchsorted
…g, cleaning after removing np.nan.
Cleanly rebased final PR pending. |
Closing for PR #3483. |
This PR intends to make `preprocessing.LabelEncoder` more friendly for production/pipeline usage by adding a `new_labels` constructor argument.

Instead of always raising ValueError for unseen/new labels in `transform`, LabelEncoder may be initialized with `new_labels` as:

- `"raise"`: current behavior, i.e., raise ValueError; to remain default behavior
- `"nan"`: return np.nan for unseen/new labels
- `"update"`: update `classes_` with new IDs `[N, ..., N+m-1]` for `m` new labels and assign them
- `"label"`: set newly seen labels to have fixed class `new_label_class=-1`

Tests and documentation updates included.

(edit: adding `"label"` to list for quick summary)
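For readers who want to see the four modes side by side, here is a minimal standalone sketch. It is not the code from this PR and does not touch scikit-learn: the function name `encode_with_new_labels` and its signature are hypothetical, and only the option names (`"raise"`, `"nan"`, `"update"`, `"label"`) and `new_label_class=-1` come from the description above. It assumes labels are hashable and that `classes` plays the role of a fitted `classes_`.

```python
import numpy as np


def encode_with_new_labels(classes, y, new_labels="raise", new_label_class=-1):
    """Hypothetical helper (not the PR's implementation).

    Encodes `y` against `classes` and returns (codes, classes).
    `classes` only grows when new_labels="update".
    """
    if new_labels not in ("raise", "nan", "update", "label"):
        raise ValueError("unknown new_labels option: %r" % (new_labels,))

    classes = list(classes)
    index = {label: i for i, label in enumerate(classes)}
    # Unseen labels, sorted so that "update" assigns IDs deterministically.
    unseen = sorted({label for label in y if label not in index})

    if unseen and new_labels == "raise":
        raise ValueError("y contains new labels: %s" % unseen)

    fill = None  # only used for the "nan" and "label" modes
    if new_labels == "update":
        # m new labels receive the IDs N, ..., N+m-1.
        for label in unseen:
            index[label] = len(classes)
            classes.append(label)
    elif new_labels == "nan":
        fill = np.nan  # forces the result to a float dtype
    elif new_labels == "label":
        fill = new_label_class

    codes = np.array([index.get(label, fill) for label in y])
    return codes, classes


fitted = ["a", "b", "c"]
y = ["a", "d", "c", "e"]

codes_update, classes_update = encode_with_new_labels(fitted, y, "update")
codes_label, _ = encode_with_new_labels(fitted, y, "label")
codes_nan, _ = encode_with_new_labels(fitted, y, "nan")
```

With this sketch, `"update"` yields integer codes and an extended class list, `"label"` stays integer, and `"nan"` returns a float array, because an integer NumPy array cannot hold `np.nan`. That is the float/int upcasting question discussed in the thread.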