Handle missing values in OrdinalEncoder · Issue #11997 · scikit-learn/scikit-learn · GitHub

Closed
jnothman opened this issue Sep 4, 2018 · 11 comments

@jnothman
Member
jnothman commented Sep 4, 2018

A minimal implementation would pass through NaNs from the input to the output of transform and make sure the presence of NaN does not affect the categories identified in fit.

A missing_values parameter might allow the user to configure what object is a placeholder for missingness (e.g. NaN, None, etc.).

See #10465 for background
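
To make the "minimal implementation" above concrete, here is a rough pure-Python sketch, not scikit-learn code. It assumes `None` and float NaN are the missing markers; the helper names `is_missing`, `fit_categories`, and `transform` are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_categories(column):
    """Collect the sorted distinct values of a column, skipping missing ones."""
    return sorted({v for v in column if not is_missing(v)})


def transform(column, categories):
    """Map each value to its category index; missing values pass through as NaN."""
    index = {cat: float(i) for i, cat in enumerate(categories)}
    return [math.nan if is_missing(v) else index[v] for v in column]


column = ["low", math.nan, "high", "low", None]
categories = fit_categories(column)
encoded = transform(column, categories)
print(categories)  # ['high', 'low']
print(encoded)     # [1.0, nan, 0.0, 1.0, nan]
```

The codes are floats here only because NaN has no integer representation; the missing entries do not show up in `categories`.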

@jnothman added the Easy, help wanted, and good first issue labels on Sep 4, 2018
@maxcopeland
Contributor

Hi @jnothman, do you mind if I work on this?

@jnothman
Member Author
jnothman commented Sep 4, 2018

Go for it

@jnothman
Member Author
jnothman commented Sep 4, 2018

I suppose we might also consider a handle_missing param that would allow NaN to be encoded as the smallest/largest number...?
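
As a rough illustration of that idea, the sketch below encodes missing values at either end of the code range. The `handle_missing` option and its values `"smallest"` / `"largest"` are hypothetical and only mirror the suggestion above; this is plain Python, not the scikit-learn API.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_with_missing(column, handle_missing="largest"):
    """Ordinal-encode a column, placing missing values at one end of the range.

    "smallest": missing values get code 0 and categories are shifted up by one.
    "largest": missing values get the code right after the last category.
    """
    if handle_missing not in ("smallest", "largest"):
        raise ValueError("handle_missing must be 'smallest' or 'largest'")
    categories = sorted({v for v in column if not is_missing(v)})
    offset = 1 if handle_missing == "smallest" else 0
    missing_code = 0 if handle_missing == "smallest" else len(categories)
    index = {cat: i + offset for i, cat in enumerate(categories)}
    return [missing_code if is_missing(v) else index[v] for v in column]


column = ["b", math.nan, "a", "c"]
smallest = encode_with_missing(column, "smallest")
largest = encode_with_missing(column, "largest")
print(smallest)  # [2, 0, 1, 3]
print(largest)   # [1, 3, 0, 2]
```

Unlike NaN passthrough, both variants keep the output integer-valued, at the cost of giving missingness an arbitrary position in the ordering.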

@jashrathod

Hi.
I'm new to open source contributions. Can someone help me get started?

@maxcopeland
Contributor

I'm currently working on this issue, but I think the best way to start is to review the contributing guidelines. When you see an issue no one is working on, ask the member who submitted it if you can get started. (I'm fairly new to this project myself.)

@CatChenal
Contributor

I wish the help wanted tag would disappear once a contributor adopts an issue...

@shashvat-kedia

@jnothman I am new to this project and would like to contribute. Can I start by working on this issue?

@jnothman
Member Author
jnothman commented Apr 1, 2019 via email

@jnothman removed the Easy and good first issue labels on Apr 1, 2019
@Catadanna

A suggestion: assign 0 only to missing values and start encoding from 1 (not from 0, as is done now), even when there are no missing values in the data set. Such a convention would make missing values easier to identify afterwards (in order to handle them).

@glemaitre
Member
glemaitre commented Nov 13, 2019

Adding the add_indicator option to the imputer also makes things more difficult right now.
Indeed, one has to define an imputer + encoder pipeline. If add_indicator=True, the imputer outputs extra indicator columns, which you don't want to encode.

The workaround is to build a column transformer with a MissingIndicator and set add_indicator=False for the imputer.

A reasonable use case would be to first encode ignoring the missing values and then apply the imputer.

I might pick this up and review the different PRs.

EDIT: Since we will encode the missing values as a category, we will not need add_indicator=True in practice.
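
For the "encode first, then impute" use case mentioned above, a minimal pure-Python sketch might look like the following. It assumes the encoder passes missing values through as NaN; the most-frequent fill is only a stand-in for a real imputer, and the helpers `encode_passthrough` and `impute_most_frequent` are hypothetical names.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_passthrough(column):
    """Ordinal-encode observed values; leave missing entries as NaN."""
    categories = sorted({v for v in column if not is_missing(v)})
    index = {cat: float(i) for i, cat in enumerate(categories)}
    codes = [math.nan if is_missing(v) else index[v] for v in column]
    return codes, categories


def impute_most_frequent(codes):
    """Replace NaN codes with the most frequent observed code."""
    observed = [c for c in codes if not math.isnan(c)]
    if not observed:
        # Nothing to learn a fill value from; return the codes unchanged.
        return list(codes)
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if math.isnan(c) else c for c in codes]


column = ["red", None, "blue", "red", math.nan]
codes, categories = encode_passthrough(column)
imputed = impute_most_frequent(codes)
print(categories)  # ['blue', 'red']
print(imputed)     # [1.0, 1.0, 0.0, 1.0, 1.0]
```

Because the imputer runs on the encoded column, no indicator columns ever reach the encoder in this ordering.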

@thomasjpfan
Member

I am closing this issue because this feature was added in #21988, which introduced encoded_missing_value to choose the encoding for missing values.
