8000 [MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 by maikia · Pull Request #13253 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 #13253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Mar 1, 2019

Conversation

maikia
Copy link
Contributor
@maikia maikia commented Feb 25, 2019

Issue: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147
if features are given as pandas dframe, each column keeps its datatype

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Closes #12147

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for me, a main question now is how to avoid that we validate X twice if it was already an array from the start. Maybe we can just don't do much validation in the first _check_X.

@@ -41,6 +41,9 @@ def _check_X(self, X):
not do that)

"""
if hasattr(X, 'iloc'):
# if pandas dataframes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comment here about why we skip the check for pandas? (because it will be done column by column later on).
Or maybe add it to the docstring above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment (as above)
and check if instead of pandas DataFrame Sequence is not passed
and a couple of tests

@glemaitre glemaitre self-requested a review February 26, 2019 17:13
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now reading Joris's comment, I'm not convinced by this approach. What's the justification for checking column by column (obviously we need asarray for each column of a datadframe)? Can we use pd.isna to check for NaNs?

@jorisvandenbossche
Copy link
Member

@jnothman can you give this a new look? It's now with the list of arrays idea (I am not yet fully sure it is better, but it works)

@jnothman
Copy link
Member
jnothman commented Feb 28, 2019 via email

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small nitpicks, looks good otherwise!

@jorisvandenbossche jorisvandenbossche changed the title ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 [MRG] ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 Feb 28, 2019
if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
X_temp = check_array(X, dtype=None)
if not hasattr(X, 'dtype') and\
np.issubdtype(X_temp.dtype, np.str_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you format this as:

if (not hasattr(X, 'dtype')
        and np.issubdtype(X_temp.dtype, np.str_)):

(avoiding the \)

Copy link
Member
@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the suggestions that I made to improve comments in the code (as this code is a bit surprising, without knowing the background).

Development

Successfully merging this pull request may close these issues.

ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array
5 participants
0