8000 [MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 by maikia · Pull Request #13253 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 #13253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Mar 1, 2019

Conversation

maikia
Copy link
Contributor
@maikia maikia commented Feb 25, 2019

Issue: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147
if features are given as pandas dframe, each column keeps its datatype

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Closes #12147

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for me, a main question now is how to avoid that we validate X twice if it was already an array from the start. Maybe we can just don't do much validation in the first _check_X.

@@ -41,6 +41,9 @@ def _check_X(self, X):
not do that)

"""
if hasattr(X, 'iloc'):
# if pandas dataframes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comment here about why we skip the check for pandas? (because it will be done column by column later on).
Or maybe add it to the docstring above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment (as above)
and check if instead of pandas DataFrame Sequence is not passed
and a couple of tests

@glemaitre glemaitre self-requested a review February 26, 2019 17:13
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now reading Joris's comment, I'm not convinced by this approach. What's the justification for checking column by column (obviously we need asarray for each column of a datadframe)? Can we use pd.isna to check for NaNs?

@jorisvandenbossche
Copy link
Member

@jnothman can you give this a new look? It's now with the list of arrays idea (I am not yet fully sure it is better, but it works)

@jnothman
Copy link
Member
jnothman commented Feb 28, 2019 via email

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small nitpicks, looks good otherwise!

@jorisvandenbossche jorisvandenbossche changed the title ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 [MRG] ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 Feb 28, 2019
if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
X_temp = check_array(X, dtype=None)
if not hasattr(X, 'dtype') and\
np.issubdtype(X_temp.dtype, np.str_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you format this as:

if (not hasattr(X, 'dtype')
        and np.issubdtype(X_temp.dtype, np.str_)):

(avoiding the \)

Copy link
Member
@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the suggestions that I made to improve comments in the code (as this code is a bit surprising, without knowing the background).

Copy link
Member
@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments. I think that we would need an entry in what's new.

if hasattr(X, 'iloc'):
# pandas dataframes
return X.iloc[:, key]
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove else and unindent the return statement below.

return X_columns, n_samples, n_features

def _get_feature(self, X, key):
if hasattr(X, 'iloc'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return X.iloc[:, key] if hasattr(X, 'iloc) else X[:, key]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly less lines, but not sure if it improves readability :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip it if you think it is best. I don't have strong opinion here.


return X
for i in range(n_features):
Xi = self._get_feature(X, key=i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to call it key and not feature_idx or feature_index?

@pytest.mark.parametrize("X", [
[1, 2],
np.array([3., 4.])
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pytest.mark.parametrize(""method", ['fit', 'fit_transform'])

[1, 2],
np.array([3., 4.])
])
def test_X_is_not_1D(X):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_X_is_not_1D(X):
def test_X_is_not_1D(X, method):


msg = ("Expected 2D array, got 1D array instead")
with pytest.raises(ValueError, match=msg):
oh.fit(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
oh.fit(X)
getattr(oh, method)(X)

msg = ("Expected 2D array, got 1D array instead")
with pytest.raises(ValueError, match=msg):
oh.fit(X)
with pytest.raises(ValueError, match=msg):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you can remove this part

oh.fit_transform(X)


def test_X_is_not_1D_pandas():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do the same parametrization than above.

@jorisvandenbossche jorisvandenbossche changed the title [MRG] ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 [MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 Feb 28, 2019
@jnothman jnothman dismissed their stale review February 28, 2019 17:18

Substantive subsequent changes

@jnothman
Copy link
Member
jnothman commented Mar 1, 2019

Please resolve conflicts with master, @maikia, then we can merge! Thanks.

@jorisvandenbossche jorisvandenbossche merged commit b2344a4 into scikit-learn:master Mar 1, 2019
@jorisvandenbossche
Copy link
Member

Thanks!

@maikia
Copy link
Contributor Author
maikia commented Mar 1, 2019

Thanks :-)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
C818 xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array
5 participants
0