-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 #13253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…into data_frames
…into data_frames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for me, a main question now is how to avoid that we validate X
twice if it was already an array from the start. Maybe we can just don't do much validation in the first _check_X
.
sklearn/preprocessing/_encoders.py
Outdated
@@ -41,6 +41,9 @@ def _check_X(self, X): | |||
not do that) | |||
|
|||
""" | |||
if hasattr(X, 'iloc'): | |||
# if pandas dataframes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comment here about why we skip the check for pandas? (because it will be done column by column later on).
Or maybe add it to the docstring above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment (as above)
and check if instead of pandas DataFrame Sequence is not passed
and a couple of tests
…into data_frames
…into data_frames
…st for test_X_is_not_1D_pandas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now reading Joris's comment, I'm not convinced by this approach. What's the justification for checking column by column (obviously we need asarray for each column of a datadframe)? Can we use pd.isna
to check for NaNs?
…ture by feature; validation is done on the very beginning for all data types this way.
…into data_frames
@jnothman can you give this a new look? It's now with the list of arrays idea (I am not yet fully sure it is better, but it works) |
Yes, let's not be too picky if X is a list... It is either interpreted as
numeric or as objects
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small nitpicks, looks good otherwise!
sklearn/preprocessing/_encoders.py
Outdated
if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2): | ||
X_temp = check_array(X, dtype=None) | ||
if not hasattr(X, 'dtype') and\ | ||
np.issubdtype(X_temp.dtype, np.str_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you format this as:
if (not hasattr(X, 'dtype')
and np.issubdtype(X_temp.dtype, np.str_)):
(avoiding the \
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the suggestions that I made to improve comments in the code (as this code is a bit surprising, without knowing the background).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of comments. I think that we would need an entry in what's new.
sklearn/preprocessing/_encoders.py
Outdated
if hasattr(X, 'iloc'): | ||
# pandas dataframes | ||
return X.iloc[:, key] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove else
and unindent the return
statement below.
return X_columns, n_samples, n_features | ||
|
||
def _get_feature(self, X, key): | ||
if hasattr(X, 'iloc'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return X.iloc[:, key] if hasattr(X, 'iloc) else X[:, key]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly less lines, but not sure if it improves readability :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skip it if you think it is best. I don't have strong opinion here.
sklearn/preprocessing/_encoders.py
Outdated
|
||
return X | ||
for i in range(n_features): | ||
Xi = self._get_feature(X, key=i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to call it key
and not feature_idx
or feature_index
?
@pytest.mark.parametrize("X", [ | ||
[1, 2], | ||
np.array([3., 4.]) | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pytest.mark.parametrize(""method", ['fit', 'fit_transform'])
[1, 2], | ||
np.array([3., 4.]) | ||
]) | ||
def test_X_is_not_1D(X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_X_is_not_1D(X): | |
def test_X_is_not_1D(X, method): |
|
||
msg = ("Expected 2D array, got 1D array instead") | ||
with pytest.raises(ValueError, match=msg): | ||
oh.fit(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh.fit(X) | |
getattr(oh, method)(X) |
msg = ("Expected 2D array, got 1D array instead") | ||
with pytest.raises(ValueError, match=msg): | ||
oh.fit(X) | ||
with pytest.raises(ValueError, match=msg): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you can remove this part
oh.fit_transform(X) | ||
|
||
|
||
def test_X_is_not_1D_pandas(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do the same parametrization than above.
Co-Authored-By: maikia <maja_ka@hotmail.com>
Co-Authored-By: maikia <maja_ka@hotmail.com>
…into data_frames
Please resolve conflicts with master, @maikia, then we can merge! Thanks. |
…into data_frames
Thanks! |
Thanks :-) |
…erting to array scikit-learn#12147 (scikit-learn#13253)" This reverts commit d94af6f.
…erting to array scikit-learn#12147 (scikit-learn#13253)" This reverts commit d94af6f.
Issue: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147
if features are given as pandas dframe, each column keeps its datatype
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Any other comments?
Closes #12147