[MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 #13253

maikia · 2019-02-25T15:12:08Z

Issue: ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147
If features are given as a pandas DataFrame, each column keeps its dtype.
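
As a rough illustration of the motivation (not code from this PR), the sketch below contrasts converting a whole DataFrame to one array with converting it column by column. The frame contents and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"age": [20, 30, 40], "city": ["Paris", "Oslo", "Paris"]})

# Converting the whole frame at once forces a single common dtype (object here).
as_array = np.asarray(df)

# Converting column by column lets every feature keep its own dtype.
columns = [np.asarray(df.iloc[:, i]) for i in range(df.shape[1])]

print(as_array.dtype)      # dtype shared by all columns
print(columns[0].dtype)    # dtype of the integer column alone
```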

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Closes #12147

jorisvandenbossche

So for me, a main question now is how to avoid validating X twice if it was already an array from the start. Maybe we can just skip most of the validation in the first _check_X.

jorisvandenbossche · 2019-02-25T16:29:00Z

sklearn/preprocessing/_encoders.py

@@ -41,6 +41,9 @@ def _check_X(self, X):
          not do that)

        """
+        if hasattr(X, 'iloc'):
+            # if pandas dataframes


Can you add some comment here about why we skip the check for pandas? (because it will be done column by column later on).
Or maybe add it to the docstring above.

I added a comment (as above), a check that a Sequence is not passed instead of a pandas DataFrame, and a couple of tests.

jnothman

Nitpick

sklearn/preprocessing/tests/test_encoders.py

jnothman

Actually, now reading Joris's comment, I'm not convinced by this approach. What's the justification for checking column by column (obviously we need asarray for each column of a dataframe)? Can we use pd.isna to check for NaNs?
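
To make the `pd.isna` question concrete, here is a minimal sketch of checking a mixed-dtype frame for missing values without first converting it to a single array. It only illustrates the suggestion; it is not necessarily what the PR ended up doing, and the frame and variable names are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a float column and an object column.
df = pd.DataFrame({"num": [1.0, np.nan, 3.0], "cat": ["a", "b", None]})

# pd.isna works elementwise on the frame and returns a boolean frame.
missing_mask = pd.isna(df)

# Any missing value anywhere in the frame?
has_missing = bool(missing_mask.to_numpy().any())

# Which columns contain at least one missing value?
missing_per_column = missing_mask.any(axis=0)
```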

sklearn/preprocessing/_encoders.py

jorisvandenbossche · 2019-02-28T06:35:48Z

@jnothman can you give this a new look? It's now with the list of arrays idea (I am not yet fully sure it is better, but it works)

jnothman · 2019-02-28T09:00:13Z

Yes, let's not be too picky if X is a list... It is either interpreted as numeric or as objects
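
The "list of arrays" idea discussed here could look roughly like the sketch below. It assumes a hypothetical helper name, `split_into_columns`, and uses plain `np.asarray` in place of scikit-learn's `check_array`, so it is an approximation of the approach rather than the merged implementation.

```python
import numpy as np
import pandas as pd


def split_into_columns(X):
    """Hypothetical helper: return (list of 1D arrays, n_samples, n_features)."""
    # Only a 2D object with .iloc is treated as a DataFrame.
    is_frame = hasattr(X, "iloc") and getattr(X, "ndim", 0) == 2
    if not is_frame:
        X_arr = np.asarray(X)
        # A plain list containing strings becomes a fixed-width string array;
        # convert it to object so it is treated as objects, not as numbers.
        if not hasattr(X, "dtype") and np.issubdtype(X_arr.dtype, np.str_):
            X_arr = np.asarray(X, dtype=object)
        if X_arr.ndim != 2:
            raise ValueError(
                "Expected 2D array, got %dD array instead" % X_arr.ndim
            )
        X = X_arr
    n_samples, n_features = X.shape
    # DataFrame columns are converted one by one, so each keeps its dtype.
    columns = [
        np.asarray(X.iloc[:, i]) if is_frame else X[:, i]
        for i in range(n_features)
    ]
    return columns, n_samples, n_features
```

With this shape, a list is interpreted either as numeric or as objects, while a DataFrame is never collapsed into one array.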

jorisvandenbossche

A few small nitpicks, looks good otherwise!

sklearn/preprocessing/_encoders.py

jorisvandenbossche · 2019-02-28T13:19:22Z

sklearn/preprocessing/_encoders.py

+        if not (hasattr(X, 'iloc') and getattr(X, 'ndim', 0) == 2):
+            X_temp = check_array(X, dtype=None)
+            if not hasattr(X, 'dtype') and\
+               np.issubdtype(X_temp.dtype, np.str_):


can you format this as:

if (not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_)):

(avoiding the \)

sklearn/preprocessing/_encoders.py

GaelVaroquaux

LGTM with the suggestions that I made to improve comments in the code (as this code is a bit surprising, without knowing the background).

glemaitre

Couple of comments. I think that we would need an entry in what's new.

glemaitre · 2019-02-26T17:21:30Z

sklearn/preprocessing/_encoders.py

+        if hasattr(X, 'iloc'):
+            # pandas dataframes
+            return X.iloc[:, key]
+        else:


you can remove else and unindent the return statement below.

glemaitre · 2019-02-28T14:18:07Z

sklearn/preprocessing/_encoders.py

+        return X_columns, n_samples, n_features
+
+    def _get_feature(self, X, key):
+        if hasattr(X, 'iloc'):


return X.iloc[:, key] if hasattr(X, 'iloc') else X[:, key]

It's certainly fewer lines, but not sure if it improves readability :-)

Skip it if you think it is best. I don't have strong opinion here.

glemaitre · 2019-02-28T15:01:15Z

sklearn/preprocessing/_encoders.py


-        return X
+        for i in range(n_features):
+            Xi = self._get_feature(X, key=i)


Is there a reason to call it key and not feature_idx or feature_index?
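
Taking the review suggestions on `_get_feature` together (no `else` branch, a more descriptive argument name), the helper could be sketched as a standalone function like this. The function name `get_feature` and the sample data are illustrative only.

```python
import numpy as np
import pandas as pd


def get_feature(X, feature_idx):
    """Select one column by position from a DataFrame or a 2D array."""
    if hasattr(X, "iloc"):
        # pandas DataFrame: positional column selection
        return X.iloc[:, feature_idx]
    # numpy array (or anything else supporting 2D indexing)
    return X[:, feature_idx]


frame = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
array = np.array([[1, 2], [3, 4]])

second_frame_column = get_feature(frame, 1)   # pandas Series for column "b"
first_array_column = get_feature(array, 0)    # 1D numpy array
```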

glemaitre · 2019-02-28T15:17:47Z

sklearn/preprocessing/tests/test_encoders.py

+@pytest.mark.parametrize("X", [
+    [1, 2],
+    np.array([3., 4.])
+    ])


@pytest.mark.parametrize("method", ['fit', 'fit_transform'])

glemaitre · 2019-02-28T15:17:56Z

sklearn/preprocessing/tests/test_encoders.py

+    [1, 2],
+    np.array([3., 4.])
+    ])
+def test_X_is_not_1D(X):


Suggested change

def test_X_is_not_1D(X):

def test_X_is_not_1D(X, method):

glemaitre · 2019-02-28T15:18:18Z

sklearn/preprocessing/tests/test_encoders.py

+
+    msg = ("Expected 2D array, got 1D array instead")
+    with pytest.raises(ValueError, match=msg):
+        oh.fit(X)


Suggested change

oh.fit(X)

getattr(oh, method)(X)

glemaitre · 2019-02-28T15:18:28Z

sklearn/preprocessing/tests/test_encoders.py

+    msg = ("Expected 2D array, got 1D array instead")
+    with pytest.raises(ValueError, match=msg):
+        oh.fit(X)
+    with pytest.raises(ValueError, match=msg):


Then you can remove this part

glemaitre · 2019-02-28T15:19:01Z

sklearn/preprocessing/tests/test_encoders.py

+        oh.fit_transform(X)
+
+
+def test_X_is_not_1D_pandas():


I would use the same parametrization as above.
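
Applying the suggestions from this thread together, the 1D-input test might end up looking like the sketch below. It assumes scikit-learn and pytest are installed; the encoder construction line is not shown in the diff hunks above and is filled in as an assumption.

```python
import numpy as np
import pytest
from sklearn.preprocessing import OneHotEncoder


@pytest.mark.parametrize("X", [
    [1, 2],
    np.array([3., 4.]),
])
@pytest.mark.parametrize("method", ["fit", "fit_transform"])
def test_X_is_not_1D(X, method):
    # Assumed setup: a default encoder, as in the surrounding test module.
    oh = OneHotEncoder()

    msg = "Expected 2D array, got 1D array instead"
    with pytest.raises(ValueError, match=msg):
        # Call either fit or fit_transform, depending on the parameter.
        getattr(oh, method)(X)
```

Parametrizing over `method` replaces the two near-identical `pytest.raises` blocks with a single one.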

Substantive subsequent changes

jnothman · 2019-03-01T06:08:37Z

Please resolve conflicts with master, @maikia, then we can merge! Thanks.

jorisvandenbossche · 2019-03-01T10:03:04Z

Thanks!

maikia · 2019-03-01T10:30:10Z

Thanks :-)

maikia added 6 commits February 25, 2019 12:09

test for checking the type of the features from OneHotEncoders

55b1374

keeps the dtype of every column (encoders)

a7b21db

some PEP8 violations corrected

638e2c3

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

0cbbba7

…into data_frames

keeping the same column dtype in _transform as well

622bc3a

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

21cba20

…into data_frames

jorisvandenbossche reviewed Feb 25, 2019

maikia mentioned this pull request Feb 26, 2019

Clean-up warnings from tests #13262

Closed

maikia added 9 commits February 26, 2019 10:31

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

68025e9

…into data_frames

corrected PEP8

12f8837

comment added on why we dont check the full pandas dataframe

6828fa3

pass only pandas dataframe, make sure it is not a sequence by checkin…

fd79f0b

…g the dimensions

tests to make sure X is 2D

482f132

PEP8

a65ab6d

PEP8

e278e24

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

421f8da

…into data_frames

removed checking for nans multiple times, corrected comment, added te…

85039f6

…st for test_X_is_not_1D_pandas

glemaitre self-requested a review February 26, 2019 17:13

jnothman approved these changes Feb 26, 2019

sklearn/preprocessing/tests/test_encoders.py

jnothman previously requested changes Feb 26, 2019

jnothman reviewed Feb 26, 2019

sklearn/preprocessing/_encoders.py Outdated

jnothman reviewed Feb 26, 2019

sklearn/preprocessing/_encoders.py Outdated

maikia added 4 commits February 27, 2019 14:19

removed _check_X_feature(), now _check_X also does the validation fea…

d41ec92

…ture by feature; validation is done on the very beginning for all data types this way.

pep8

959ff84

better heading for _check_X()

233374c

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

f82efea

…into data_frames

jorisvandenbossche reviewed Feb 28, 2019

few comments added

f39eee9

jorisvandenbossche changed the title ~~ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147~~ [MRG] ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147 Feb 28, 2019

jorisvandenbossche reviewed Feb 28, 2019

GaelVaroquaux reviewed Feb 28, 2019

sklearn/preprocessing/_encoders.py Outdated

GaelVaroquaux reviewed Feb 28, 2019

sklearn/preprocessing/_encoders.py Outdated

GaelVaroquaux approved these changes Feb 28, 2019

glemaitre requested changes Feb 28, 2019

GaelVaroquaux and others added 5 commits February 28, 2019 16:59

Update sklearn/preprocessing/_encoders.py

418b2bd

Co-Authored-By: maikia <maja_ka@hotmail.com>

Update sklearn/preprocessing/_encoders.py

b86edcb

Co-Authored-By: maikia <maja_ka@hotmail.com>

added some comments and cleaned up the code

3aba39f

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

cedc230

…into data_frames

final edits

451d27e

jorisvandenbossche approved these changes Feb 28, 2019

jorisvandenbossche changed the title ~~[MRG] ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147~~ [MRG] ENH: process DataFrames in OneHot/OrdinalEncoder without converting to array #12147 Feb 28, 2019

glemaitre approved these changes Feb 28, 2019

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

041cb29

…into data_frames

jorisvandenbossche merged commit b2344a4 into scikit-learn:master Mar 1, 2019

jorisvandenbossche mentioned this pull request Mar 1, 2019

[MRG] CLN: remove duplicate validation of X in Encoders transform #13347

Merged

jorisvandenbossche mentioned this pull request Mar 29, 2019

[WIP] Handle missing values in OrdinalEncoder #12045

Closed

jnothman mentioned this pull request Apr 24, 2019

DOC what's new cleaning #13706

Merged

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

ENH: process DataFrames in OneHot/OrdinalEncoder without converting t…

d94af6f

…o array scikit-learn#12147 (scikit-learn#13253)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "ENH: process DataFrames in OneHot/OrdinalEncoder without conv…

cb71708

…erting to array scikit-learn#12147 (scikit-learn#13253)" This reverts commit d94af6f.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "ENH: process DataFrames in OneHot/OrdinalEncoder without conv…

61105c9

…erting to array scikit-learn#12147 (scikit-learn#13253)" This reverts commit d94af6f.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

ENH: process DataFrames in OneHot/OrdinalEncoder without converting t…

1a259da

…o array scikit-learn#12147 (scikit-learn#13253)

amueller mentioned this pull request Aug 6, 2019

Pandas DataFrame Categories supported by OneHotEncoder #13351

Closed

4 tasks

This was referenced Oct 28, 2019

Handle pd.Categorical in encoders #14953

Open

ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147

Closed
