StratifiedKFold doesnt work with bytes classes · Issue #16980 · scikit-learn/scikit-learn · GitHub

StratifiedKFold doesnt work with bytes classes #16980

Closed
pplonski opened this issue Apr 21, 2020 · 4 comments · Fixed by #18555

Comments

@pplonski
Contributor
pplonski commented Apr 21, 2020

Describe the bug

I'm using this dataset https://www.openml.org/d/1515 where classes (20 in total) are represented as bytes. When trying to run cross-validation I got the error message:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

I think this message is misleading, because there is a single label for each sample.

Code to Reproduce

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
# generate data
nrows = 100
ncols = 3
X = np.random.rand(nrows,ncols)
y = pd.Series([b'a', b'b']*int(nrows/2))
# show y
print(y)
print(type(y[0]))
# split data
skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    print(train, test)

I got the error:

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/multiclass.py in type_of_target(y)
    258         if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
    259                 and not isinstance(y[0], str)):
--> 260             raise ValueError('You appear to be using a legacy multi-label data'
    261                              ' representation. Sequence of sequences are no'
    262                              ' longer supported; use a binary array or sparse'

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
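One plausible reading of the traceback, sketched below: `bytes` objects count as `collections.abc.Sequence`, so the condition shown at line 258 treats every bytes label as a nested sequence, while `str` is excluded explicitly. The helper name `looks_like_legacy_multilabel` is hypothetical; the condition itself is copied from the traceback above.

```python
from collections.abc import Sequence


def looks_like_legacy_multilabel(first):
    """Hypothetical helper mirroring the condition from the traceback."""
    return (not hasattr(first, '__array__')
            and isinstance(first, Sequence)
            and not isinstance(first, str))


bytes_flag = looks_like_legacy_multilabel(b'a')   # bytes is a Sequence
str_flag = looks_like_legacy_multilabel('a')      # str is excluded explicitly
list_flag = looks_like_legacy_multilabel([1, 2])  # a genuine nested sequence
print(bytes_flag, str_flag, list_flag)
```

If this reading is right, a bytes label and a real sequence-of-sequences target are indistinguishable to this check, which would explain the misleading message.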

When I change the y to str it works as expected:

y = y.astype(str)
skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    print(train, test)

prints:

[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49] [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99]
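A side note on the workaround, as a minimal sketch: Python's `str()` applied to a bytes object keeps the `b'...'` wrapper, so a plain cast to str most likely leaves labels such as `"b'a'"` (assuming `astype(str)` applies `str()` element-wise). Decoding gives clean labels instead. The helper `decode_labels` is hypothetical and not part of scikit-learn or pandas.

```python
def decode_labels(labels, encoding="utf-8"):
    """Decode bytes labels to str; leave any other label untouched."""
    return [lab.decode(encoding) if isinstance(lab, bytes) else lab
            for lab in labels]


raw = [b'a', b'b'] * 3
cast = [str(lab) for lab in raw]  # what a plain str() cast produces
decoded = decode_labels(raw)      # clean string labels
print(cast[:2])
print(decoded[:2])
```

Either form should be accepted by `StratifiedKFold`, since both are plain strings; the decoded form is just easier to read downstream. With a pandas Series, `y.str.decode("utf-8")` should give the same result.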

Versions

scikit-learn 0.22.2

@jnothman
Member
jnothman commented Apr 22, 2020 via email

@cmarmo
Contributor
cmarmo commented Sep 25, 2020

The error is thrown when synthetic data are used too. Could we agree to label this as a bug? Pull requests modifying the error message will be very welcome.

@cozek
Contributor
cozek commented Oct 1, 2020

The error is thrown when synthetic data are used too. Could we agree to label this as a bug? Pull requests modifying the error message will be very welcome.

I want to take this issue. @cmarmo

I think we can add a new check to catch the bytes type, because @jnothman mentioned that bytes is not a valid target type and was possibly never supported. We can also modify the check that catches the old sequence-of-sequences format by adding an is_multilabel(y) condition, so that it only fires when the passed labels are actually multilabel.

    # catch bytes type
    if isinstance(y[0], bytes):
        raise ValueError(f'Array-like sequences of {type(y[0])} are unsupported.'
                         ' Consider casting byte data to a supported format like'
                         ' int or str')
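As a standalone illustration of the proposed bytes check (a sketch only, not the code that was eventually merged): the function name `check_no_bytes_targets` and the exact message wording are hypothetical.

```python
def check_no_bytes_targets(y):
    """Raise a descriptive error if the first target is a bytes object."""
    if len(y) and isinstance(y[0], bytes):
        raise ValueError(
            f"Targets of type {type(y[0]).__name__} are not supported. "
            "Consider converting the labels to int or str first."
        )


message = None
check_no_bytes_targets(['a', 'b'])  # str labels pass silently
try:
    check_no_bytes_targets([b'a', b'b'])
except ValueError as exc:
    message = str(exc)
    print(message)
```

Checking only the first element follows the snippet above; a mixed target with bytes further down would not be caught by this sketch.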



    # The old sequence of sequences format
    try:
        if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                and not isinstance(y[0], str) and is_multilabel(y)):
            raise ValueError('You might be using a legacy multi-label data'
                             ' representation. Sequence of sequences are no'
                             ' longer supported; use a binary array or sparse'
                             ' matrix instead - the MultiLabelBinarizer'
                             ' transformer can convert to this format.')
    except IndexError:
        pass

An alternative would be to move the check for the old sequence-of-sequences format inside the if is_multilabel(y) block, as follows, since calling is_multilabel a second time is a bit redundant. With this change, the bytes type is reported as unknown.

    if is_multilabel(y):
        # The old sequence of sequences format
        try:
            if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                    and not isinstance(y[0], str)):
                raise ValueError('You might be using a legacy multi-label data'
                                 ' representation. Sequence of sequences are no'
                                 ' longer supported; use a binary array or sparse'
                                 ' matrix instead - the MultiLabelBinarizer'
                                 ' transformer can convert to this format.')
        except IndexError:
            pass
        return 'multilabel-indicator'

Also, I think that simply saying unknown is very unhelpful. In addition to fixing the bytes problem, we could replace unknown with an f-string that reports the offending type, as follows:

    # Invalid inputs
    if y.ndim > 2 or (y.dtype == object and len(y) and
                      not isinstance(y.flat[0], str)):
        return f'{type(y[0])}'  # [[[1, 2]]] or [obj_1] and not ["label_1"]

    if y.ndim == 2 and y.shape[1] == 0:
        return f'{type(y[0])}' # [[]]
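For reference, a small sketch of what such an f-string would evaluate to for a bytes label. The variable names are illustrative only. Any caller that compares the result against the literal string 'unknown' would presumably need to adapt; that is an assumption about downstream code, not something stated in this thread.

```python
label = b'a'

full = f'{type(label)}'        # form used in the snippet above
short = type(label).__name__   # a shorter alternative
print(full)
print(short)
```

The first form includes the `<class '...'>` wrapper, while `__name__` yields just `bytes`, which may read better inside an error message.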

Please suggest how I should proceed.
Thanks!

@cmarmo
Contributor
cmarmo commented Oct 1, 2020

I want to take this issue. @cmarmo

Sure! Please go on!

Please suggest how I should proceed.
Thanks!

I'm probably not the right person to advise about that... I suggest you open a pull request with your preferred option and an explicit error message, and let the core devs have a look at it. Thanks!
