StratifiedKFold doesnt work with bytes classes #16980

pplonski · 2020-04-21T07:44:37Z

Describe the bug

I'm using this dataset https://www.openml.org/d/1515 where classes (20 in total) are represented as bytes. When trying to run cross-validation I got the error message:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

I think this message is misleading, because there is a single label for each sample.

Code to Reproduce

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
# generate data
nrows = 100
ncols = 3
X = np.random.rand(nrows,ncols)
y = pd.Series([b'a', b'b']*int(nrows/2))
# show y
print(y)
print(type(y[0]))
# split data
skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    print(train, test)

I got the error:

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/multiclass.py in type_of_target(y)
    258         if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
    259                 and not isinstance(y[0], str)):
--> 260             raise ValueError('You appear to be using a legacy multi-label data'
    261                              ' representation. Sequence of sequences are no'
    262                              ' longer supported; use a binary array or sparse'

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
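
The condition visible in the traceback suggests why bytes labels end up on this code path: a bytes object is a Sequence but not a str. Below is a minimal sketch that reproduces only that condition outside scikit-learn. The function name is hypothetical and chosen for illustration.

```python
from collections.abc import Sequence


def triggers_legacy_multilabel_check(first_label):
    """Mirror the condition shown in the traceback for a single label."""
    return (not hasattr(first_label, '__array__')
            and isinstance(first_label, Sequence)
            and not isinstance(first_label, str))


print(triggers_legacy_multilabel_check(b'a'))   # bytes label
print(triggers_legacy_multilabel_check('a'))    # str label
print(triggers_legacy_multilabel_check(['a']))  # element of a real sequence of sequences
```

The bytes and list inputs satisfy the condition, while the str input does not, which matches the behaviour reported above.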

When I change the y to str it works as expected:

y = y.astype(str)
skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    print(train, test)

prints:

[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99] [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49] [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99]
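
If the labels arrive as bytes, one option is to decode them explicitly before splitting. This is a small sketch, assuming the labels are UTF-8 encoded; decode_labels is a hypothetical helper, not a scikit-learn or pandas function.

```python
def decode_labels(labels, encoding="utf-8"):
    """Return a list in which every bytes label is decoded to str.

    Labels that are not bytes are passed through unchanged.
    """
    return [label.decode(encoding) if isinstance(label, bytes) else label
            for label in labels]


y_bytes = [b'a', b'b'] * 50
y_text = decode_labels(y_bytes)
print(y_text[:4])
```

The decoded list can then be passed as y to skf.split(X, y).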

Versions

scikit-learn 0.22.2


jnothman · 2020-04-22T01:52:55Z

I agree that the error message is not as helpful as it could be, but I don't think bytes is the right format for that target and I couldn't promise that it was supported for y anywhere else in the library either. How did you load the data? with fetch_openml?

cmarmo · 2020-09-25T10:04:42Z

The error is thrown when synthetic data are used too. Could we agree to label this as a bug? Pull requests modifying the error message will be very welcome.

cozek · 2020-10-01T11:50:18Z

The error is thrown when synthetic data are used too. Could we agree to label this as a bug? Pull requests modifying the error message will be very welcome.

I want to take this issue. @cmarmo

I think we can add a new check to catch the bytes type, since @jnothman mentioned that bytes is not the right format for the target and may never have been supported. We can also modify the check for the old sequence-of-sequences format by adding an is_multilabel(y) condition, so that the error is raised only when the passed labels are actually multilabel.

    # catch bytes type
    if isinstance(y[0], bytes):
        raise ValueError(f'Array-like sequences of {type(y[0])} are unsupported.'
                         ' Consider casting byte data to a supported format like'
                         ' int or str.')



    # The old sequence of sequences format
    try:
        if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                and not isinstance(y[0], str) and is_multilabel(y)):
            raise ValueError('You might be using a legacy multi-label data'
                             ' representation. Sequence of sequences are no'
                             ' longer supported; use a binary array or sparse'
                             ' matrix instead - the MultiLabelBinarizer'
                             ' transformer can convert to this format.')
    except IndexError:
        pass
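
To see the effect of the proposed bytes guard in isolation, here is a self-contained sketch. The name check_label_type is hypothetical and not part of scikit-learn, and the message wording is only illustrative.

```python
def check_label_type(y):
    """Raise ValueError if the first label is a bytes object; otherwise return y."""
    if len(y) > 0 and isinstance(y[0], bytes):
        raise ValueError(
            f"Labels of type {type(y[0]).__name__} are unsupported. "
            "Consider casting byte data to a supported format like int or str."
        )
    return y


try:
    check_label_type([b'a', b'b'])
except ValueError as exc:
    print(exc)
```

With such a guard, bytes labels would fail early with a message that names the offending type instead of pointing at the legacy multi-label format.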

Another alternative would be to move the check for the old sequence format inside the if is_multilabel(y) block, as in the following, since calling is_multilabel a second time is redundant. With this change, the bytes type is reported as unknown.

    if is_multilabel(y):
        # The old sequence of sequences format
        try:
            if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                    and not isinstance(y[0], str)):
                raise ValueError('You might be using a legacy multi-label data'
                                 ' representation. Sequence of sequences are no'
                                 ' longer supported; use a binary array or sparse'
                                 ' matrix instead - the MultiLabelBinarizer'
                                 ' transformer can convert to this format.')
        except IndexError:
            pass
        return 'multilabel-indicator'

Also, I think that simply saying unknown is very unhelpful. So, in addition to fixing the bytes problem, we could replace unknown with an f-string that reports the offending type, as follows:

    # Invalid inputs
    if y.ndim > 2 or (y.dtype == object and len(y) and
                      not isinstance(y.flat[0], str)):
        return f'{type(y[0])}'  # [[[1, 2]]] or [obj_1] and not ["label_1"]

    if y.ndim == 2 and y.shape[1] == 0:
        return f'{type(y[0])}' # [[]]

Please suggest how I should proceed.
Thanks!

cmarmo · 2020-10-01T14:48:29Z

I want to take this issue. @cmarmo

Sure! Please go on!

Please suggest how I should proceed.
Thanks!

I'm probably not the right person to advise on that... I suggest you open a pull request with your preferred option and an explicit error message, and let the core devs have a look at it. Thanks!

pplonski added the Bug: triage label Apr 21, 2020

pplonski mentioned this issue Apr 21, 2020

Doesnt work with bytes as class mljar/mljar-supervised#66

Closed

cmarmo added the module:model_selection label Apr 30, 2020

cmarmo added Bug help wanted and removed Bug: triage labels Sep 25, 2020

cozek mentioned this issue Oct 7, 2020

API Deprecate using labels in bytes format #18555

Merged

cmarmo removed the help wanted label Oct 7, 2020

cmarmo added the Needs Decision Requires decision label Dec 16, 2021

adrinjalali closed this as completed in #18555 Apr 16, 2024
