StratifiedKFold doesn't work with bytes classes #16980
Comments
I agree that the error message is not as helpful as it could be, but I
don't think bytes is the right format for that target and I couldn't
promise that it was supported for y anywhere else in the library either.
How did you load the data? with fetch_openml?
The error is thrown when synthetic data are used too. Could we agree to label this as a bug? Pull requests modifying the error message will be very welcome.
I want to take this issue. @cmarmo I think we can add a new check to catch the bytes type:

    # catch bytes type
    if isinstance(y[0], bytes):
        raise ValueError(f'Array-like sequences of {type(y[0])} are unsupported.'
                         ' Consider casting byte data to a supported format like'
                         ' int or str')
    # The old sequence of sequences format
    try:
        if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                and not isinstance(y[0], str) and is_multilabel(y)):
            raise ValueError('You might be using a legacy multi-label data'
                             ' representation. Sequence of sequences are no'
                             ' longer supported; use a binary array or sparse'
                             ' matrix instead - the MultiLabelBinarizer'
                             ' transformer can convert to this format.')
    except IndexError:
        pass

Another alternative would be to include the check for the old sequence format inside the is_multilabel branch:

    if is_multilabel(y):
        # The old sequence of sequences format
        try:
            if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                    and not isinstance(y[0], str)):
                raise ValueError('You might be using a legacy multi-label data'
                                 ' representation. Sequence of sequences are no'
                                 ' longer supported; use a binary array or sparse'
                                 ' matrix instead - the MultiLabelBinarizer'
                                 ' transformer can convert to this format.')
        except IndexError:
            pass
        return 'multilabel-indicator'

Also, I think that simply saying 'unknown' for invalid inputs is not very helpful; we could return the type instead:

    # Invalid inputs
    if y.ndim > 2 or (y.dtype == object and len(y) and
                      not isinstance(y.flat[0], str)):
        return f'{type(y[0])}'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    if y.ndim == 2 and y.shape[1] == 0:
        return f'{type(y[0])}'  # [[]]

Please suggest how I should proceed.
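For readers following the proposal, the bytes check can be tried on its own outside the library. The following is only a minimal sketch: `check_label_type` is a hypothetical helper name, not a scikit-learn function, and the error wording is an assumption modelled on the message suggested above.

```python
def check_label_type(y):
    """Raise a descriptive error if the first label is a bytes object.

    Hypothetical stand-alone helper (not part of scikit-learn) that mirrors
    the isinstance(y[0], bytes) check proposed in the comment above.
    """
    # Only inspect the first element, and only if there is one.
    if len(y) and isinstance(y[0], (bytes, bytearray)):
        raise ValueError(
            f"Labels of type {type(y[0]).__name__} are unsupported. "
            "Consider casting byte data to a supported format like int or str."
        )
    return y


# Bytes labels trigger the explicit error instead of a multi-label warning.
try:
    check_label_type([b"a", b"b", b"a"])
    message = None
except ValueError as exc:
    message = str(exc)

# str labels pass through unchanged.
checked = check_label_type(["a", "b", "a"])
```

The point of the sketch is that the user sees the offending type named directly, rather than a message about legacy multi-label formats.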
Sure! Please go ahead!
I'm probably not the right person to advise on that... I suggest you open a pull request with your preferred option and an explicit error message, and let the core devs have a look at it. Thanks!
Describe the bug
I'm using this dataset https://www.openml.org/d/1515 where classes (20 in total) are represented as bytes. When trying to run cross-validation I got the error message:
I think this message is misleading, because there is a single label for each sample.
Code to Reproduce
I got the error:
When I change y to str, it works as expected and prints:
[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
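The workaround of converting y to str can be sketched as below. This is an illustration only: `labels_to_str` is a hypothetical helper, the sample labels are made up, and the original reproduction code is not shown here, so the exact conversion used in the report may have differed.

```python
def labels_to_str(y, encoding="utf-8"):
    """Return the labels as a list of str, decoding any bytes entries.

    Hypothetical helper for illustration. Note that calling str() on a bytes
    object gives its repr (for example "b'1'"), so bytes are decoded instead.
    """
    return [
        label.decode(encoding) if isinstance(label, (bytes, bytearray)) else str(label)
        for label in y
    ]


# Made-up labels standing in for the bytes classes of the dataset.
y_bytes = [b"1", b"2", b"1", b"2"]
y_str = labels_to_str(y_bytes)
```

The resulting list of str labels is what would then be passed as y to the cross-validation splitter in place of the bytes labels.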
Versions
scikit-learn 0.22.2