Support nullable pandas dtypes in LabelBinarizer #25637

tamargrey · 2023-02-17T20:04:46Z

Describe the workflow you want to enable

I would like to be able to pass the nullable pandas dtypes ("Int64", "Float64", "boolean") into sklearn's LabelBinarizer. Because these dtypes become object dtype when converted to numpy arrays, we get the error "ValueError: Unknown label type:".

Repro with sklearn 1.2.1:

    import pandas as pd
    import pytest
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["Int64", "Float64", "boolean"]:

        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        with pytest.raises(ValueError, match="Unknown label type:"):
            lb.fit(y_true.unique())

Describe your proposed solution

We should get the same behavior as when int64, float64, and bool dtypes are used, which is no error:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["int64", "float64", "bool"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        lb.fit(y_true.unique())
        y_one_hot_true = lb.transform(y_true)

Describe alternatives you've considered, if relevant

Our current workaround is to convert the data to numpy arrays with the corresponding dtype that works, prior to passing it into LabelBinarizer.
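A minimal sketch of that workaround, assuming the data contains no missing values. The names `NUMPY_EQUIVALENT` and `to_plain_numpy` are hypothetical helpers for illustration, not part of pandas or sklearn:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical mapping (illustration only): nullable pandas dtype -> plain NumPy dtype.
NUMPY_EQUIVALENT = {"Int64": "int64", "Float64": "float64", "boolean": "bool"}


def to_plain_numpy(y):
    """Convert a nullable-dtype Series to a NumPy array of the matching plain dtype.

    Assumes the Series has no missing values; pd.NA cannot be represented
    in int64 or bool arrays, so the conversion would raise in that case.
    """
    target = NUMPY_EQUIVALENT[str(y.dtype)]
    return y.to_numpy(dtype=target)


results = {}
for dtype in NUMPY_EQUIVALENT:
    y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)
    y_np = to_plain_numpy(y_true)

    lb = LabelBinarizer()
    # Fit on the unique values of the converted array rather than on
    # y_true.unique(), which would still be a pandas extension array.
    lb.fit(np.unique(y_np))
    results[dtype] = lb.transform(y_np)
```

The conversion happens before both `fit` and `transform`, so LabelBinarizer only ever sees plain numpy dtypes.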

Additional context

No response

thomasjpfan · 2023-02-17T22:21:38Z

As noted in #25634 (comment), I opened #25638 to resolve this issue.

tamargrey · 2023-03-10T15:47:05Z

@thomasjpfan I'm still seeing the same errors from my reproduction above with sklearn 1.2.2 which has the code from #25638.

It's odd, because the error is coming from unique_labels, which was the subject of the related issue #25634 (comment) and is, indeed, fixed.

thomasjpfan · 2023-03-10T16:22:59Z

~~Even tho #25638 was placed on the 1.2.2 milestone, it was not cherry picked onto 1.2.2. If we have a 1.2.3 release, we can include it there.~~ It's in the release, but the issue still remains.

With 1.2.2, passing in y_true directly works:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["Int64", "Float64", "boolean"]:

        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        lb.fit(y_true)

but calling y_true.unique() does not right now. So I'll reopen this issue.

thomasjpfan · 2023-03-10T17:17:48Z

I opened #25813 to fix the issue.

tamargrey added Needs Triage and New Feature labels Feb 17, 2023

tamargrey mentioned this issue Feb 17, 2023

roc_curve: remove nullable type handling when possible alteryx/evalml#4021

Closed

thomasjpfan mentioned this issue Feb 17, 2023

ENH Allows target to be pandas nullable dtypes #25638

Merged

thomasjpfan added Pandas compatibility and removed Needs Triage labels Feb 17, 2023

lorentzenchr closed this as completed in #25638 Feb 23, 2023

thomasjpfan reopened this Mar 10, 2023

thomasjpfan mentioned this issue Mar 10, 2023

FIX Fixes pandas extension arrays in check_array #25813

Merged

jeremiedbb closed this as completed in #25813 Mar 22, 2023
