Support nullable pandas dtypes in LabelBinarizer · Issue #25637 · scikit-learn/scikit-learn · GitHub

Support nullable pandas dtypes in LabelBinarizer #25637

Closed
tamargrey opened this issue Feb 17, 2023 · 4 comments · Fixed by #25638 or #25813

Comments

@tamargrey

Describe the workflow you want to enable

I would like to be able to pass the nullable pandas dtypes ("Int64", "Float64", "boolean") into sklearn's LabelBinarizer. Because these dtypes become object dtype when converted to numpy arrays, we get a ValueError with the message "Unknown label type:".

Repro with sklearn 1.2.1:

    import pandas as pd
    import pytest
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["Int64", "Float64", "boolean"]:

        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        with pytest.raises(ValueError, match="Unknown label type:"):
            lb.fit(y_true.unique())

Describe your proposed solution

We should get the same behavior as when int64, float64, and bool dtypes are used, which is no error:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["int64", "float64", "bool"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        lb.fit(y_true.unique())
        y_one_hot_true = lb.transform(y_true)

Describe alternatives you've considered, if relevant

Our current workaround is to convert the data to numpy arrays with the corresponding non-nullable dtype (int64, float64, or bool) before passing it into LabelBinarizer.
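
For anyone landing here, that workaround could look roughly like the sketch below. It is only an illustration under stated assumptions: the `NULLABLE_TO_NUMPY` table and the `to_numpy_labels` helper are hypothetical names made up for this example, not part of pandas or scikit-learn, and the sketch assumes the labels contain no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table for this sketch (not part of pandas or scikit-learn):
# nullable pandas dtype name -> plain numpy dtype name.
NULLABLE_TO_NUMPY = {"Int64": "int64", "Float64": "float64", "boolean": "bool"}


def to_numpy_labels(y):
    """Return labels as a numpy array with a non-nullable dtype where possible."""
    # Wrapping in a Series also accepts the extension array from Series.unique().
    y = pd.Series(y)
    target = NULLABLE_TO_NUMPY.get(str(y.dtype))
    if target is None:
        # Not one of the three nullable dtypes above: use the default conversion.
        return y.to_numpy()
    if y.isna().any():
        # pd.NA has no int64/bool representation, so fail loudly instead of guessing.
        raise ValueError("labels contain missing values; handle them before converting")
    return y.to_numpy(dtype=target)
```

The returned array can then be passed to `LabelBinarizer.fit` and `LabelBinarizer.transform` in the same way as the int64, float64, and bool inputs in the example above.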

Additional context

No response

@tamargrey added the Needs Triage and New Feature labels Feb 17, 2023
@thomasjpfan
Member

As noted in #25634 (comment), I opened #25638 to resolve this issue.

@tamargrey
Author

@thomasjpfan I'm still seeing the same errors from my reproduction above with sklearn 1.2.2 which has the code from #25638.

It's odd, because the error is coming from unique_labels, which was the subject of a related issue, #25634 (comment), that is indeed fixed.

@thomasjpfan reopened this Mar 10, 2023
@thomasjpfan
Member
thomasjpfan commented Mar 10, 2023

Even though #25638 was placed on the 1.2.2 milestone, it was not cherry-picked onto 1.2.2. If we have a 1.2.3 release, we can include it there. It's in the release, but the issue still remains.

With 1.2.2, passing in y_true directly works:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["Int64", "Float64", "boolean"]:

        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        lb.fit(y_true)

but calling y_true.unique() does not right now, so I'll reopen this issue.

@thomasjpfan
Member

I opened #25813 to fix the issue.
