inconsistent treatment of None and np.NaN in SimpleImputer · Issue #17625 · scikit-learn/scikit-learn
inconsistent treatment of None and np.NaN in SimpleImputer #17625
Open
@amueller

Description

Constant imputation treats only the value given as missing_values (np.nan by default) as missing, so a None stays in place:

from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([1, 2, np.NaN, None]).reshape(-1, 1)
SimpleImputer(strategy='constant', fill_value="asdf").fit_transform(X)
array([[1],
       [2],
       ['asdf'],
       [None]], dtype=object)

However, using strategy='mean' coerces the None to NaN and so both are replaced:

SimpleImputer(strategy='mean').fit_transform(X)
array([[1. ],
       [2. ],
       [1.5],
       [1.5]])
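
The coercion can be reproduced with NumPy alone. This is a minimal sketch, assuming the numeric strategies validate their input by casting it to a float array (an assumption about the internals, not something confirmed in this issue); it only shows that such a cast turns None into NaN, which makes the two values indistinguishable afterwards.

```python
import numpy as np

# Object array mixing ints, a float NaN, and None, as in the example above.
X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)

# Casting to float maps None to NaN, so both entries now look missing.
# (Assumed to mirror what happens before the mean is computed.)
X_float = X.astype(float)
missing = np.isnan(X_float)

# Mean over the observed values only; this is the value that would
# replace both missing entries.
column_mean = np.nanmean(X_float, axis=0)

print(missing.ravel())
print(column_mean)
```

After the cast there is no way to tell which entry was originally None, so any strategy that works on the float array has to treat both the same way.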

I don't think the definition of what's missing should depend on the strategy. @thomasjpfan argues that the current constant behavior is inconvenient because it means you have to impute both values separately if you want to one-hot-encode.
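
One possible way to avoid imputing twice is to normalize None to NaN before the constant imputation. The sketch below assumes that None and NaN should mean the same thing for the data at hand; `none_to_nan` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import numpy as np


def none_to_nan(X):
    """Return an object-dtype copy of X with every None replaced by NaN.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    out = np.array(X, dtype=object)  # np.array copies, so X is left untouched
    # Elementwise identity check; frompyfunc returns an object array of bools.
    is_none = np.frompyfunc(lambda v: v is None, 1, 1)(out).astype(bool)
    out[is_none] = np.nan
    return out


X = np.array([1, 2, np.nan, None], dtype=object).reshape(-1, 1)
X_norm = none_to_nan(X)

# Both formerly distinct missing markers are now NaN; observed values are kept.
print(X_norm.ravel())
```

The normalized array keeps dtype object, so it can still hold a string fill value, and a single constant imputation pass with the default missing_values should then see both entries as missing.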

It seems safer to treat them differently, but I'm not sure there's a use case for that. This came up in #17317.
I think this only matters in these two, as other imputers don't allow dtype object arrays.
