inconsistent treatment of None and np.NaN in SimpleImputer

@thomasjpfan

Constant imputation treats only the value given by missing_values (np.nan by default) as missing, so a None is left in place:

from sklearn.impute import SimpleImputer
import numpy as np
# Mixing numbers with None yields a dtype=object array
X = np.array([1, 2, np.nan, None]).reshape(-1, 1)
SimpleImputer(strategy='constant', fill_value="asdf").fit_transform(X)

array([[1],
       [2],
       ['asdf'],
       [None]], dtype=object)

However, using strategy='mean' coerces the None to NaN and so both are replaced:

SimpleImputer(strategy='mean').fit_transform(X)

array([[1. ],
       [2. ],
       [1.5],
       [1.5]])

I don't think the definition of what's missing should depend on the strategy. @thomasjpfan argues that the current constant behavior is inconvenient because it means you have to impute both values separately if you want to one-hot-encode.
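To make the "impute both values separately" inconvenience concrete, here is a minimal sketch of a pre-processing step that maps None and NaN to a single marker before any imputer sees the data, so the definition of "missing" no longer depends on the strategy. The helper name `normalize_missing` and its `marker` parameter are hypothetical and not part of scikit-learn; the sketch uses only the standard library.

```python
import math


def normalize_missing(values, marker=float("nan")):
    """Return a copy of `values` with None and float NaN replaced by `marker`.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    def is_missing(v):
        # np.nan is a plain Python float, so this check covers it as well.
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [marker if is_missing(v) else v for v in values]


column = [1, 2, float("nan"), None]
cleaned = normalize_missing(column)  # both missing entries are now NaN
```

After this step a single missing_values setting should be enough for either strategy, assuming the imputer treats NaN as described above.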

It seems safer to treat them differently, but I'm not sure there's a use case for that. This came up in #17317.
I think this only matters in these two, as other imputers don't allow dtype object arrays.
