Closed
Description
Describe the bug
Hello, I've encountered an issue where KNNImputer with add_indicator=True records which fields had missing data when .fit is called, but does not add the corresponding indicator columns when .transform is called on a dense matrix (one with no missing values). I would have expected it to return a 2x3 matrix rather than 2x2, with missingindicator_A = False for all cases.
Reproduction steps below. Any help much appreciated :)
Steps/Code to Reproduce
>>> import pandas as pd
>>> from sklearn.impute import KNNImputer
>>> knn = KNNImputer(add_indicator=True)
>>> df = pd.DataFrame({'A': [0, None], 'B': [1, 2]})
>>> df
     A  B
0  0.0  1
1  NaN  2
>>> knn.fit(df)
KNNImputer(add_indicator=True)
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())
     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 1.0
>>> df['A'] = 0
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())
Expected Results
     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 0.0
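For comparison, the expected 2x3 result can be obtained by computing the indicator separately. This is only a workaround sketch, not a fix: it assumes that fitting a standalone MissingIndicator(features="missing-only") on the same training data selects the same columns that add_indicator=True would, and the helper name transform_with_indicator is hypothetical. Plain NumPy arrays are used instead of the DataFrame to keep it self-contained.

```python
import numpy as np
from sklearn.impute import KNNImputer, MissingIndicator

# Same data as in the report, as plain arrays.
X_fit = np.array([[0.0, 1.0], [np.nan, 2.0]])    # column A has a missing value
X_dense = np.array([[0.0, 1.0], [0.0, 2.0]])     # no missing values at all

# Fit the imputer WITHOUT the built-in indicator, and a separate indicator
# that remembers which columns had missing values at fit time.
imputer = KNNImputer(add_indicator=False).fit(X_fit)
indicator = MissingIndicator(features="missing-only").fit(X_fit)


def transform_with_indicator(X):
    """Hypothetical helper: imputed values followed by the indicator columns."""
    imputed = imputer.transform(X)
    flags = indicator.transform(X).astype(float)
    return np.hstack([imputed, flags])


result = transform_with_indicator(X_dense)
print(result.shape)
print(result)
```

With this arrangement the indicator column is always present, and it is all zeros when the input to transform has no missing values.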
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[30], line 1
----> 1 pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())
File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:694, in DataFrame.__init__(self, data, index, columns, dtype, copy)
684 mgr = dict_to_mgr(
685 # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
686 # attribute "name"
(...)
691 typ=manager,
692 )
693 else:
--> 694 mgr = ndarray_to_mgr(
695 data,
696 index,
697 columns,
698 dtype=dtype,
699 copy=copy,
700 typ=manager,
701 )
703 # For data is list-like, or Iterable (will consume into list)
704 elif is_list_like(data):
File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:351, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
346 # _prep_ndarray ensures that values.ndim == 2 at this point
347 index, columns = _get_axes(
348 values.shape[0], values.shape[1], index=index, columns=columns
349 )
--> 351 _check_values_indices_shape_match(values, index, columns)
353 if typ == "array":
355 if issubclass(values.dtype.type, str):
File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:422, in _check_values_indices_shape_match(values, index, columns)
420 passed = values.shape
421 implied = (len(index), len(columns))
--> 422 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (2, 2), indices imply (2, 3)
Versions
python3, sklearn = 1.2.1