KNNImputer add_indicator fails to persist where missing data had been present in training · Issue #26590 · scikit-learn/scikit-learn · GitHub
KNNImputer add_indicator fails to persist where missing data had been present in training #26590
Closed
@djwbamboo

Description

Describe the bug

Hello, I've encountered an issue where KNNImputer with add_indicator=True does not persist the indicator columns for fields that had missing data when .fit was called: if .transform is later called on data with no missing values, the indicator columns are left out of the output. I would have expected it to return a 2x3 matrix rather than 2x2, with missingindicator_A = False for all rows.

Reproduction steps below. Any help much appreciated :)
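To make the expected semantics concrete, here is a minimal plain-Python sketch. It is not scikit-learn code: `ToyImputer` is a hypothetical class, it uses column-mean imputation rather than KNN, and it assumes every column has at least one observed value during fit. The point it illustrates is only that the indicator columns are decided at fit time and emitted at transform time, whether or not the new data has missing values.

```python
import math


class ToyImputer:
    """Hypothetical column-mean imputer with a fit-time missing indicator."""

    def fit(self, rows):
        n_cols = len(rows[0])
        cols = [[r[j] for r in rows] for j in range(n_cols)]
        # Remember which columns contained NaN during fit.
        self.indicator_cols_ = [
            j for j, c in enumerate(cols) if any(math.isnan(v) for v in c)
        ]
        # Column means over the observed (non-NaN) values.
        self.means_ = []
        for c in cols:
            observed = [v for v in c if not math.isnan(v)]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            imputed = [
                self.means_[j] if math.isnan(v) else float(v)
                for j, v in enumerate(r)
            ]
            # One flag per column recorded at fit time, even if no NaN is
            # present in the data being transformed.
            flags = [1.0 if math.isnan(r[j]) else 0.0 for j in self.indicator_cols_]
            out.append(imputed + flags)
        return out


train = [[0.0, 1.0], [float("nan"), 2.0]]
dense = [[0.0, 1.0], [0.0, 2.0]]

imp = ToyImputer().fit(train)
print(imp.transform(train))
print(imp.transform(dense))
```

Both calls return three values per row here; the third is the indicator for column A, which is all zeros for the data without missing values.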

Steps/Code to Reproduce

>>> import pandas as pd
>>> from sklearn.impute import KNNImputer
>>> knn = KNNImputer(add_indicator=True)
>>> df = pd.DataFrame({'A': [0, None], 'B': [1, 2]})
>>> df
     A  B
0  0.0  1
1  NaN  2
>>> knn.fit(df)
KNNImputer(add_indicator=True)
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())
     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 1.0
>>> df['A'] = 0
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())

Expected Results

     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 0.0

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[30], line 1
----> 1 pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:694, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    684         mgr = dict_to_mgr(
    685             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    686             # attribute "name"
   (...)
    691             typ=manager,
    692         )
    693     else:
--> 694         mgr = ndarray_to_mgr(
    695             data,
    696             index,
    697             columns,
    698             dtype=dtype,
    699             copy=copy,
    700             typ=manager,
    701         )
    703 # For data is list-like, or Iterable (will consume into list)
    704 elif is_list_like(data):

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:351, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    346 # _prep_ndarray ensures that values.ndim == 2 at this point
    347 index, columns = _get_axes(
    348     values.shape[0], values.shape[1], index=index, columns=columns
    349 )
--> 351 _check_values_indices_shape_match(values, index, columns)
    353 if typ == "array":
    355     if issubclass(values.dtype.type, str):

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:422, in _check_values_indices_shape_match(values, index, columns)
    420 passed = values.shape
    421 implied = (len(index), len(columns))
--> 422 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (2, 2), indices imply (2, 3)

Versions

python3, sklearn = 1.2.1
