KNNImputer add_indicator fails to persist where missing data had been present in training · Issue #26590 · scikit-learn/scikit-learn · GitHub

KNNImputer add_indicator fails to persist where missing data had been present in training #26590

Closed
djwbamboo opened this issue Jun 15, 2023 · 6 comments · Fixed by #26600

djwbamboo commented Jun 15, 2023

Describe the bug

Hello, I've encountered an issue with KNNImputer(add_indicator=True): when a feature had missing data at the time .fit was called, its indicator column is not returned if .transform is later called on a matrix with no missing values. I would have expected it to return a 2x3 matrix rather than 2x2, with missingindicator_A = False for all rows.

Reproduction steps below. Any help much appreciated :)

Steps/Code to Reproduce

>>> import pandas as pd
>>> from sklearn.impute import KNNImputer
>>> knn = KNNImputer(add_indicator=True)
>>> df = pd.DataFrame({'A': [0, None], 'B': [1, 2]})
>>> df
     A  B
0  0.0  1
1  NaN  2
>>> knn.fit(df)
KNNImputer(add_indicator=True)
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())
     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 1.0
>>> df['A'] = 0
>>> pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())

Expected Results

     A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 0.0

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[30], line 1
----> 1 pd.DataFrame(knn.transform(df), columns=knn.get_feature_names_out())

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:694, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    684         mgr = dict_to_mgr(
    685             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    686             # attribute "name"
   (...)
    691             typ=manager,
    692         )
    693     else:
--> 694         mgr = ndarray_to_mgr(
    695             data,
    696             index,
    697             columns,
    698             dtype=dtype,
    699             copy=copy,
    700             typ=manager,
    701         )
    703 # For data is list-like, or Iterable (will consume into list)
    704 elif is_list_like(data):

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:351, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    346 # _prep_ndarray ensures that values.ndim == 2 at this point
    347 index, columns = _get_axes(
    348     values.shape[0], values.shape[1], index=index, columns=columns
    349 )
--> 351 _check_values_indices_shape_match(values, index, columns)
    353 if typ == "array":
    355     if issubclass(values.dtype.type, str):

File /opt/conda/lib/python3.10/site-packages/pandas/core/internals/construction.py:422, in _check_values_indices_shape_match(values, index, columns)
    420 passed = values.shape
    421 implied = (len(index), len(columns))
--> 422 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (2, 2), indices imply (2, 3)

Versions

python3, sklearn = 1.2.1
djwbamboo added the Bug and Needs Triage labels Jun 15, 2023
@Shreesha3112
Contributor

Currently, the transform method of KNNImputer returns the input X without appending the missing-indicator columns whenever X has no missing values, regardless of whether indicator features were recorded during fit. I think this needs to be fixed.

        mask = _get_mask(X, self.missing_values)
        mask_fit_X = self._mask_fit_X
        valid_mask = self._valid_mask
        X_indicator = super()._transform_indicator(mask)
        # Removes columns where the training data is all nan
        if not np.any(mask):
            # No missing values in X
            if self.keep_empty_features:
                Xc = X
                Xc[:, ~valid_mask] = 0
            else:
                Xc = X[:, valid_mask]
            return Xc

@Shreesha3112
Contributor
Shreesha3112 commented Jun 16, 2023

We can fix it by concatenating with the indicator even if there are no missing values.

        # Removes columns where the training data is all nan
        if not np.any(mask):
            # No missing values in X
            if self.keep_empty_features:
                Xc = X
                Xc[:, ~valid_mask] = 0
            else:
                Xc = X[:, valid_mask]

            # Concatenate with the indicator even if there's no missing values
            return super()._concatenate_indicator(Xc, X_indicator)
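
The difference between the two versions of this branch can be illustrated with a small NumPy-only sketch. The function toy_transform and its parameters are hypothetical stand-ins written for this illustration; they are not scikit-learn code and only mimic the "no missing values in X" path.

```python
import numpy as np


def toy_transform(X, fit_missing_cols, concatenate_indicator):
    """Toy stand-in for the no-missing-values branch; not scikit-learn code."""
    mask = np.isnan(X)
    # Indicator columns are defined by the features that had NaNs during fit.
    X_indicator = mask[:, fit_missing_cols].astype(X.dtype)
    if not mask.any():
        if concatenate_indicator:
            # Proposed behaviour: always append the indicator columns.
            return np.hstack([X, X_indicator])
        # Reported behaviour: early return drops the indicator columns.
        return X
    raise NotImplementedError("imputation itself is outside this sketch")


X_dense = np.array([[0.0, 1.0], [0.0, 2.0]])
print(toy_transform(X_dense, [0], concatenate_indicator=False).shape)  # (2, 2)
print(toy_transform(X_dense, [0], concatenate_indicator=True).shape)   # (2, 3)
```

The (2, 2) result corresponds to the shape in the reported ValueError, while (2, 3) matches the number of names returned by get_feature_names_out.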

test:

import pandas as pd
from sklearn.impute import KNNImputer
knn = KNNImputer(add_indicator=True)
input_df = pd.DataFrame({'A': [0, None], 'B': [1, 2]})
print('Input dataframe with missing values used in fit: \n',input_df)
knn.fit(input_df)
transformed_df = pd.DataFrame(knn.transform(input_df), columns=knn.get_feature_names_out())
print("input transformed \n",transformed_df)
print("\n\n-------------------\n\n")
df_with_no_missing_values = input_df.copy()
df_with_no_missing_values['A'] = 0
print("df with no missing values \n",df_with_no_missing_values)
df_with_no_missing_values_transformed = pd.DataFrame(knn.transform(df_with_no_missing_values),
                                                     columns=knn.get_feature_names_out())
print("df with no missing values after transformation\n",df_with_no_missing_values_transformed)

output

Input dataframe with missing values used in fit: 
      A  B
0  0.0  1
1  NaN  2
input transformed 
      A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 1.0


-------------------


df with no missing values 
    A  B
0  0  1
1  0  2
df with no missing values after transformation
      A    B  missingindicator_A
0  0.0  1.0                 0.0
1  0.0  2.0                 0.0

@djwbamboo
Author

That looks great Shreesha! Thanks for looking into this so quickly :)

@Shreesha3112
Contributor
Shreesha3112 commented Jun 16, 2023

> That looks great Shreesha! Thanks for looking into this so quickly :)

I will create a PR if reviewers are happy with the approach.

ogrisel removed the Needs Triage label Jun 16, 2023
@ogrisel
Member
ogrisel commented Jun 16, 2023

Not sure if I understand the details but please feel free to open a PR, as it will make it easier to grasp the details and it definitely looks like a bug from afar.

@Shreesha3112
Contributor

OK, I will open a PR for this issue.
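
For anyone on an affected version, one possible workaround (not proposed in this thread, so treat it as an assumption) is to disable add_indicator and fit a separate MissingIndicator, whose output columns are fixed at fit time. The helper transform_with_indicator below is a hypothetical name used only for this sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer, MissingIndicator

X_train = np.array([[0.0, 1.0], [np.nan, 2.0]])  # column 0 has a missing value
X_dense = np.array([[0.0, 1.0], [0.0, 2.0]])     # nothing missing at transform

# Impute without the built-in indicator, and fit the indicator separately.
knn = KNNImputer(add_indicator=False).fit(X_train)
indicator = MissingIndicator(features="missing-only").fit(X_train)


def transform_with_indicator(X):
    """Stack imputed values and the indicator columns chosen during fit."""
    return np.hstack([knn.transform(X), indicator.transform(X)])


print(transform_with_indicator(X_train).shape)
print(transform_with_indicator(X_dense).shape)
```

Both calls should give three columns, because MissingIndicator keeps the features it saw as missing during fit even when the transformed data is dense. Column names are not handled here and would need to be assembled by hand.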
