8000 FIX KNNImputer missing indicator column addition when add_indicator=True by Shreesha3112 · Pull Request #26600 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

FIX KNNImputer missing indicator column addition when add_indicator=True #26600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Shreesha3112
Copy link
Contributor
@Shreesha3112 Shreesha3112 commented Jun 18, 2023

Reference Issues/PRs

Fixes #26590

What does this implement/fix? Explain your changes.

Previously, the KNNImputer did not add a missing indicator column in transform method, when the add_indicator parameter was set to True and no missing values are present in the input. This could lead to inconsistent outputs between training and testing datasets for example.

This bug fix ensures that KNNImputer correctly adds a missing indicator column in its transform output when the add_indicator is set to True, even if no missing values are present in the input.

Any other comments?

@github-actions
Copy link
github-actions bot commented Jun 25, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1244383. Link to the linter CI: here

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @Shreesha3112 !


@pytest.mark.parametrize("add_indicator", [True, False])
@pytest.mark.parametrize("missing_value_test", [np.NaN, 1])
def test_knn_imputation_adds_missing_indicator_column_if_add_indicator_is_true(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this property applies to all imputers, thus I think it's better to have a test in sklearn/impute/tests/test_common.py:

@pytest.mark.parametrize("imputer", imputers(), ids=lambda x: x.__class__.__name__)
@pytest.mark.parametrize("missing_value_test", [np.nan, 1])
def test_imputation_adds_missing_indicator_if_add_indicator_is_true(
    imputer, missing_value_test
):
    """Check that missing indicator always exists when add_indicator=True.

    Non-regression test for gh-26590.
    """
    X_train = np.array([[0, np.NaN], [1, 2]])

    # Test data without missing values that can be set to np.NaN or 1.
    X_test = np.array([[0, missing_value_test], [1, 2]])

    imputer.set_params(add_indicator=True)
    imputer.fit(X_train)

    X_test_imputed_with_indicator = imputer.transform(X_test)
    assert X_test_imputed_with_indicator.shape == (2, 3)

    imputer.set_params(add_indicator=False)
    imputer.fit(X_train)
    X_test_imputed_without_indicator = imputer.transform(X_test)
    assert X_test_imputed_without_indicator.shape == (2, 2)

    assert_allclose(
        X_test_imputed_with_indicator[:, :-1], X_test_imputed_without_indicator
    )
    if np.isnan(missing_value_test):
        expected_missing_indicator = [1, 0]
    else:
        expected_missing_indicator = [0, 0]

    assert_allclose(X_test_imputed_with_indicator[:, -1], expected_missing_indicator)

Given the original bug report, I am okay with checking add_in 8000 dicator=True, which is also consistent with the name of the test "*_if_add_indicator_is_true".

:mod:`sklearn.impute`
.....................

- |Fix| :class:`impute.KNNImputer` now correctly adds a missing indicator column in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this changelog entry be moved to doc/whats_new/1.3.rst under 1.3.1 ?

@thomasjpfan thomasjpfan added this to the 1.3.1 milestone Jul 26, 2023
@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jul 26, 2023
Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update! LGTM

@thomasjpfan thomasjpfan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jul 27, 2023
@thomasjpfan thomasjpfan changed the title Fix KNNImputer missing indicator column addition when add_indicator=True FIX KNNImputer missing indicator column addition when add_indicator=True Jul 27, 2023
@haiatn
Copy link
Contributor
haiatn commented Jul 29, 2023

Hoping this will be merged soon! Thank you for this PR

Copy link
Contributor
@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Looks good overall.

@OmarManzoor OmarManzoor removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Aug 8, 2023
@thomasjpfan thomasjpfan enabled auto-merge (squash) August 8, 2023 22:04
@thomasjpfan thomasjpfan merged commit 687465f into scikit-learn:main Aug 8, 2023
@Shreesha3112 Shreesha3112 deleted the fix/knnimputer-missing-indicator-behavior branch August 9, 2023 04:39
TamaraAtanasoska pushed a commit to TamaraAtanasoska/scikit-learn that referenced this pull request Aug 21, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 18, 2023
jeremiedbb pushed a commit that referenced this pull request Sep 20, 2023
…rue (#26600)

Co-authored-by: shreesha3112 <shreesha3112.com>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:impute To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KNNImputer add_indicator fails to persist where missing data had been present in training
4 participants
0