[WIP] DOC Explain missing value mechanisms by aperezlebel · Pull Request #23746 · scikit-learn/scikit-learn · GitHub

[WIP] DOC Explain missing value mechanisms #23746

Draft: aperezlebel wants to merge 1 commit into base: main

aperezlebel
Contributor

Reference Issues/PRs

Addresses task 3 of #21967.

What does this implement/fix? Explain your changes.

Add a section to the "Imputation of missing values" doc to explain the missing value mechanisms.

Any other comments?

Work in progress

@aperezlebel changed the title from "[WIP] DOC Add draft of missing value mechanisms section" to "[WIP] DOC Explain missing value mechanisms" on Jun 23, 2022
@glemaitre (Member) left a comment:

It is already looking good.

The synthetic illustration is good. In addition, I wonder whether we could illustrate the mechanisms with a concrete application setting related to data collection, so that we cover both the formal and the applied aspects.

@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.
Member

Suggested change
- Three mechanisms model data missingness.
+ Three mechanisms model data missingness exist:

========================
Three mechanisms model data missingness.

* **Missing Completely At Random (MCAR)**: the missingness does not depend on data.
Member

What about giving a concrete example to illustrate each mechanism?

:align: center
:scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
Member

Suggested change
- In the above example, X1 is always observed. In the first plot, X2 is masked
+ In the above example, X1 is always observed. In the left-hand side plot, X2 is masked

:scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
independently of the values of (X1, X2), hence MCAR. In the second, X2 is
Member

Suggested change
- independently of the values of (X1, X2), hence MCAR. In the second, X2 is
+ independently of the values of (X1, X2), hence MCAR. In the middle, X2 is


In the above example, X1 is always observed. In the first plot, X2 is masked
independently of the values of (X1, X2), hence MCAR. In the second, X2 is
masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
Member

Suggested change
- masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
+ masked when X1 (observed) reaches some threshold, hence MAR. In the right-hand side plot, X2 is
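
The three masking rules described in the figure caption can be sketched with NumPy. This is only an illustration, not the script that produced the PR's figure: the correlation, the 0.3 masking probability, and the thresholds at 0 are arbitrary choices. The caption text for the last plot is cut off in this thread, so the MNAR rule below (mask X2 when X2 itself exceeds a threshold) is an assumption based on the MNAR definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Two correlated features (illustrative parameters). X1 is always observed.
X = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]], size=n_samples
)
x1, x2 = X[:, 0], X[:, 1]

# MCAR: X2 is masked independently of the values of (X1, X2).
mask_mcar = rng.random(n_samples) < 0.3

# MAR: X2 is masked when the observed X1 exceeds a threshold.
mask_mar = x1 > 0

# MNAR (assumed rule): X2 is masked when X2 itself exceeds a threshold.
mask_mnar = x2 > 0


def apply_mask(X, mask):
    """Return a copy of X with the second column set to NaN where mask is True."""
    X_missing = X.copy()
    X_missing[mask, 1] = np.nan
    return X_missing


X_mcar = apply_mask(X, mask_mcar)
X_mar = apply_mask(X, mask_mar)
X_mnar = apply_mask(X, mask_mnar)
```

Under the MAR rule the mask can be recomputed from the observed column alone, whereas under the assumed MNAR rule every remaining observed value of X2 lies below the threshold, so the observed values are a biased sample of X2.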

@glemaitre glemaitre self-requested a review November 8, 2022 14:46
@glemaitre glemaitre removed their request for review January 10, 2023 17:25
@glemaitre
Member

@aperezlebel do you want to address the comments and resolve the conflict so that we can merge this PR?


* **Missing Completely At Random (MCAR)**: the missingness does not depend on data.
* **Missing At Random (MAR)**: the missingness does not depend on underlying
missing values but can depend on observed ones.
Member

Including the target variable y?
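
For reference, the bullets under review paraphrase the usual probabilistic formulation of the three mechanisms. The notation below is an assumption and does not appear in the PR: M is the missingness mask, and the data are split into an observed part X_obs and a missing part X_mis. The notation leaves `y` out, so it does not settle the question raised above.

```latex
\begin{align*}
\text{MCAR:} \quad & P(M \mid X_{\text{obs}}, X_{\text{mis}}) = P(M) \\
\text{MAR:}  \quad & P(M \mid X_{\text{obs}}, X_{\text{mis}}) = P(M \mid X_{\text{obs}}) \\
\text{MNAR:} \quad & P(M \mid X_{\text{obs}}, X_{\text{mis}}) \text{ depends on } X_{\text{mis}}
\end{align*}
```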

@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.
Member

Suggested change
- Three mechanisms model data missingness.
+ The machine learning literature typically distinguishes between the following
+ settings. Note that the names are not necessarily very intuitive:

* **Missing At Random (MAR)**: the missingness does not depend on underlying
missing values but can depend on observed ones.
* **Missing Not At Random (MNAR)**: the missingness depends on underlying missing
values.
Member

Suggested change
- values.
+ values. Therefore, the missingness pattern can be statistically associated
+ with `y` in a supervised classification or regression setting.
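
The association mentioned in this suggestion can be checked on synthetic data. The sketch below assumes a single feature, a target `y` that is a noisy copy of it, and an MNAR rule that masks the feature above an arbitrary threshold of 0.5; none of these choices come from the PR. It uses `SimpleImputer(add_indicator=True)`, which appends a binary missingness indicator column to the imputed output.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n_samples = 2000

# Synthetic setting (assumed): y is a noisy copy of a single feature x.
x = rng.normal(size=n_samples)
y = x + 0.1 * rng.normal(size=n_samples)

# MNAR: the feature is masked when its own value exceeds a threshold.
X = x.reshape(-1, 1).copy()
X[x > 0.5, 0] = np.nan

# Mean imputation plus a binary missingness indicator column.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X)
indicator = X_imputed[:, 1]

# Correlation between the missingness pattern and the target.
corr = np.corrcoef(indicator, y)[0, 1]
print(f"correlation between missingness indicator and y: {corr:.2f}")
```

In this setting the indicator is clearly correlated with `y`, so a downstream estimator can exploit the missingness pattern itself, which mean imputation alone would discard.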

4 participants