[WIP] DOC Explain missing value mechanisms #23746
It is already looking good.
In addition to the synthetic illustration, which is good, I wonder whether we could also illustrate the mechanisms with a concrete data-collection setting, so that the section covers both the formal and the applied aspects.
@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.
Suggested change:
- Three mechanisms model data missingness.
+ Three mechanisms of data missingness exist:
========================
Three mechanisms model data missingness.

* **Missing Completely At Random (MCAR)**: the missingness does not depend on data.
What about giving a concrete example to illustrate each mechanism?
:align: center
:scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
Suggested change:
- In the above example, X1 is always observed. In the first plot, X2 is masked
+ In the above example, X1 is always observed. In the left-hand side plot, X2 is masked
:scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
independently of the values of (X1, X2), hence MCAR. In the second, X2 is
Suggested change:
- independently of the values of (X1, X2), hence MCAR. In the second, X2 is
+ independently of the values of (X1, X2), hence MCAR. In the middle, X2 is
In the above example, X1 is always observed. In the first plot, X2 is masked
independently of the values of (X1, X2), hence MCAR. In the second, X2 is
masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
Suggested change:
- masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
+ masked when X1 (observed) reaches some threshold, hence MAR. In the right-hand side plot, X2 is
@aperezlebel do you want to address the comment and resolve the conflict so that we can merge this PR?
* **Missing Completely At Random (MCAR)**: the missingness does not depend on data.
* **Missing At Random (MAR)**: the missingness does not depend on underlying
  missing values but can depend on observed ones.
Including the target variable `y`?
@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.
Suggested change:
- Three mechanisms model data missingness.
+ The machine learning literature typically distinguishes between the following
+ settings. Note that the names are not necessarily very intuitive:
* **Missing At Random (MAR)**: the missingness does not depend on underlying
  missing values but can depend on observed ones.
* **Missing Not At Random (MNAR)**: the missingness depends on underlying missing
  values.
Suggested change:
- values.
+ values. Therefore, the missingness pattern can be statistically associated
+ with `y` in a supervised classification or regression setting.
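
For reference while reviewing the wording, the three definitions in the proposed list can be written in the usual conditional-probability form. This is only a sketch in notation that does not appear in the PR: $M$ is assumed to denote the binary missingness mask, and $X = (X_{\mathrm{obs}}, X_{\mathrm{mis}})$ the split of the data into its observed and missing parts.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M) \\
\text{MAR:}\quad  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M \mid X_{\mathrm{obs}}) \\
\text{MNAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) \neq P(M \mid X_{\mathrm{obs}})
  \quad\text{(the mask still depends on } X_{\mathrm{mis}}\text{)}
\end{align*}
```

Under these definitions MCAR is a special case of MAR, which matches the "does not depend on data" versus "can depend on observed ones" wording of the list.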
Reference Issues/PRs
Addresses task 3 of #21967.
What does this implement/fix? Explain your changes.
Add a section to the "Imputation of missing values" doc to explain the missing value mechanisms.
Any other comments?
Work in progress
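
The masking schemes discussed in the review thread (X1 always observed, X2 masked under each mechanism) can be sketched with NumPy as follows. This is a minimal illustration, not the code used to produce the figure in this PR: the data distribution, the 30% masking rate, the threshold of 0.5, and the helper `apply_mask` are all assumptions made for the example. The MNAR rule in particular (mask X2 when X2 itself exceeds the threshold) is one possible choice consistent with the definition, since the corresponding sentence is cut off in the diff context.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Two correlated Gaussian features (assumed distribution, for illustration only).
cov = [[1.0, 0.5], [0.5, 1.0]]
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_samples)
x1, x2 = X[:, 0], X[:, 1]

# MCAR: the mask is drawn independently of both X1 and X2.
mask_mcar = rng.random(n_samples) < 0.3
# MAR: the mask depends only on X1, which is always observed.
mask_mar = x1 > 0.5
# MNAR: the mask depends on X2, the very value that ends up missing (assumed rule).
mask_mnar = x2 > 0.5


def apply_mask(X, mask):
    """Return a copy of X in which column 1 (X2) is NaN wherever mask is True."""
    X_missing = X.copy()
    X_missing[mask, 1] = np.nan
    return X_missing


X_mcar = apply_mask(X, mask_mcar)
X_mar = apply_mask(X, mask_mar)
X_mnar = apply_mask(X, mask_mnar)

# Mean of the observed X2 values under each mechanism, next to the full-data mean.
for name, X_m in [("MCAR", X_mcar), ("MAR", X_mar), ("MNAR", X_mnar)]:
    print(name, np.nanmean(X_m[:, 1]), "full:", x2.mean())
```

Comparing the observed-data mean of X2 with the full-data mean gives a quick feel for how each mechanism distorts what remains visible, which could also serve as a starting point for the concrete example requested above.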