[WIP] DOC Explain missing value mechanisms #23746
@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.

* **Missing Completely At Random (MCAR)**: the missingness does not depend on
  the data, whether observed or missing.

Review comment: What about giving a concrete example for each mechanism to
illustrate it?

* **Missing At Random (MAR)**: the missingness does not depend on the underlying
  missing values but can depend on observed ones.

Review comment: Including the target variable.

* **Missing Not At Random (MNAR)**: the missingness depends on the underlying
  missing values.

.. figure:: ../images/missing_value_mechanisms.png
   :align: center
   :scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
independently of the values of (X1, X2), hence MCAR. In the second, X2 is
masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
masked when X2 reaches some threshold, hence MNAR.

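The three masking schemes described above can be sketched with NumPy. This is
a minimal illustration, not the script that produced the figure: the sample
size, the coefficients linking X1 and X2, the 30% masking rate and the
threshold of 0.5 are all assumptions chosen for the example, and "reaches some
threshold" is read here as "exceeds it".

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # assumed sample size

# X1 is always observed; X2 is correlated with X1 (assumed coefficients).
X1 = rng.normal(size=n_samples)
X2 = 0.8 * X1 + 0.6 * rng.normal(size=n_samples)

threshold = 0.5  # assumed threshold

# MCAR: X2 is masked independently of the values of (X1, X2).
mcar_mask = rng.random(n_samples) < 0.3
# MAR: X2 is masked when the observed X1 exceeds the threshold.
mar_mask = X1 > threshold
# MNAR: X2 is masked when X2 itself exceeds the threshold.
mnar_mask = X2 > threshold

# Apply each mask by replacing the masked entries of X2 with NaN.
X2_mcar = np.where(mcar_mask, np.nan, X2)
X2_mar = np.where(mar_mask, np.nan, X2)
X2_mnar = np.where(mnar_mask, np.nan, X2)

for name, masked in [("MCAR", X2_mcar), ("MAR", X2_mar), ("MNAR", X2_mnar)]:
    print(f"{name}: mean of observed X2 = {np.nanmean(masked):.3f}")
print(f"full data: mean of X2 = {X2.mean():.3f}")
```

Under the MNAR mask every remaining X2 value is at most the threshold, so the
mean of the observed values is necessarily below the full-data mean. Under the
MCAR mask the observed values are simply a random subsample of X2.
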

Conditional imputation (e.g. :class:`~sklearn.impute.IterativeImputer` or
:class:`~sklearn.impute.KNNImputer`) is guaranteed to work only for ignorable
missingness (i.e. the MCAR or MAR settings). When the missingness is not
ignorable, i.e. in the MNAR setting, the mask needs to be added to the features
(``add_indicator=True``) because the missingness itself is informative. In
practice, real-world data are often MNAR.

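As a sketch of the advice above, the snippet below builds a small MNAR-style
dataset and imputes it with ``add_indicator=True``. The synthetic data, the
threshold of 1.0 and the variable names are assumptions made for illustration.
:class:`~sklearn.impute.IterativeImputer` is still experimental, hence the
explicit ``enable_iterative_imputer`` import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)

# Two correlated features (assumed toy data).
X = rng.normal(size=(200, 2))
X[:, 1] += X[:, 0]

# MNAR-style masking: the second feature is hidden when its own value is large.
X_missing = X.copy()
X_missing[X[:, 1] > 1.0, 1] = np.nan

# add_indicator=True appends a binary missingness column to the imputed data.
iterative = IterativeImputer(add_indicator=True, random_state=0)
X_iterative = iterative.fit_transform(X_missing)

knn = KNNImputer(n_neighbors=5, add_indicator=True)
X_knn = knn.fit_transform(X_missing)

print(X_iterative.shape, X_knn.shape)
```

With the default indicator settings, only features that had missing values
during ``fit`` get an indicator column, so each output here holds the two
imputed features followed by one binary column for the second feature. A
downstream estimator can use that column to learn from the missingness itself.
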

Univariate vs. Multivariate Imputation
======================================

@@ -317,8 +342,8 @@ wrap this in a :class:`Pipeline` with a classifier (e.g., a
Estimators that handle NaN values
=================================

Some estimators are designed to handle NaN values without preprocessing.
Below is the list of these estimators, classified by type
(cluster, regressor, classifier, transform):

.. allow_nan_estimators::
