[WIP] DOC Explain missing value mechanisms by aperezlebel · Pull Request #23746 · scikit-learn/scikit-learn

Draft: wants to merge 1 commit into main
Binary file added doc/images/missing_value_mechanisms.png
29 changes: 27 additions & 2 deletions doc/modules/impute.rst
@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Missing value mechanisms
========================
Three mechanisms model data missingness.
Suggested change (reviewer, Member):
  - Three mechanisms model data missingness.
  + Three mechanisms model data missingness exist:

Suggested change (reviewer, Member):
  - Three mechanisms model data missingness.
  + The machine learning literature typically distinguishes between the following
  + settings. Note that the names are not necessarily very intuitive:


* **Missing Completely At Random (MCAR)**: the missingness does not depend on data.
Reviewer comment (Member): What about giving a concrete example for each mechanism to illustrate it?

* **Missing At Random (MAR)**: the missingness does not depend on underlying
missing values but can depend on observed ones.
Reviewer comment (Member): Including the target variable y?

* **Missing Not At Random (MNAR)**: the missingness depends on underlying missing
values.
Suggested change (reviewer, Member):
  - values.
  + values. Therefore, the missingness pattern can be statistically associated
  + with `y` in a supervised classification or regression setting.


.. figure:: ../images/missing_value_mechanisms.png
:align: center
:scale: 20%

In the above example, X1 is always observed. In the first plot, X2 is masked
Suggested change (reviewer, Member):
  - In the above example, X1 is always observed. In the first plot, X2 is masked
  + In the above example, X1 is always observed. In the left-hand side plot, X2 is masked

independently of the values of (X1, X2), hence MCAR. In the second, X2 is
Suggested change (reviewer, Member):
  - independently of the values of (X1, X2), hence MCAR. In the second, X2 is
  + independently of the values of (X1, X2), hence MCAR. In the middle, X2 is

masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
Suggested change (reviewer, Member):
  - masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
  + masked when X1 (observed) reaches some threshold, hence MAR. In the right-hand side plot, X2 is

masked when X2 reaches some threshold, hence MNAR.
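
For a concrete illustration, here is a minimal NumPy sketch of how the three
mechanisms could be simulated on a two-feature dataset. It is illustrative
only and not part of the proposed documentation; the sample size, masking
thresholds and variable names are arbitrary choices::

    import numpy as np

    rng = np.random.RandomState(0)
    X1 = rng.normal(size=1000)  # always observed
    X2 = rng.normal(size=1000)  # will be partially masked

    # MCAR: the mask is drawn independently of both X1 and X2.
    mask_mcar = rng.rand(1000) < 0.3

    # MAR: the mask depends only on the observed feature X1.
    mask_mar = X1 > np.quantile(X1, 0.7)

    # MNAR: the mask depends on the unobserved values of X2 itself.
    mask_mnar = X2 > np.quantile(X2, 0.7)

    # Apply one of the masks, e.g. MNAR:
    X2_missing = np.where(mask_mnar, np.nan, X2)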

Conditional imputation (e.g. :class:`~sklearn.impute.IterativeImputer` or
:class:`~sklearn.impute.KNNImputer`) is only guaranteed to work for ignorable
missingness (i.e. the MCAR or MAR settings). When the missingness is not
ignorable, i.e. in the MNAR setting, adding the missingness mask
(`add_indicator=True`) is needed because the missingness itself is
informative. In practice, real-world data are often MNAR.
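
As a sketch of how the missingness mask mentioned above can be added
(illustrative only; the toy data below is hypothetical and not part of the
proposed diff)::

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [5.0, 6.0],
                  [np.nan, 8.0]])

    # add_indicator=True appends one binary column per feature that had
    # missing values during fit, so a downstream estimator can use the
    # missingness pattern itself as a feature (useful under MNAR).
    imputer = IterativeImputer(add_indicator=True, random_state=0)
    X_imputed = imputer.fit_transform(X)
    # X_imputed has 4 columns: 2 imputed features + 2 missingness indicators.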

Univariate vs. Multivariate Imputation
======================================

@@ -317,8 +342,8 @@ wrap this in a :class:`Pipeline` with a classifier (e.g., a
Estimators that handle NaN values
=================================

Some estimators are designed to handle NaN values without preprocessing.
Below is the list of these estimators, classified by type
(cluster, regressor, classifier, transform):

.. allow_nan_estimators::