diff --git a/doc/images/missing_value_mechanisms.png b/doc/images/missing_value_mechanisms.png
new file mode 100644
index 0000000000000..c582a99686a9a
Binary files /dev/null and b/doc/images/missing_value_mechanisms.png differ
diff --git a/doc/modules/impute.rst b/doc/modules/impute.rst
index 2df6e0a76bd73..42ecb2fc0bec7 100644
--- a/doc/modules/impute.rst
+++ b/doc/modules/impute.rst
@@ -17,6 +17,31 @@ values, i.e., to infer them from the known part of the data. See the
 :ref:`glossary` entry on imputation.
 
 
+Missing value mechanisms
+========================
+Three mechanisms are commonly used to model missingness in data.
+
+* **Missing Completely At Random (MCAR)**: the missingness does not depend on the data.
+* **Missing At Random (MAR)**: the missingness does not depend on the underlying
+  missing values but can depend on the observed ones.
+* **Missing Not At Random (MNAR)**: the missingness depends on the underlying
+  missing values.
+
+.. figure:: ../images/missing_value_mechanisms.png
+   :align: center
+   :scale: 20%
+
+In the figure above, X1 is always observed. In the first plot, X2 is masked
+independently of the values of (X1, X2), hence MCAR. In the second, X2 is
+masked when X1 (observed) reaches some threshold, hence MAR. In the last, X2 is
+masked when X2 itself reaches some threshold, hence MNAR.
+
+Conditional imputation (e.g. :class:`~sklearn.impute.IterativeImputer` or
+:class:`~sklearn.impute.KNNImputer`) is guaranteed to work only for ignorable
+missingness (i.e. MCAR or MAR settings). When the missingness is not ignorable,
+i.e. in the MNAR setting, adding the mask (`add_indicator=True`) is needed as the
+missingness itself is informative. In practice, real-world data are often MNAR.
+
 Univariate vs. Multivariate Imputation
 ======================================
 
@@ -317,8 +342,8 @@ wrap this in a :class:`Pipeline` with a classifier (e.g., a
 Estimators that handle NaN values
 =================================
 
-Some estimators are designed to handle NaN values without preprocessing.
-Below is the list of these estimators, classified by type
+Some estimators are designed to handle NaN values without preprocessing.
+Below is the list of these estimators, classified by type
 (cluster, regressor, classifier, transform) :
 
 .. allow_nan_estimators::