Documenting missing-values practices #21967
Comments
I like the general plan. Thanks @A-pl!
I would add that using …
You are right. "No imputation" referred to "no advanced imputation". I think we even discussed briefly whether it would make sense for OneHotEncoder and OrdinalEncoder to allow encoding missing values (instead of only being lenient about them) to simplify the pipeline.
This would be great.
However, this is out of scope for the current plan :)
I agree.
If there is a consensus that the above is a good suggestion (which I really think it is), we could open an issue, but focus on it later.
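For context, a minimal sketch of the kind of pipeline this could simplify, assuming the current workaround of first making missing values an explicit category with `SimpleImputer(strategy="constant")` and then encoding as usual (the toy data is illustrative):

```python
# Illustrative workaround: turn missing values into an explicit category
# before one-hot encoding, so downstream steps never see NaN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["blue"], [np.nan]], dtype=object)

pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)
print(pipe.fit_transform(X).toarray())  # "missing" becomes a third category
```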
#21988 still lets missing values pass through. We could change the passthrough behavior in a follow-up PR, but that would require a deprecation cycle.
I'm excited that this is being worked on. I recently tried to figure out what I thought was a simple thing: how should I decide which of the available imputers to use?
I don't think that exact KNN can be less than n^2: it requires computing the distances between each pair of samples.
In terms of computational time, yes, I agree. But in terms of memory footprint (which was the bottleneck of KNNImputer before #16397), the current implementation achieves less than n^2, since the distance matrix is never stored in full but processed in chunks.
Ah yes. Good point!
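To make the chunking point concrete, a minimal sketch using scikit-learn's public `pairwise_distances_chunked` helper (the toy data and the row-mean reduction are illustrative assumptions, not from this thread):

```python
# Each yielded chunk is a slice of rows of the full n x n distance matrix,
# so peak memory is bounded by `working_memory` (in MiB) rather than n^2.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)

row_means = np.concatenate(
    [chunk.mean(axis=1) for chunk in pairwise_distances_chunked(X, working_memory=64)]
)
print(row_means.shape)  # (10000,) without ever materializing a 10000 x 10000 matrix
```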
I think the images are quite clear, but I find the concept of "underlying missing values" a bit ambiguous. How can we clarify this point for users without a background in the terminology, like myself?
Yes, I agree it could be clearer. I think we can add a paragraph before the figure that explains the setting: every sample has a true value for all of its features, but some of these values are masked. We can also say a word about the fact that, in practice, some features are missing and have no true value behind them (e.g., a "death date" feature for a person who is alive).
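A minimal sketch of that setting (illustrative data, not from the thread): complete data exists, and a mask hides some entries to produce what the estimator observes.

```python
import numpy as np

rng = np.random.RandomState(0)
X_true = rng.normal(size=(5, 3))        # fully observed ground truth
mask = rng.rand(*X_true.shape) < 0.3    # ~30% of entries masked at random

X_observed = X_true.copy()
X_observed[mask] = np.nan               # what the estimator actually sees
print(X_observed)
```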
Describe the issue linked to the documentation
Context
We discussed with @glemaitre and @GaelVaroquaux documenting missing-values practices for prediction in scikit-learn, as part of my PhD work at Inria (discussion here).
Indeed, the current documentation gives no recommendations on this point. There is now a better understanding of missing values in the context of supervised learning, so we have more perspective on the theoretical and practical messages that would help users than when the current documentation was written. We think it would be useful to restructure the documentation and examples to convey these messages.
Messages to convey
Main messages:
Side messages:
Take-home messages:
Resources
Suggest a potential alternative/fix
After discussion with @glemaitre and @GaelVaroquaux, the following changes were suggested.
Big picture
The goal is to give the recommendations above and to provide simple examples that convey the right intuitions (even simple simulated data can be didactic, by showing the basic mechanisms).
Didactic purpose: build intuition for the fact that missing values may distort distributions. Short example.
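A hedged sketch of such an example (the missing-not-at-random mechanism below is an illustrative assumption): when high values are more likely to be missing, the observed mean is biased, and mean imputation then shrinks the variance.

```python
import numpy as np

rng = np.random.RandomState(0)
x_true = rng.normal(size=10_000)

# Missingness depends on the value itself (missing not at random):
# larger values are more likely to be masked.
p_missing = 1 / (1 + np.exp(-2 * x_true))
mask = rng.rand(x_true.size) < p_missing

x_observed = np.where(mask, np.nan, x_true)
print(np.nanmean(x_observed))  # clearly below the true mean of 0

# Mean imputation piles all imputed entries at a single point.
x_imputed = np.where(mask, np.nanmean(x_observed), x_true)
print(x_imputed.var())         # noticeably below the true variance of 1
```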
a. IterativeImputer: give its time complexity (algorithmic scalability) and say it is not a magic bullet in the face of structured missingness.
b. KNNImputer: terrible computational scalability.
c. SimpleImputer: does not work well with simple models (a minimal usage sketch for all three follows this list).
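A minimal usage sketch of the three imputers on toy data (illustrative only; note that `IterativeImputer` is experimental and must be enabled explicitly):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

for imputer in (
    SimpleImputer(strategy="mean"),    # fast, but ignores feature interactions
    KNNImputer(n_neighbors=2),         # needs pairwise distances: poor scalability
    IterativeImputer(random_state=0),  # models each feature from the others
):
    print(type(imputer).__name__)
    print(imputer.fit_transform(X))
```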
Proposed roadmap
(this refers to the items above and can be detailed in a project board)