Documenting missing-values practices #21967
Comments
I like the general plan. Thanks @A-pl!
I would add that using …
You are right. "No imputation" referred to "no advanced imputation". I think we even discussed briefly whether it would make sense for OneHotEncoder and OrdinalEncoder to allow encoding missing values (instead of only being lenient about them) to simplify the pipeline.
This would be great.
However, this is out of scope for the current plan :)
I agree.
If there is a consensus that the above is a good suggestion (which I really think it is), we could open an issue, but focus on it later.
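For context, a minimal sketch of the kind of pipeline this could simplify, assuming the current workaround of first making missing values an explicit category with `SimpleImputer(strategy="constant")` and then encoding as usual (the toy data is illustrative):

```python
# Illustrative workaround: turn missing values into an explicit category
# before one-hot encoding, so downstream steps never see NaN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["blue"], [np.nan]], dtype=object)

pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)
print(pipe.fit_transform(X).toarray())  # "missing" becomes a third category
```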
#21988 still lets missing values pass through. We could change the passthrough behavior in a follow-up PR, but that would require a deprecation cycle.
I'm excited that this is being worked on. I recently tried to figure out what I thought was a simple thing: how should I decide which of the available imputers to use?
I don't think that exact KNN can be less than n^2: it requires computing the distances between each pair of samples.
In terms of computational time, yes, I agree. But in terms of memory footprint (which was the bottleneck of KNNImputer before #16397), the current implementation achieves less than n^2, since the distance matrix is never stored in full but processed in chunks.
Ah yes. Good point!
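To make the chunking point concrete, a minimal sketch using scikit-learn's public `pairwise_distances_chunked` helper (the toy data and the row-mean reduction are illustrative assumptions, not from this thread):

```python
# Each yielded chunk is a slice of rows of the full n x n distance matrix,
# so peak memory is bounded by `working_memory` (in MiB) rather than n^2.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)

row_means = np.concatenate(
    [chunk.mean(axis=1) for chunk in pairwise_distances_chunked(X, working_memory=64)]
)
print(row_means.shape)  # (10000,) without ever materializing a 10000 x 10000 matrix
```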
I think the images are quite clear, but I find the concept of "underlying missing values" a bit ambiguous. How can we clarify this point for users without a background in the terminology, like myself?
Yes, I agree it could be clearer. I think we can add a paragraph before the figure that explains the setting: every sample has a true value for all of its features, but some of these values are masked. We can also say a word about the fact that, in practice, some features are missing and have no true value behind them (e.g., a "death date" feature for a person who is alive).
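A minimal sketch of that setting (illustrative data, not from the thread): complete data exists, and a mask hides some entries to produce what the estimator observes.

```python
import numpy as np

rng = np.random.RandomState(0)
X_true = rng.normal(size=(5, 3))        # fully observed ground truth
mask = rng.rand(*X_true.shape) < 0.3    # ~30% of entries masked at random

X_observed = X_true.copy()
X_observed[mask] = np.nan               # what the estimator actually sees
print(X_observed)
```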
Describe the issue linked to the documentation
Context
We discussed with @glemaitre and @GaelVaroquaux documenting missing-values practices for prediction in scikit-learn, as part of my PhD work at Inria (discussion here).
Indeed, the current documentation gives no recommendations on this point. There is now a better understanding of missing values in the context of supervised learning, so we have more perspective on the theoretical and practical messages that would help users than when the current documentation was written. We think it would be useful to restructure the documentation and examples to convey these messages.
Messages to convey
Main messages:
Side messages:
Take-home messages:
Resources
Suggest a potential alternative/fix
After discussion with @glemaitre and @GaelVaroquaux, the following changes were suggested.
Big picture
The goal is to give the recommendations above and to provide simple examples that convey the right intuitions (even simple simulated data can be didactic, by showing the basic mechanisms).
Didactic purpose: build intuition for the fact that missing values may distort distributions. Short example.
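A hedged sketch of such an example (the missing-not-at-random mechanism below is an illustrative assumption): when high values are more likely to be missing, the observed mean is biased, and mean imputation then shrinks the variance.

```python
import numpy as np

rng = np.random.RandomState(0)
x_true = rng.normal(size=10_000)

# Missingness depends on the value itself (missing not at random):
# larger values are more likely to be masked.
p_missing = 1 / (1 + np.exp(-2 * x_true))
mask = rng.rand(x_true.size) < p_missing

x_observed = np.where(mask, np.nan, x_true)
print(np.nanmean(x_observed))  # clearly below the true mean of 0

# Mean imputation piles all imputed entries at a single point.
x_imputed = np.where(mask, np.nanmean(x_observed), x_true)
print(x_imputed.var())         # noticeably below the true variance of 1
```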
a. IterativeImputer: give its time complexity (algorithmic scalability) and say it is not a magic bullet in the face of structured missingness.
b. KNNImputer: terrible computational scalability.
c. SimpleImputer: does not work well with simple models (a minimal usage sketch for all three follows this list).
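A minimal usage sketch of the three imputers on toy data (illustrative only; note that `IterativeImputer` is experimental and must be enabled explicitly):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

for imputer in (
    SimpleImputer(strategy="mean"),    # fast, but ignores feature interactions
    KNNImputer(n_neighbors=2),         # needs pairwise distances: poor scalability
    IterativeImputer(random_state=0),  # models each feature from the others
):
    print(type(imputer).__name__)
    print(imputer.fit_transform(X))
```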
Proposed roadmap
(this refers to the items above and can be detailed in a project board)