[MRG+2] Rename `MICEImputer` to `ChainedImputer` by sergeyf · Pull Request #11314 · scikit-learn/scikit-learn · GitHub

[MRG+2] Rename MICEImputer to ChainedImputer #11314

Merged
merged 14 commits into scikit-learn:master from the mice_rename branch
Jun 22, 2018

Conversation

Contributor
@sergeyf sergeyf commented Jun 18, 2018

Addresses a point raised in #8478 and #11259. The MICEImputer isn't really a MICE imputer because we're only keeping a single averaged imputation by default. This is a purely cosmetic PR so that end-users don't get as confused.

CC @glemaitre @jnothman @RianneSchouten

@sergeyf sergeyf changed the title Rename MICEImputer to ChainedImputer [MRG] Rename MICEImputer to ChainedImputer Jun 18, 2018
Member
@jnothman jnothman left a comment

LGTM

@sergeyf sergeyf changed the title [MRG] Rename MICEImputer to ChainedImputer [MRG+1] Rename MICEImputer to ChainedImputer Jun 19, 2018
@sergeyf sergeyf closed this Jun 19, 2018
@sergeyf sergeyf reopened this Jun 19, 2018
@sergeyf
Contributor Author
sergeyf commented Jun 20, 2018

Getting some urllib.error.HTTPError: HTTP Error 500: Internal Server Error =(. Will wait a few hours and try again.

@sergeyf sergeyf closed this Jun 20, 2018
@sergeyf sergeyf reopened this Jun 20, 2018
@sergeyf sergeyf closed this Jun 20, 2018
@sergeyf sergeyf reopened this Jun 20, 2018
@sergeyf sergeyf closed this Jun 20, 2018
@sergeyf sergeyf reopened this Jun 20, 2018
@qinhanmin2014
Member

@sergeyf Master is failing, please merge master in after #11318 is merged.

@qinhanmin2014
Member

@sergeyf Any reference for the new name?
Also, I still find things like "A more sophisticated approach is to use the :class:`ChainedImputer` class, which implements the technique of Multivariate Imputation by Chained Equations. MICE...", which confuses me.

@sergeyf
Contributor Author
sergeyf commented Jun 20, 2018

@qinhanmin2014 thanks re: master, I'll do that.

Re: name. So the point here is that we're not actually doing MICE, because the traditional way of applying the MICE technique is to get multiple imputations, and this class only returns a single one. But the full name is multivariate imputation by chained equations (MICE), hence the name ChainedImputer. I guess I can put that language in directly.

@jnothman
Member

Sorry, I broke master.

@jnothman
Member

Part of the confusion is that:

  • the paper says MICE = Multiple Imputation by Chained Equations
  • the R implementation says MICE = Multivariate Imputation by Chained Equations

@sergeyf
Contributor Author
sergeyf commented Jun 20, 2018

@jnothman Which paper? Stef's paper that I reference is this one: "mice: Multivariate Imputation by Chained Equations in R" https://www.jstatsoft.org/article/view/v045i03

There's also this paper, but it's not referenced in this PR: "Multiple Imputation by Chained Equations: What is it and how does it work?" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/

@qinhanmin2014 will you please take a look at my changes to impute.rst and let me know if it's sufficiently clear now.

@jnothman
Member

You're right. I've just been confused by people implying that the M must mean Multiple.... because of course what we're doing here is Multivariate Imputation by Chaining.... just not Multiple.

@sergeyf
Contributor Author
sergeyf commented Jun 20, 2018

I have confused that at least 5 times while writing the original PR =)

A more sophisticated approach is to use the :class:`ChainedImputer` class, which
implements MICE: Multivariate Imputation by Chained Equations. MICE is usually used
to generate multiple imputations, but :class:`ChainedImputer` generates a single
(averaged) imputation, which is why it is not named `MICEImputer`. MICE models each
Member

I think we need to link those terms to a glossary entry that explains:

  • univariate vs multivariate imputation
  • single vs multiple imputation (increasing the number of samples in the resulting transformed dataset).

Contributor Author

Good idea. There is no entry for imputation. I'll add one and include these points.
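
To make the first distinction in the list above concrete, here is a minimal pure-Python sketch. It is not the `ChainedImputer` implementation: `univariate_impute` and `multivariate_impute` are hypothetical helper names, and the multivariate case is reduced to a single least-squares line fitted on one fully observed feature.

```python
import math


def univariate_impute(column):
    """Fill NaNs in one feature using only that feature's observed mean."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]


def multivariate_impute(target, predictor):
    """Fill NaNs in `target` from a least-squares line fitted on `predictor`.

    Toy stand-in for multivariate imputation: the missing feature is
    modelled as a function of another feature rather than of itself alone.
    Assumes `predictor` has no missing values.
    """
    pairs = [(x, y) for x, y in zip(predictor, target) if not math.isnan(y)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if math.isnan(y) else y
            for x, y in zip(predictor, target)]


nan = float("nan")
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, nan, 8.0]  # y equals 2 * x wherever it is observed

uni = univariate_impute(y)        # uses the mean of y only
multi = multivariate_impute(y, x)  # uses the relationship between x and y
```

In this toy data the univariate fill is the column mean (about 4.67), while the multivariate fill follows the relationship with `x` and lands close to 6.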

Member
@ogrisel ogrisel left a comment

Besides the above comment on the missing glossary entry, +1 for merge.

@ogrisel ogrisel added this to the 0.20 milestone Jun 20, 2018
@sergeyf
Contributor Author
sergeyf commented Jun 20, 2018

@ogrisel I added a glossary entry and rewrote the entries in impute.rst. Please take a look and confirm that it satisfies your request.

@sergeyf sergeyf changed the title [MRG+1] Rename MICEImputer to ChainedImputer [MRG+2] Rename MICEImputer to ChainedImputer Jun 20, 2018
@RianneSchouten
RianneSchouten commented Jun 20, 2018 via email

Member
@ogrisel ogrisel left a comment

The glossary entry for imputation is significantly longer than the average entry.

Maybe we should keep a short entry in the glossary and move the discussion of univariate vs multivariate vs single vs multiple imputations into a dedicated section at the end of the missing value imputation chapter of the narrative documentation.

Alternatively, we could split the entries: univariate imputation, multivariate imputation, multiple imputations and add cross-links. See for instance the entries on "inductive" vs "transductive".

impute
imputation
Most machine learning algorithms require that their inputs have no
:term:`missing values`, and will not work if this requirement is
Member

Please add a backlink to the :term:`imputation` entry from the :term:`missing values` entry.

Contributor Author

Done.

doc/glossary.rst Outdated
The above practice is called multiple imputation. The
:class:`impute.ChainedImputer` class can be used for multiple
imputations by applying it to the same dataset with different random
seeds.
Member

I would further refine:

... imputations by applying it repeatedly to the same dataset with different random seeds.
Note that a call to the transform method of a :term:`transformer` is not allowed to change the number of samples. Therefore multiple imputations cannot be achieved by a single call to transform.

Contributor Author

Will add.
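
The "repeat with different random seeds" idea from the suggested wording above can be sketched as follows. This is a toy illustration under stated assumptions, not the `ChainedImputer` API: `stochastic_impute` is a hypothetical stand-in that fills each missing value with a random draw from the observed values of the same column, and `multiple_imputations` is a hypothetical wrapper that calls it once per seed.

```python
import math
import random


def stochastic_impute(rows, seed):
    """Toy stochastic imputer (NOT ChainedImputer).

    Each NaN is replaced by a value drawn at random from the observed
    values of the same column. The number of rows never changes.
    """
    rng = random.Random(seed)
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if not math.isnan(r[j])]
                for j in range(n_cols)]
    return [[rng.choice(observed[j]) if math.isnan(v) else v
             for j, v in enumerate(row)]
            for row in rows]


def multiple_imputations(rows, seeds):
    """Return one completed copy of the dataset per seed.

    A single call cannot return several datasets without changing the
    number of samples, so the caller collects the repeated calls.
    """
    return [stochastic_impute(rows, seed) for seed in seeds]


nan = float("nan")
data = [[1.0, 10.0], [2.0, nan], [nan, 30.0], [4.0, 40.0]]
completed = multiple_imputations(data, seeds=[0, 1, 2])
```

Each element of `completed` has the same shape as `data`; downstream analysis would be run once per completed copy and the results pooled.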

@sergeyf
Contributor Author
sergeyf commented Jun 21, 2018

@ogrisel Done! I moved most of the discussion to impute.rst

@sergeyf
Contributor Author
sergeyf commented Jun 21, 2018

Hmm tests are failing because of StandardScaler?

=================================== FAILURES ===================================
 test_non_meta_estimators[StandardScaler-StandardScaler-check_estimators_nan_inf] 
name = 'StandardScaler'
Estimator = <class 'sklearn.preprocessing.data.StandardScaler'>
check = <function check_estimators_nan_inf at 0x7fe165faca60>

@@ -77,8 +77,8 @@
 'RANSACRegressor', 'RadiusNeighborsRegressor',
 'RandomForestRegressor', 'Ridge', 'RidgeCV']

-ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MICEImputer',
-'MinMaxScaler', 'StandardScaler', 'QuantileTransformer']
+ALLOW_NAN = ['Imputer', 'SimpleImputer', 'ChainedImputer',
Member

Add the StandardScaler back here to remove the error.

Contributor Author

Oh dang that was my bad on the merging. Thanks!
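
For reference, the fix requested in the review comment above would presumably leave the list looking like the sketch below. The second line of the new list is not shown in the diff excerpt, so the remaining names are an assumption carried over from the old list; `allows_nan` is a hypothetical helper added only to show how the list is meant to be read.

```python
# Assumed final form of the list after restoring 'StandardScaler'.
# Only the first line appears in the diff excerpt; the rest is copied
# from the removed lines.
ALLOW_NAN = ['Imputer', 'SimpleImputer', 'ChainedImputer',
             'MinMaxScaler', 'StandardScaler', 'QuantileTransformer']


def allows_nan(name):
    """Hypothetical helper: True if the named estimator accepts NaN input."""
    return name in ALLOW_NAN
```

With `'StandardScaler'` back in the list, the NaN/inf check that failed in the log above would no longer be expected to raise for that estimator.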

@ogrisel
Member
ogrisel commented Jun 22, 2018

Thank you very much @sergeyf!

@ogrisel ogrisel merged commit 93382cc into scikit-learn:master Jun 22, 2018
@sergeyf sergeyf deleted the mice_rename branch June 22, 2018 15:48
jorisvandenbossche added a commit to jorisvandenbossche/scikit-learn that referenced this pull request Jul 17, 2018
6 participants