[MRG] IterativeImputer extended example by sergeyf · Pull Request #12100 · scikit-learn/scikit-learn · GitHub

[MRG] IterativeImputer extended example #12100


Merged
Changes from all commits · 42 commits
cdc7183
first commit
sergeyf Sep 17, 2018
493a27a
undoing space
sergeyf Sep 17, 2018
75ad2da
newline
sergeyf Sep 17, 2018
1465f77
fixing bug in plot_missing_values and adding a bit more performance
sergeyf Sep 17, 2018
ea84910
slight clarification
sergeyf Sep 17, 2018
753f60d
another example
sergeyf Sep 18, 2018
2ec0401
fixing tests
sergeyf Sep 18, 2018
39fb7f7
modularizing plot_missing_values
sergeyf Sep 26, 2018
de2c307
fixing cut off plot
sergeyf Sep 26, 2018
c135653
updating narrative docs
sergeyf Sep 26, 2018
a8768c4
default for verbose should be 0
sergeyf Sep 26, 2018
4368c31
fixing doc error
sergeyf Sep 27, 2018
c5595b8
addressing some comments
sergeyf Oct 16, 2018
e0e90be
making example more interesting
sergeyf Oct 17, 2018
75f94af
Reverting v0.21 to master
sergeyf Jan 16, 2019
0dff9a0
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 16, 2019
755dde5
Responding to reviewer comments.
sergeyf Jan 16, 2019
7a6a7b4
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 16, 2019
22dc1ef
Updating v0.20
sergeyf Jan 16, 2019
3b0fae9
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 17, 2019
3f059df
Revert changes to v0.21.rst
sergeyf Jan 17, 2019
2442272
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 17, 2019
2ed49af
fixing doctest impute.rst
sergeyf Jan 17, 2019
dae2e3a
fixing doctest impute.rst v2
sergeyf Jan 17, 2019
cc9ae8b
fixing doctest impute.rst v3
sergeyf Jan 17, 2019
bd0be11
One more try with expected/actual issue in impute.rst
sergeyf Jan 17, 2019
d2c3357
updating to 26
sergeyf Jan 17, 2019
af3a056
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 17, 2019
1862379
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 24, 2019
b2f2b54
Updating RidgeCV to BayesianRidge to be more in line with the default
sergeyf Jan 24, 2019
b305620
updating to glemaitre's plot
sergeyf Jan 24, 2019
c8dccb4
line lengths for impute.rst
sergeyf Jan 24, 2019
a102cec
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
57b83d6
fixing y-axis labels
sergeyf Jan 24, 2019
5e9b45f
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 24, 2019
a86b535
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
c0b7439
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
332a01d
minor change
sergeyf Jan 24, 2019
3ab4686
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 24, 2019
46e21dc
addressing reviewer comments
sergeyf Jan 24, 2019
303d026
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 24, 2019
ddab109
reordering predictors
sergeyf Jan 25, 2019
70 changes: 41 additions & 29 deletions doc/modules/impute.rst
@@ -9,19 +9,19 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
array are numerical, and that all have and hold meaning. A basic strategy to use
incomplete datasets is to discard entire rows and/or columns containing missing
values. However, this comes at the price of losing data which may be valuable
(even though incomplete). A better strategy is to impute the missing values,
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.
array are numerical, and that all have and hold meaning. A basic strategy to
use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be
valuable (even though incomplete). A better strategy is to impute the missing
values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the i-th
feature dimension using only non-missing values in that feature dimension
One type of imputation algorithm is univariate, which imputes values in the
i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).
@@ -66,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
[6. 3.]
[7. 6.]]

Note that this format is not meant to be used to implicitly store missing values
in the matrix because it would densify it at transform time. Missing values encoded
by 0 must be used with dense input.
Note that this format is not meant to be used to implicitly store missing
values in the matrix because it would densify it at transform time. Missing
values encoded by 0 must be used with dense input.

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -110,31 +110,43 @@ round are returned.
IterativeImputer(imputation_order='ascending', initial_strategy='mean',
max_value=None, min_value=None, missing_values=nan, n_iter=10,
n_nearest_features=None, predictor=None, random_state=0,
sample_posterior=False, verbose=False)
sample_posterior=False, verbose=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1. 2.]
[ 6. 12.]
[ 3. 6.]]

Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
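As a rough illustration (a minimal sketch, not part of this PR's diff; it assumes
the direct ``sklearn.impute.IterativeImputer`` import path used in this branch),
such a composite estimator can be built with :func:`~sklearn.pipeline.make_pipeline`::

    import numpy as np
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import BayesianRidge
    from sklearn.pipeline import make_pipeline

    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])
    y = np.array([3., 9., 12., 4., 14.])

    # the imputer fills in missing entries before the regressor sees the data
    model = make_pipeline(IterativeImputer(random_state=0), BayesianRidge())
    model.fit(X, y)
    print(model.predict([[np.nan, 5.]]))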

Flexibility of IterativeImputer
-------------------------------

There are many well-established imputation packages in the R data science
ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
out to be a particular instance of different sequential imputation algorithms
that can all be implemented with :class:`IterativeImputer` by passing in
different regressors to be used for predicting missing feature values. In the
case of missForest, this regressor is a Random Forest.
See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
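For example (a minimal sketch, not part of the diff; the ``predictor`` parameter
name follows the API introduced in this PR), a missForest-like imputer can be
obtained by passing a forest of randomized trees as the regressor::

    import numpy as np
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import ExtraTreesRegressor

    # ExtraTreesRegressor stands in for missForest's random forest; it is
    # typically faster while behaving similarly
    imp = IterativeImputer(
        predictor=ExtraTreesRegressor(n_estimators=10, random_state=0),
        random_state=0)
    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])
    print(np.round(imp.fit_transform(X), 1))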


.. _multiple_imputation:

Multiple vs. Single Imputation
==============================
------------------------------

In the statistics community, it is common practice to perform multiple imputations,
generating, for example, ``m`` separate imputations for a single feature matrix.
Each of these ``m`` imputations is then put through the subsequent analysis pipeline
(e.g. feature engineering, clustering, regression, classification). The ``m`` final
analysis results (e.g. held-out validation errors) allow the data scientist
to obtain understanding of how analytic results may differ as a consequence
of the inherent uncertainty caused by the missing values. The above practice
is called multiple imputation.
In the statistics community, it is common practice to perform multiple
imputations, generating, for example, ``m`` separate imputations for a single
feature matrix. Each of these ``m`` imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The ``m`` final analysis results (e.g. held-out validation
errors) allow the data scientist to obtain understanding of how analytic
results may differ as a consequence of the inherent uncertainty caused by the
missing values. The above practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
@@ -144,13 +156,13 @@ it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.
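A rough sketch of this usage (not part of the diff; it relies on the default
``BayesianRidge`` predictor, which can return a predictive standard deviation
and therefore supports ``sample_posterior=True``)::

    import numpy as np
    from sklearn.impute import IterativeImputer

    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])

    # m complete datasets, one per random seed; downstream analyses can be run
    # on each and the spread of their results reflects imputation uncertainty
    m = 5
    imputations = [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(m)
    ]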

It is still an open problem as to how useful single vs. multiple imputation is in
the context of prediction and classification when the user is not interested in
measuring uncertainty due to missing values.
It is still an open problem as to how useful single vs. multiple imputation is
in the context of prediction and classification when the user is not
interested in measuring uncertainty due to missing values.

Note that a call to the ``transform`` method of :class:`IterativeImputer` is not
allowed to change the number of samples. Therefore multiple imputations cannot be
achieved by a single call to ``transform``.
Note that a call to the ``transform`` method of :class:`IterativeImputer` is
not allowed to change the number of samples. Therefore multiple imputations
cannot be achieved by a single call to ``transform``.

References
==========
126 changes: 126 additions & 0 deletions examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -0,0 +1,126 @@
"""
=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of predictors to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some predictors for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

:class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
:class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
:class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
:class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare different predictors to see which one is best for the
:class:`sklearn.impute.IterativeImputer` when using a
:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
dataset with a single value randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Estimate the score on the entire dataset, with no missing values
br_estimator = BayesianRidge()
score_full_data = pd.DataFrame(
cross_val_score(
br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
cv=N_SPLITS
),
columns=['Full Data']
)

# Add a single missing value to each row
X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies)
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
estimator = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy=strategy),
br_estimator
)
score_simple_imputer[strategy] = cross_val_score(
estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
cv=N_SPLITS
)

# Estimate the score after iterative imputation of the missing values
# with different predictors
predictors = [
BayesianRidge(),
Member:
Is there a reason for this ordering? It doesn't match the one above

Contributor Author:
I'm confused. It's the same order as in the docstring?

    :class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
    :class:`sklearn.tree.DecisionTreeRegressor`: non-linear regression
    :class:`sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN imputation approaches
    :class:`sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R

Member:
Oh, maybe I misread.

DecisionTreeRegressor(max_features='sqrt', random_state=0),
ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
for predictor in predictors:
estimator = make_pipeline(
IterativeImputer(random_state=0, predictor=predictor),
br_estimator
)
score_iterative_imputer[predictor.__class__.__name__] = \
cross_val_score(
estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
cv=N_SPLITS
)

scores = pd.concat(
[score_full_data, score_simple_imputer, score_iterative_imputer],
keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

# plot california housing results
fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('California Housing Regression with Different Imputation Methods')
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
plt.tight_layout(pad=1)
plt.show()
58 changes: 32 additions & 26 deletions examples/impute/plot_missing_values.py
@@ -12,12 +12,13 @@
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
features are obviously non-Normal, consider transforming them to look more
Normal so as to improve performance.
Normal so as to potentially improve performance.

In addition to using an imputing method, we can also keep an indication of which
values were missing using :class:`sklearn.impute.MissingIndicator`, which might
carry some information.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
@@ -31,16 +32,29 @@

rng = np.random.RandomState(0)

N_SPLITS = 5
REGRESSOR = RandomForestRegressor(random_state=0, n_estimators=100)


def get_scores_for_imputer(imputer, X_missing, y_missing):
estimator = make_pipeline(
make_union(imputer, MissingIndicator(missing_values=0)),
REGRESSOR)
impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=N_SPLITS)
return impute_scores


def get_results(dataset):
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
full_scores = cross_val_score(estimator, X_full, y_full,
scoring='neg_mean_squared_error', cv=5)
full_scores = cross_val_score(REGRESSOR, X_full, y_full,
scoring='neg_mean_squared_error',
cv=N_SPLITS)

# Add missing values in 75% of the lines
missing_rate = 0.75
@@ -51,35 +65,27 @@ def get_results(dataset):
dtype=np.bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score after replacing missing values by 0
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
zero_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=5)

# Estimate the score after replacing missing values by 0
imputer = SimpleImputer(missing_values=0,
strategy='constant',
fill_value=0)
zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)

# Estimate the score after imputation (mean strategy) of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = make_pipeline(
make_union(SimpleImputer(missing_values=0, strategy="mean"),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=5)
imputer = SimpleImputer(missing_values=0, strategy="mean")
mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)

# Estimate the score after iterative imputation of the missing values
estimator = make_pipeline(
make_union(IterativeImputer(missing_values=0, random_state=0),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error')
imputer = IterativeImputer(missing_values=0,
random_state=0,
n_nearest_features=5)
iterative_impute_scores = get_scores_for_imputer(imputer,
X_missing,
y_missing)

return ((full_scores.mean(), full_scores.std()),
(zero_impute_scores.mean(), zero_impute_scores.std()),
2 changes: 1 addition & 1 deletion sklearn/impute.py
@@ -556,7 +556,7 @@ def __init__(self,
initial_strategy="mean",
min_value=None,
max_value=None,
verbose=False,
verbose=0,
random_state=None):

self.missing_values = missing_values