DOC Added inconsistent preprocessing to pitfalls (#17114) · thomasjpfan/scikit-learn@41fd8aa · GitHub

Commit 41fd8aa

DOC Added inconsistent preprocessing to pitfalls (scikit-learn#17114)
1 parent 8091faf commit 41fd8aa

File tree

1 file changed: +56 -1 lines changed


doc/common_pitfalls.rst

Lines changed: 56 additions & 1 deletion
@@ -15,6 +15,61 @@ anti-patterns that occur when using scikit-learn. It provides
examples of what **not** to do, along with a corresponding correct
example.

Inconsistent preprocessing
==========================

scikit-learn provides a library of :ref:`data-transforms`, which
may clean (see :ref:`preprocessing`), reduce
(see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`)
or generate (see :ref:`feature_extraction`) feature representations.
If these data transforms are used when training a model, they also
must be used on subsequent datasets, whether it's test data or
data in a production system. Otherwise, the feature space will change,
and the model will not be able to perform effectively.

For the following example, let's create a synthetic dataset with a
single feature::

    >>> from sklearn.datasets import make_regression
    >>> from sklearn.model_selection import train_test_split
    ...
    >>> random_state = 42
    >>> X, y = make_regression(random_state=random_state, n_features=1, noise=1)
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.4, random_state=random_state)

**Wrong**

The train dataset is scaled, but not the test dataset, so model
performance on the test dataset is worse than expected::

    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.preprocessing import StandardScaler
    ...
    >>> scaler = StandardScaler()
    >>> X_train_transformed = scaler.fit_transform(X_train)
    >>> model = LinearRegression().fit(X_train_transformed, y_train)
    >>> mean_squared_error(y_test, model.predict(X_test))
    62.80...
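
For comparison, a minimal sketch of the manual fix, which reuses the *fitted*
scaler on the test set (illustrative only, not part of the patch above; the
resulting error is left unchecked)::

    >>> scaler = StandardScaler()
    >>> X_train_transformed = scaler.fit_transform(X_train)
    >>> model = LinearRegression().fit(X_train_transformed, y_train)
    >>> X_test_transformed = scaler.transform(X_test)  # reuse the fitted scaler
    >>> mse = mean_squared_error(y_test, model.predict(X_test_transformed))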

**Right**

A :class:`Pipeline <sklearn.pipeline.Pipeline>` makes it easier to chain
transformations with estimators, and reduces the possibility of
forgetting a transformation::

    >>> from sklearn.pipeline import make_pipeline
    ...
    >>> model = make_pipeline(StandardScaler(), LinearRegression())
    >>> model.fit(X_train, y_train)
    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('linearregression', LinearRegression())])
    >>> mean_squared_error(y_test, model.predict(X_test))
    0.90...
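
As a usage sketch (illustrative only), the same pipeline can be passed
directly to :func:`~sklearn.model_selection.cross_val_score`, so the scaler
is re-fitted on the training portion of each fold rather than on the full
data::

    >>> from sklearn.model_selection import cross_val_score
    >>> pipe = make_pipeline(StandardScaler(), LinearRegression())
    >>> scores = cross_val_score(pipe, X, y, cv=5)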

.. _data_leakage:

Data leakage
@@ -170,4 +225,4 @@ Below are some tips on avoiding data leakage:
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
  leakage as it ensures that the appropriate method is performed on the
  correct data subset. The pipeline is ideal for use in cross-validation
-  and hyper-parameter tuning functions.
+  and hyper-parameter tuning functions.
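
To illustrate that last tip, a minimal sketch (illustrative only, reusing the
``X_train``/``y_train`` from the example above): a pipeline can be tuned with
:class:`~sklearn.model_selection.GridSearchCV`, with step parameters addressed
via the ``<step>__<parameter>`` convention::

    >>> from sklearn.linear_model import Ridge
    >>> from sklearn.model_selection import GridSearchCV
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    ...
    >>> pipe = make_pipeline(StandardScaler(), Ridge())
    >>> param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}
    >>> search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)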

0 commit comments
