@@ -15,6 +15,61 @@ anti-patterns that occur when using scikit-learn. It provides
examples of what **not** to do, along with a corresponding correct
example.

+ Inconsistent preprocessing
+ ==========================
+
+ scikit-learn provides a library of :ref:`data-transforms`, which
+ may clean (see :ref:`preprocessing`), reduce
+ (see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`)
+ or generate (see :ref:`feature_extraction`) feature representations.
+ If these data transforms are used when training a model, they must
+ also be used on subsequent datasets, whether it's test data or
+ data in a production system. Otherwise, the feature space will change,
+ and the model will not be able to perform effectively.
+
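To make the "feature space will change" point concrete, here is a small illustrative sketch (not part of the guide; the data values are made up) showing what a model sees when a fitted transform is skipped at prediction time:

```python
# Illustrative sketch: made-up values showing how the feature space
# shifts when a fitted transform is not reapplied to new data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100.0], [200.0], [300.0]])
scaler = StandardScaler().fit(X_train)

# The model is trained on standardized values, roughly in [-1.25, 1.25]:
print(scaler.transform(X_train).ravel())  # [-1.2247...  0.  1.2247...]

# A raw production sample fed in without scaler.transform() lands far
# outside that range, so the fitted coefficients no longer apply:
X_new = np.array([[250.0]])
print(X_new.ravel())                      # [250.] -- raw, unscaled
print(scaler.transform(X_new).ravel())    # [0.6123...] -- the intended input
```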
+ For the following example, let's create a synthetic dataset with a
+ single feature::
+
+ >>> from sklearn.datasets import make_regression
+ >>> from sklearn.model_selection import train_test_split
+ ...
+ >>> random_state = 42
+ >>> X, y = make_regression(random_state=random_state, n_features=1, noise=1)
+ >>> X_train, X_test, y_train, y_test = train_test_split(
+ ...     X, y, test_size=0.4, random_state=random_state)
+
+ **Wrong**
+
+ The train dataset is scaled, but not the test dataset, so model
+ performance on the test dataset is worse than expected::
+
+ >>> from sklearn.metrics import mean_squared_error
+ >>> from sklearn.linear_model import LinearRegression
+ >>> from sklearn.preprocessing import StandardScaler
+ ...
+ >>> scaler = StandardScaler()
+ >>> X_train_transformed = scaler.fit_transform(X_train)
+ >>> model = LinearRegression().fit(X_train_transformed, y_train)
+ >>> mean_squared_error(y_test, model.predict(X_test))
+ 62.80...
+
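For contrast, here is a sketch of the non-pipeline fix: the *same fitted* scaler must be applied to the test set via `transform` (never a second `fit_transform`, which would compute new statistics from the test data). It recreates the dataset from the snippets above; `X_test_transformed` is an illustrative name:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the synthetic dataset used in this section.
random_state = 42
X, y = make_regression(random_state=random_state, n_features=1, noise=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=random_state)

scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)
model = LinearRegression().fit(X_train_transformed, y_train)

# Apply the scaler fitted on the training data to the test data.
X_test_transformed = scaler.transform(X_test)
print(mean_squared_error(y_test, model.predict(X_test_transformed)))
```

Because the scaler and regressor are the same, this should match the pipeline result shown under **Right**; the pipeline simply makes the transform step impossible to forget.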
+ **Right**
+
+ A :class:`Pipeline <sklearn.pipeline.Pipeline>` makes it easier to chain
+ transformations with estimators, and reduces the possibility of
+ forgetting a transformation::
+
+ >>> from sklearn.pipeline import make_pipeline
+ ...
+ >>> model = make_pipeline(StandardScaler(), LinearRegression())
+ >>> model.fit(X_train, y_train)
+ Pipeline(steps=[('standardscaler', StandardScaler()),
+                 ('linearregression', LinearRegression())])
+ >>> mean_squared_error(y_test, model.predict(X_test))
+ 0.90...
+
.. _data_leakage:

Data leakage
@@ -170,4 +225,4 @@ Below are some tips on avoiding data leakage:
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
  leakage as it ensures that the appropriate method is performed on the
  correct data subset. The pipeline is ideal for use in cross-validation
- and hyper-parameter tuning functions.
+ and hyper-parameter tuning functions.
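As a sketch of that last tip, reusing the scaler-plus-regressor setup from the Inconsistent preprocessing section above: passing the whole pipeline to `cross_val_score` means the scaler is re-fitted on each fold's training split only, so the held-out split never leaks into preprocessing.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(random_state=42, n_features=1, noise=1)
model = make_pipeline(StandardScaler(), LinearRegression())

# cross_val_score clones and re-fits the pipeline per fold, so the
# StandardScaler statistics come from that fold's training portion only.
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(scores.mean())
```

Scaling the full `X` up front and cross-validating on the result would, by contrast, leak each validation fold's statistics into training.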