DOC Added inconsistent preprocessing to pitfalls (#17114) · thomasjpfan/scikit-learn@41fd8aa · GitHub

Commit 41fd8aa

DOC Added inconsistent preprocessing to pitfalls (scikit-learn#17114)
1 parent 8091faf commit 41fd8aa

File tree

1 file changed: +56 -1 lines changed


doc/common_pitfalls.rst

Lines changed: 56 additions & 1 deletion
@@ -15,6 +15,61 @@ anti-patterns that occur when using scikit-learn. It provides
examples of what **not** to do, along with a corresponding correct
example.

Inconsistent preprocessing
==========================

scikit-learn provides a library of :ref:`data-transforms`, which
may clean (see :ref:`preprocessing`), reduce
(see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`)
or generate (see :ref:`feature_extraction`) feature representations.
If these data transforms are used when training a model, they also
must be used on subsequent datasets, whether it's test data or
data in a production system. Otherwise, the feature space will change,
and the model will not be able to perform effectively.

For the following example, let's create a synthetic dataset with a
single feature::

    >>> from sklearn.datasets import make_regression
    >>> from sklearn.model_selection import train_test_split
    ...
    >>> random_state = 42
    >>> X, y = make_regression(random_state=random_state, n_features=1, noise=1)
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.4, random_state=random_state)

**Wrong**

The train dataset is scaled, but not the test dataset, so model
performance on the test dataset is worse than expected::

    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.preprocessing import StandardScaler
    ...
    >>> scaler = StandardScaler()
    >>> X_train_transformed = scaler.fit_transform(X_train)
    >>> model = LinearRegression().fit(X_train_transformed, y_train)
    >>> mean_squared_error(y_test, model.predict(X_test))
    62.80...
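
For comparison, a minimal sketch of the manual fix, which reuses the *fitted*
scaler on the test set (illustrative only, not part of the patch above; the
resulting error is left unchecked)::

    >>> scaler = StandardScaler()
    >>> X_train_transformed = scaler.fit_transform(X_train)
    >>> model = LinearRegression().fit(X_train_transformed, y_train)
    >>> X_test_transformed = scaler.transform(X_test)  # reuse the fitted scaler
    >>> mse = mean_squared_error(y_test, model.predict(X_test_transformed))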

**Right**

A :class:`Pipeline <sklearn.pipeline.Pipeline>` makes it easier to chain
transformations with estimators, and reduces the possibility of
forgetting a transformation::

    >>> from sklearn.pipeline import make_pipeline
    ...
    >>> model = make_pipeline(StandardScaler(), LinearRegression())
    >>> model.fit(X_train, y_train)
    Pipeline(steps=[('standardscaler', StandardScaler()),
                    ('linearregression', LinearRegression())])
    >>> mean_squared_error(y_test, model.predict(X_test))
    0.90...
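
As a usage sketch (illustrative only), the same pipeline can be passed
directly to :func:`~sklearn.model_selection.cross_val_score`, so the scaler
is re-fitted on the training portion of each fold rather than on the full
data::

    >>> from sklearn.model_selection import cross_val_score
    >>> pipe = make_pipeline(StandardScaler(), LinearRegression())
    >>> scores = cross_val_score(pipe, X, y, cv=5)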

.. _data_leakage:

Data leakage
@@ -170,4 +225,4 @@ Below are some tips on avoiding data leakage:
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
  leakage as it ensures that the appropriate method is performed on the
  correct data subset. The pipeline is ideal for use in cross-validation
-  and hyper-parameter tuning functions.
+  and hyper-parameter tuning functions.
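
To illustrate that last tip, a minimal sketch (illustrative only, reusing the
``X_train``/``y_train`` from the example above): a pipeline can be tuned with
:class:`~sklearn.model_selection.GridSearchCV`, with step parameters addressed
via the ``<step>__<parameter>`` convention::

    >>> from sklearn.linear_model import Ridge
    >>> from sklearn.model_selection import GridSearchCV
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.preprocessing import StandardScaler
    ...
    >>> pipe = make_pipeline(StandardScaler(), Ridge())
    >>> param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}
    >>> search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)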

0 commit comments
