DOC adds dropdown for 10.3 Controlling Randomness by lazarust · Pull Request #26946 · scikit-learn/scikit-learn · GitHub
Merged
43 changes: 24 additions & 19 deletions doc/common_pitfalls.rst
@@ -104,6 +104,26 @@ be the average of the train subset, **not** the average of all the data. If the
test subset is included in the average calculation, information from the test
subset is influencing the model.

How to avoid data leakage
-------------------------

Below are some tips on avoiding data leakage:

* Always split the data into train and test subsets first, particularly
before any preprocessing steps.
* Never include test data when using the `fit` and `fit_transform`
methods. Using all the data, e.g., `fit(X)`, can result in overly optimistic
scores.

Conversely, the `transform` method should be used on both train and test
subsets as the same preprocessing should be applied to all the data.
This can be achieved by using `fit_transform` on the train subset and
`transform` on the test subset.
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
leakage as it ensures that the appropriate method is performed on the
correct data subset. The pipeline is ideal for use in cross-validation
and hyper-parameter tuning functions.
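The tips above can be condensed into a short sketch (the scaler and classifier choices here are illustrative assumptions, not something the guide prescribes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# 1. Split first, before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Fit the scaler on the train subset only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and only `transform` the test subset, so no test statistics leak in.
X_test_scaled = scaler.transform(X_test)

# 3. A pipeline applies fit/transform to the correct subset automatically,
#    which is what makes it safe inside cross-validation.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_train, y_train, cv=5)
```

Inside `cross_val_score`, the pipeline re-fits the scaler on each training fold only, so the held-out fold never influences the preprocessing.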

An example of data leakage during preprocessing is detailed below.

Data leakage during pre-processing
@@ -213,25 +233,6 @@ method is used during fitting and predicting::
>>> print(f"Mean accuracy: {scores.mean():.2f}+/-{scores.std():.2f}")
Mean accuracy: 0.46+/-0.07

How to avoid data leakage
-------------------------

Below are some tips on avoiding data leakage:

* Always split the data into train and test subsets first, particularly
before any preprocessing steps.
* Never include test data when using the `fit` and `fit_transform`
methods. Using all the data, e.g., `fit(X)`, can result in overly optimistic
scores.

Conversely, the `transform` method should be used on both train and test
subsets as the same preprocessing should be applied to all the data.
This can be achieved by using `fit_transform` on the train subset and
`transform` on the test subset.
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
leakage as it ensures that the appropriate method is performed on the
correct data subset. The pipeline is ideal for use in cross-validation
and hyper-parameter tuning functions.

.. _randomness:

@@ -413,7 +414,9 @@ it will allow the estimator RNG to vary for each fold.
illustration purpose: what matters is what we pass to the
:class:`~sklearn.ensemble.RandomForestClassifier` estimator.
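The distinction above can be sketched briefly (dataset and estimator settings are assumed for illustration): an integer seed gives the estimator an identical RNG on every cross-validation fold, while a `RandomState` instance lets the RNG evolve from fold to fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# Integer seed: the forest is rebuilt with the same RNG in every fold.
clf_int = RandomForestClassifier(random_state=0, n_estimators=10)

# RandomState instance: each fold's fit draws from the shared RNG, so the
# estimator's randomness varies across folds.
clf_rng = RandomForestClassifier(
    random_state=np.random.RandomState(0), n_estimators=10
)

scores_int = cross_val_score(clf_int, X, y, cv=3)
scores_rng = cross_val_score(clf_rng, X, y, cv=3)
```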

|details-start|
**Cloning**
|details-split|

Another subtle side effect of passing `RandomState` instances is how
:func:`~sklearn.base.clone` will work::
@@ -447,6 +450,8 @@ influence each other.
:class:`~sklearn.ensemble.StackingClassifier`,
:class:`~sklearn.calibration.CalibratedClassifierCV`, etc.).

|details-end|
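A condensed sketch of the cloning subtlety covered in the dropdown above (parameter values are illustrative): when `random_state` is a `RandomState` instance rather than an integer, the doc describes the result of :func:`~sklearn.base.clone` as a clone only in the statistical sense.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng, n_estimators=10)
b = clone(a)

# `a` and `b` are distinct estimator objects with the same parameters.
# Because `random_state` is a RandomState instance rather than an int,
# fitting them is not guaranteed to reproduce identical models the way a
# fixed integer seed would.
```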

CV splitters
............
