Selecting dimensionality reduction with Pipeline and GridSearchCV
This example constructs a pipeline that does dimensionality reduction followed by prediction with a support vector classifier. It demonstrates the use of GridSearchCV and Pipeline to optimize over different classes of estimators in a single CV run: unsupervised PCA and NMF dimensionality reductions are compared to univariate feature selection during the grid search.
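To see how one grid can span several estimator classes, the candidates can be enumerated with ParameterGrid before any fitting happens. This is a minimal sketch with the classifier parameters left out for brevity; the names candidates, n_candidates and reducer_names are illustrative and not part of the example script.

```python
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import ParameterGrid

# Each dict is its own sub-grid, so parameters that only make sense for one
# family of reducers (n_components vs. k) never get mixed together.
candidates = ParameterGrid(
    [
        {"reduce_dim": [PCA(), NMF()], "reduce_dim__n_components": [2, 4, 8]},
        {"reduce_dim": [SelectKBest(mutual_info_classif)], "reduce_dim__k": [2, 4, 8]},
    ]
)

# 2 reducers * 3 sizes + 1 reducer * 3 sizes
n_candidates = len(candidates)

# Class names of the estimators that will be tried in the "reduce_dim" slot
reducer_names = sorted({type(params["reduce_dim"]).__name__ for params in candidates})
print(n_candidates, reducer_names)
```

GridSearchCV evaluates the union of the sub-grids, which is why a single search can compare all three reducers.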
Additionally, Pipeline can be instantiated with the memory argument to memoize the transformers within the pipeline, so that the same transformers are not fitted again and again.
Note that using memory to enable caching only pays off when fitting a transformer is costly.
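A minimal sketch of the memory argument follows, assuming a temporary directory is an acceptable cache location. The names cache_dir and cached_pipe and the choice of PCA(n_components=8) are illustrative, not taken from the example script.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Directory in which fitted transformers are stored (illustrative location)
cache_dir = mkdtemp()

cached_pipe = Pipeline(
    [
        ("reduce_dim", PCA(n_components=8)),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ],
    memory=cache_dir,  # a path string or a joblib.Memory object
)
cached_pipe.fit(X, y)
train_accuracy = cached_pipe.score(X, y)

# Remove the cache once it is no longer needed
rmtree(cache_dir)
```

With caching enabled, the pipeline fits a clone of each transformer, so the fitted transformer should be inspected through cached_pipe.named_steps rather than through the original instance.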
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
Illustration of Pipeline and GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
X, y = load_digits(return_X_y=True)
pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]
grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)
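Once the search has been fitted, the winning configuration and the per-candidate scores can be read from best_params_ and cv_results_. The sketch below uses a deliberately reduced grid and cv=3 so that it runs quickly. Those reductions and the names small_pipe, small_grid and search are assumptions made here, not part of the example script.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

small_pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),  # filled in by the grid
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
# Reduced grid (assumption for speed): two sizes per reducer, a single C value
small_grid = [
    {"reduce_dim": [PCA()], "reduce_dim__n_components": [4, 8], "classify__C": [1]},
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": [4, 8],
        "classify__C": [1],
    },
]
search = GridSearchCV(small_pipe, param_grid=small_grid, cv=3, n_jobs=1)
search.fit(X, y)

# Class name of the reducer in the best candidate, and one mean score per candidate
best_reducer = type(search.best_params_["reduce_dim"]).__name__
mean_scores = search.cv_results_["mean_test_score"]
print(best_reducer, search.best_score_)
```

The same attributes are available on the full grid defined above; only the number of candidates differs.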