[MRG+3] ENH Caching Pipeline by memoizing transformer (#7990) · sergeyf/scikit-learn@68099a2 · GitHub

Commit 68099a2

glemaitre authored and sergeyf committed

[MRG+3] ENH Caching Pipeline by memoizing transformer (scikit-learn#7990)

* ENH Caching Pipeline by memoizing transformer
* Fix lesteve changes
* Fix comments
* Fix doc
* Fix jnothman comments

1 parent a526c3c commit 68099a2

File tree: 5 files changed, +323 -59 lines changed

doc/modules/pipeline.rst

107 additions, 37 deletions
@@ -39,13 +39,10 @@ is an estimator object::
     >>> from sklearn.decomposition import PCA
     >>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
     >>> pipe = Pipeline(estimators)
-    >>> pipe # doctest: +NORMALIZE_WHITESPACE
-    Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
-        n_components=None, random_state=None, svd_solver='auto', tol=0.0,
-        whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None,
-        coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto',
-        kernel='rbf', max_iter=-1, probability=False, random_state=None,
-        shrinking=True, tol=0.001, verbose=False))])
+    >>> pipe # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    Pipeline(memory=None,
+             steps=[('reduce_dim', PCA(copy=True,...)),
+                    ('clf', SVC(C=1.0,...))])
 
 The utility function :func:`make_pipeline` is a shorthand
 for constructing pipelines;
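
A pipeline built this way behaves like any single estimator. A minimal usage
sketch (the iris dataset and the ``n_components`` value are illustrative
choices, not part of this commit)::

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    iris = load_iris()
    pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])
    pipe.fit(iris.data, iris.target)           # fits PCA, then SVC on reduced data
    print(pipe.score(iris.data, iris.target))  # scores through the same two steps
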
@@ -56,7 +53,8 @@ filling in the names automatically::
     >>> from sklearn.naive_bayes import MultinomialNB
     >>> from sklearn.preprocessing import Binarizer
     >>> make_pipeline(Binarizer(), MultinomialNB()) # doctest: +NORMALIZE_WHITESPACE
-    Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
+    Pipeline(memory=None,
+             steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
              ('multinomialnb', MultinomialNB(alpha=1.0,
                                              class_prior=None,
                                              fit_prior=True))])
@@ -76,30 +74,26 @@ and as a ``dict`` in ``named_steps``::
 Parameters of the estimators in the pipeline can be accessed using the
 ``<estimator>__<parameter>`` syntax::
 
-    >>> pipe.set_params(clf__C=10) # doctest: +NORMALIZE_WHITESPACE
-    Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',
-        n_components=None, random_state=None, svd_solver='auto', tol=0.0,
-        whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None,
-        coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto',
-        kernel='rbf', max_iter=-1, probability=False, random_state=None,
-        shrinking=True, tol=0.001, verbose=False))])
-
+    >>> pipe.set_params(clf__C=10) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    Pipeline(memory=None,
+             steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',...)),
+                    ('clf', SVC(C=10, cache_size=200, class_weight=None,...))])
 
 This is particularly important for doing grid searches::
 
     >>> from sklearn.model_selection import GridSearchCV
-    >>> params = dict(reduce_dim__n_components=[2, 5, 10],
-    ...               clf__C=[0.1, 10, 100])
-    >>> grid_search = GridSearchCV(pipe, param_grid=params)
+    >>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
+    ...                   clf__C=[0.1, 10, 100])
+    >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
 
 Individual steps may also be replaced as parameters, and non-final steps may be
 ignored by setting them to ``None``::
 
     >>> from sklearn.linear_model import LogisticRegression
-    >>> params = dict(reduce_dim=[None, PCA(5), PCA(10)],
-    ...               clf=[SVC(), LogisticRegression()],
-    ...               clf__C=[0.1, 10, 100])
-    >>> grid_search = GridSearchCV(pipe, param_grid=params)
+    >>> param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
+    ...                   clf=[SVC(), LogisticRegression()],
+    ...                   clf__C=[0.1, 10, 100])
+    >>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
 
 .. topic:: Examples:
 
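
The rename from ``params`` to ``param_grid`` above changes no behaviour; the
``<step>__<parameter>`` keys are expanded by ``GridSearchCV`` as before. A
runnable sketch under that assumption (the digits dataset is an illustrative
choice, not part of this commit)::

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
    param_grid = dict(reduce_dim__n_components=[2, 5, 10],
                      clf__C=[0.1, 10, 100])
    grid_search = GridSearchCV(pipe, param_grid=param_grid)
    digits = load_digits()
    grid_search.fit(digits.data, digits.target)
    print(grid_search.best_params_)  # best combination over the 3 x 3 grid
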
@@ -108,6 +102,7 @@ ignored by setting them to ``None``::
  * :ref:`sphx_glr_auto_examples_plot_digits_pipe.py`
  * :ref:`sphx_glr_auto_examples_plot_kernel_approximation.py`
  * :ref:`sphx_glr_auto_examples_svm_plot_svm_anova.py`
+ * :ref:`sphx_glr_auto_examples_plot_compare_reduction.py`
 
 .. topic:: See also:
 
@@ -124,6 +119,84 @@ i.e. if the last estimator is a classifier, the :class:`Pipeline` can be used
 as a classifier. If the last estimator is a transformer, again, so is the
 pipeline.
 
+Caching transformers: avoid repeated computation
+-------------------------------------------------
+
+.. currentmodule:: sklearn.pipeline
+
+Fitting transformers may be computationally expensive. With its
+``memory`` parameter set, :class:`Pipeline` will cache each transformer
+after calling ``fit``.
+This feature is used to avoid refitting the transformers within a pipeline
+when the parameters and input data are identical. A typical example is
+a grid search, where the transformers can be fitted only once and reused
+for each configuration.
+
+The parameter ``memory`` is needed in order to cache the transformers.
+``memory`` can be either a string giving the directory in which to cache the
+transformers or a `joblib.Memory <https://pythonhosted.org/joblib/memory.html>`_
+object::
+
+    >>> from tempfile import mkdtemp
+    >>> from shutil import rmtree
+    >>> from sklearn.decomposition import PCA
+    >>> from sklearn.svm import SVC
+    >>> from sklearn.pipeline import Pipeline
+    >>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
+    >>> cachedir = mkdtemp()
+    >>> pipe = Pipeline(estimators, memory=cachedir)
+    >>> pipe # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    Pipeline(...,
+             steps=[('reduce_dim', PCA(copy=True,...)),
+                    ('clf', SVC(C=1.0,...))])
+    >>> # Clear the cache directory when you don't need it anymore
+    >>> rmtree(cachedir)
+
+.. warning:: **Side effect of caching transformers**
+
+   Using a :class:`Pipeline` without caching enabled, it is possible to
+   inspect the original transformer instance directly, such as::
+
+       >>> from sklearn.datasets import load_digits
+       >>> digits = load_digits()
+       >>> pca1 = PCA()
+       >>> svm1 = SVC()
+       >>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
+       >>> pipe.fit(digits.data, digits.target)
+       ... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+       Pipeline(memory=None,
+                steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
+       >>> # The pca instance can be inspected directly
+       >>> print(pca1.components_) # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+       [[ -1.77484909e-19 ... 4.07058917e-18]]
+
+   Enabling caching triggers a clone of the transformers before fitting.
+   Therefore, the transformer instance given to the pipeline cannot be
+   inspected directly.
+   In the following example, accessing the :class:`PCA` instance ``pca2``
+   will raise an ``AttributeError`` since ``pca2`` will be an unfitted
+   transformer.
+   Instead, use the attribute ``named_steps`` to inspect estimators within
+   the pipeline::
+
+       >>> cachedir = mkdtemp()
+       >>> pca2 = PCA()
+       >>> svm2 = SVC()
+       >>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
+       ...                        memory=cachedir)
+       >>> cached_pipe.fit(digits.data, digits.target)
+       ... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+       Pipeline(memory=...,
+                steps=[('reduce_dim', PCA(...)), ('clf', SVC(...))])
+       >>> print(cached_pipe.named_steps['reduce_dim'].components_)
+       ... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+       [[ -1.77484909e-19 ... 4.07058917e-18]]
+       >>> # Remove the cache directory
+       >>> rmtree(cachedir)
+
+.. topic:: Examples:
+
+ * :ref:`sphx_glr_auto_examples_plot_compare_reduction.py`
 
 .. _feature_union:
 
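
The new docs say ``memory`` accepts either a directory string or a
``joblib.Memory`` object. A sketch of the second form, assuming the 0.19-era
import path ``sklearn.externals.joblib`` used elsewhere in this commit::

    from shutil import rmtree
    from tempfile import mkdtemp

    from sklearn.decomposition import PCA
    from sklearn.externals.joblib import Memory
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    cachedir = mkdtemp()
    memory = Memory(cachedir=cachedir, verbose=1)  # verbose logs cache hits/misses
    pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())], memory=memory)
    # ... fit the pipeline, possibly repeatedly with identical data ...
    rmtree(cachedir)  # clear the cache once it is no longer needed
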
@@ -164,15 +237,11 @@ and ``value`` is an estimator object::
     >>> from sklearn.decomposition import KernelPCA
     >>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
     >>> combined = FeatureUnion(estimators)
-    >>> combined # doctest: +NORMALIZE_WHITESPACE
-    FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
-        iterated_power='auto', n_components=None, random_state=None,
-        svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca',
-        KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3,
-        eigen_solver='auto', fit_inverse_transform=False, gamma=None,
-        kernel='linear', kernel_params=None, max_iter=None, n_components=None,
-        n_jobs=1, random_state=None, remove_zero_eig=False, tol=0))],
-        transformer_weights=None)
+    >>> combined # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    FeatureUnion(n_jobs=1,
+                 transformer_list=[('linear_pca', PCA(copy=True,...)),
+                                   ('kernel_pca', KernelPCA(alpha=1.0,...))],
+                 transformer_weights=None)
 
 
 Like pipelines, feature unions have a shorthand constructor called
@@ -182,11 +251,12 @@ Like pipelines, feature unions have a shorthand constructor called
 Like ``Pipeline``, individual steps may be replaced using ``set_params``,
 and ignored by setting to ``None``::
 
-    >>> combined.set_params(kernel_pca=None) # doctest: +NORMALIZE_WHITESPACE
-    FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
-        iterated_power='auto', n_components=None, random_state=None,
-        svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
-        transformer_weights=None)
+    >>> combined.set_params(kernel_pca=None)
+    ... # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
+    FeatureUnion(n_jobs=1,
+                 transformer_list=[('linear_pca', PCA(copy=True,...)),
+                                   ('kernel_pca', None)],
+                 transformer_weights=None)
 
 .. topic:: Examples:
 
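
As the updated repr suggests, ``FeatureUnion`` runs each transformer on the
same input and concatenates the outputs column-wise. A small sketch (the iris
dataset and ``n_components`` values are illustrative choices, not part of this
commit)::

    from sklearn.datasets import load_iris
    from sklearn.decomposition import KernelPCA, PCA
    from sklearn.pipeline import FeatureUnion

    iris = load_iris()
    combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                             ('kernel_pca', KernelPCA(n_components=2))])
    X_features = combined.fit_transform(iris.data)
    print(X_features.shape)  # (150, 4): 2 PCA columns + 2 KernelPCA columns
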
doc/whats_new.rst

3 additions, 0 deletions

@@ -56,6 +56,9 @@ Enhancements
 - :class:`multioutput.MultiOutputRegressor` and :class:`multioutput.MultiOutputClassifier`
   now support online learning using `partial_fit`.
   :issue:`8053` by :user:`Peng Yu <yupbank>`.
+
+- :class:`pipeline.Pipeline` now allows caching transformers
+  within a pipeline by using the ``memory`` constructor parameter.
+  :issue:`7990` by :user:`Guillaume Lemaitre <glemaitre>`.
 
 - :class:`decomposition.PCA`, :class:`decomposition.IncrementalPCA` and
   :class:`decomposition.TruncatedSVD` now expose the singular values
examples/plot_compare_reduction.py

mode changed: 100644 -> 100755
61 additions, 6 deletions
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
 # -*- coding: utf-8 -*-
 """
 =================================================================
@@ -7,13 +7,27 @@
 
 This example constructs a pipeline that does dimensionality
 reduction followed by prediction with a support vector
-classifier. It demonstrates the use of GridSearchCV and
-Pipeline to optimize over different classes of estimators in a
-single CV run -- unsupervised PCA and NMF dimensionality
+classifier. It demonstrates the use of ``GridSearchCV`` and
+``Pipeline`` to optimize over different classes of estimators in a
+single CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality
 reductions are compared to univariate feature selection during
 the grid search.
+
+Additionally, ``Pipeline`` can be instantiated with the ``memory``
+argument to memoize the transformers within the pipeline, avoiding
+fitting the same transformers over and over again.
+
+Note that the use of ``memory`` to enable caching becomes interesting when the
+fitting of a transformer is costly.
 """
-# Authors: Robert McGibbon, Joel Nothman
+
+###############################################################################
+# Illustration of ``Pipeline`` and ``GridSearchCV``
+###############################################################################
+# This section illustrates the use of a ``Pipeline`` with ``GridSearchCV``.
+
+# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre
 
 from __future__ import print_function, division
 
@@ -49,7 +63,7 @@
 ]
 reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
 
-grid = GridSearchCV(pipe, cv=3, n_jobs=2, param_grid=param_grid)
+grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
 digits = load_digits()
 grid.fit(digits.data, digits.target)
 
@@ -72,4 +86,45 @@
 plt.ylabel('Digit classification accuracy')
 plt.ylim((0, 1))
 plt.legend(loc='upper left')
+
+###############################################################################
+# Caching transformers within a ``Pipeline``
+###############################################################################
+# It is sometimes worthwhile to store the state of a specific transformer,
+# since it could be used again. Using a pipeline in ``GridSearchCV`` triggers
+# such situations. Therefore, we use the argument ``memory`` to enable caching.
+#
+# .. warning::
+#     Note that this example is, however, only an illustration since for this
+#     specific case fitting PCA is not necessarily slower than loading the
+#     cache. Hence, use the ``memory`` constructor parameter when the fitting
+#     of a transformer is costly.
+
+from tempfile import mkdtemp
+from shutil import rmtree
+from sklearn.externals.joblib import Memory
+
+# Create a temporary folder to store the transformers of the pipeline
+cachedir = mkdtemp()
+memory = Memory(cachedir=cachedir, verbose=10)
+cached_pipe = Pipeline([('reduce_dim', PCA()),
+                        ('classify', LinearSVC())],
+                       memory=memory)
+
+# This time, a cached pipeline will be used within the grid search
+grid = GridSearchCV(cached_pipe, cv=3, n_jobs=1, param_grid=param_grid)
+digits = load_digits()
+grid.fit(digits.data, digits.target)
+
+# Delete the temporary cache before exiting
+rmtree(cachedir)
+
+###############################################################################
+# The ``PCA`` fitting is only computed at the evaluation of the first
+# configuration of the ``C`` parameter of the ``LinearSVC`` classifier. The
+# other configurations of ``C`` will trigger the loading of the cached ``PCA``
+# estimator data, saving processing time. Therefore, caching the pipeline
+# with ``memory`` is highly beneficial when fitting a transformer is costly.
+
 plt.show()
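
Beyond the verbose ``Memory`` logging used above, one way to observe the cache
at work is to fit the same cached pipeline twice and compare wall-clock times;
the second fit should mostly load the memoized ``PCA`` rather than refit it.
A sketch under these assumptions, not part of the commit::

    import time
    from shutil import rmtree
    from tempfile import mkdtemp

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.externals.joblib import Memory
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    cachedir = mkdtemp()
    pipe = Pipeline([('reduce_dim', PCA(n_components=20)),
                     ('classify', LinearSVC())],
                    memory=Memory(cachedir=cachedir, verbose=0))
    digits = load_digits()

    for run in range(2):  # the second run should hit the transformer cache
        start = time.time()
        pipe.fit(digits.data, digits.target)
        print('fit %d took %.3fs' % (run, time.time() - start))

    rmtree(cachedir)  # clean up the temporary cache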
