Merge pull request #3317 from pignacio/model_persistence_doc · scikit-learn/scikit-learn@6460479 · GitHub

Commit 6460479

Merge pull request #3317 from pignacio/model_persistence_doc
DOC Documentation for model persistence
2 parents afcb384 + 4b79f22

File tree

3 files changed: +99 −0 lines

doc/model_selection.rst

Lines changed: 1 addition & 0 deletions

@@ -11,4 +11,5 @@ Model selection and evaluation
     modules/grid_search
     modules/pipeline
     modules/model_evaluation
+    modules/model_persistence
     modules/learning_curve

doc/modules/model_persistence.rst

Lines changed: 82 additions & 0 deletions

@@ -0,0 +1,82 @@
+.. _model_persistence:
+
+=================
+Model persistence
+=================
+
+After training a scikit-learn model, it is desirable to have a way to persist
+the model for future use without having to retrain. The following section gives
+you an example of how to persist a model with pickle. We'll also review a few
+security and maintainability issues when working with pickle serialization.
+
+
+Persistence example
+-------------------
+
+It is possible to save a model in scikit-learn by using Python's built-in
+persistence module, `pickle <http://docs.python.org/library/pickle.html>`_::
+
+  >>> from sklearn import svm
+  >>> from sklearn import datasets
+  >>> clf = svm.SVC()
+  >>> iris = datasets.load_iris()
+  >>> X, y = iris.data, iris.target
+  >>> clf.fit(X, y)  # doctest: +NORMALIZE_WHITESPACE
+  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
+      kernel='rbf', max_iter=-1, probability=False, random_state=None,
+      shrinking=True, tol=0.001, verbose=False)
+
+  >>> import pickle
+  >>> s = pickle.dumps(clf)
+  >>> clf2 = pickle.loads(s)
+  >>> clf2.predict(X[0])
+  array([0])
+  >>> y[0]
+  0
+
+In the specific case of scikit-learn, it may be more interesting to use
+joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
+which is more efficient on objects that carry large numpy arrays internally, as
+is often the case for fitted scikit-learn estimators, but it can only pickle to
+disk and not to a string::
+
+  >>> from sklearn.externals import joblib
+  >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
+
+Later you can load back the pickled model (possibly in another Python process)
+with::
+
+  >>> clf = joblib.load('filename.pkl') # doctest: +SKIP
+
+.. note::
+
+   ``joblib.dump`` returns a list of filenames. Each individual numpy array
+   contained in the ``clf`` object is serialized as a separate file on the
+   filesystem. All files are required in the same folder when reloading the
+   model with ``joblib.load``.
+
+
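The disk round-trip described above can be sketched with the standard library alone. This is an editorial sketch, not part of the diff: a plain dict stands in for the fitted estimator so the snippet runs without scikit-learn installed, and ``pickle`` stands in for ``joblib`` (a real fitted estimator would be dumped and loaded the same way).

```python
# Stdlib-only sketch of persisting a model to disk and loading it back.
# The dict is a stand-in for a fitted scikit-learn estimator, and pickle
# stands in for joblib; both follow the same dump/load pattern.
import pickle
import tempfile
from pathlib import Path

model = {"kernel": "rbf", "support_": [0, 1, 2]}  # stand-in for a fitted clf

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "filename.pkl"
    # Persist to disk (the joblib.dump analogue in this sketch).
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back -- in real use, possibly from another Python process.
    with open(path, "rb") as f:
        restored = pickle.load(f)

assert restored == model
print(restored["kernel"])  # -> rbf
```

Pickling to an in-memory string with ``pickle.dumps``/``pickle.loads``, as in the doctest above, works identically; the file-based form is what joblib requires.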
+Security & maintainability limitations
+--------------------------------------
+
+pickle (and joblib by extension) has some issues regarding maintainability
+and security. Because of this:
+
+* Never unpickle untrusted data.
+* Models saved with one version of scikit-learn might not load with another
+  version.
+
+In order to rebuild a similar model with future versions of scikit-learn,
+additional metadata should be saved along with the pickled model:
+
+* The training data, e.g. a reference to an immutable snapshot
+* The Python source code used to generate the model
+* The versions of scikit-learn and its dependencies
+* The cross-validation score obtained on the training data
+
+This should make it possible to check that the cross-validation score is in the
+same range as before.
+
+If you want to know more about these issues and explore other possible
+serialization methods, please refer to this
+`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
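The metadata checklist above can be implemented as a small JSON "sidecar" written next to the pickled model. This is an editorial sketch, not part of the diff: the helper ``save_with_metadata``, the key names, and the sample values are all hypothetical, not scikit-learn API.

```python
# Hypothetical sketch of the metadata suggestion above: pickle the model and
# write a JSON sidecar next to it recording versions, a reference to the
# training data, and the CV score. Names and keys are illustrative only.
import json
import pickle
import sys
import tempfile
from pathlib import Path

def save_with_metadata(model, path, metadata):
    """Pickle `model` to `path` and write `path`.json with `metadata`."""
    path = Path(path)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Record enough context to rebuild and validate the model later.
    meta = {"python_version": sys.version, **metadata}
    sidecar = path.with_suffix(".json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar

with tempfile.TemporaryDirectory() as tmp:
    model = {"params": {"C": 1.0}}  # stand-in for a fitted estimator
    sidecar = save_with_metadata(
        model,
        Path(tmp) / "model.pkl",
        {
            "sklearn_version": "0.15",         # assumed version string
            "training_data": "iris snapshot",  # a reference, not the data
            "cv_score": 0.98,                  # score obtained at training
        },
    )
    meta = json.loads(sidecar.read_text())

assert meta["cv_score"] == 0.98
```

On reload, comparing a freshly computed cross-validation score against the stored ``cv_score`` gives the "same range as before" check the text describes.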

doc/tutorial/basic/tutorial.rst

Lines changed: 16 additions & 0 deletions
@@ -233,4 +233,20 @@ and not to a string::

   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
+
+Later you can load back the pickled model (possibly in another Python process)
+with::
+
+  >>> clf = joblib.load('filename.pkl') # doctest: +SKIP
+
+.. note::
+
+   ``joblib.dump`` returns a list of filenames. Each individual numpy array
+   contained in the ``clf`` object is serialized as a separate file on the
+   filesystem. All files are required in the same folder when reloading the
+   model with ``joblib.load``.
+
+Note that pickle has some security and maintainability issues. Please refer to
+section :ref:`model_persistence` for more detailed information about model
+persistence with scikit-learn.

0 commit comments
