|
| 1 | +.. _model_persistence: |
| 2 | + |
| 3 | +================= |
| 4 | +Model persistence |
| 5 | +================= |
| 6 | + |
| 7 | +After training a scikit-learn model, it is desirable to have a way to persist |
| 8 | +the model for future use without having to retrain. The following section gives |
| 9 | +you an example of how to persist a model with pickle. We'll also review a few |
| 10 | +security and maintainability issues when working with pickle serialization. |
| 11 | + |
| 12 | + |
| 13 | +Persistence example |
| 14 | +------------------- |
| 15 | + |
| 16 | +It is possible to save a scikit-learn model by using Python's built-in |
| 17 | +persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_:: |
| 18 | + |
| 19 | + >>> from sklearn import svm |
| 20 | + >>> from sklearn import datasets |
| 21 | + >>> clf = svm.SVC() |
| 22 | + >>> iris = datasets.load_iris() |
| 23 | + >>> X, y = iris.data, iris.target |
| 24 | + >>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE |
| 25 | + SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, |
| 26 | + kernel='rbf', max_iter=-1, probability=False, random_state=None, |
| 27 | + shrinking=True, tol=0.001, verbose=False) |
| 28 | + |
| 29 | + >>> import pickle |
| 30 | + >>> s = pickle.dumps(clf) |
| 31 | + >>> clf2 = pickle.loads(s) |
| 32 | + >>> clf2.predict(X[0:1]) |
| 33 | + array([0]) |
| 34 | + >>> y[0] |
| 35 | + 0 |
| 36 | + |
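
The same round trip also works through a file on disk, using ``pickle.dump`` and ``pickle.load`` from the standard library. The sketch below is illustrative only: the dictionary stands in for a fitted estimator so that the code runs without scikit-learn installed, and the file name ``model.pkl`` is an arbitrary choice.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator (assumption: any picklable object is
# handled the same way, including a fitted scikit-learn model).
model = {"kernel": "rbf", "C": 1.0, "support": [0.5, 1.5, 2.5]}

# Write the pickle to a file inside a temporary directory.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "model.pkl")
with open(path, "wb") as f:  # pickle files must be opened in binary mode
    pickle.dump(model, f)

# Later, possibly in another Python process, read the object back.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The restored object is an equal but independent copy of the original, which is the property the ``clf2`` example above relies on.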
| 37 | +For scikit-learn models, it may be preferable to use joblib's replacement |
| 38 | +for pickle (``joblib.dump`` and ``joblib.load``), which is more efficient on |
| 39 | +objects that carry large numpy arrays internally, as is often the case for |
| 40 | +fitted scikit-learn estimators. However, it can only pickle to disk and not |
| 41 | +to a string:: |
| 42 | + |
| 43 | + >>> from sklearn.externals import joblib |
| 44 | + >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP |
| 45 | + |
| 46 | +Later you can load back the pickled model (possibly in another Python process) |
| 47 | +with:: |
| 48 | + |
| 49 | + >>> clf = joblib.load('filename.pkl') # doctest:+SKIP |
| 50 | + |
| 51 | +.. note:: |
| 52 | + |
| 53 | + ``joblib.dump`` returns a list of filenames. Each individual numpy array |
| 54 | + contained in the ``clf`` object is serialized as a separate file on the |
| 55 | + filesystem. All files are required in the same folder when reloading the |
| 56 | + model with ``joblib.load``. |
| 57 | + |
| 58 | + |
| 59 | +Security & maintainability limitations |
| 60 | +-------------------------------------- |
| 61 | + |
| 62 | +pickle (and joblib by extension) has some issues regarding maintainability |
| 63 | +and security. Because of this: |
| 64 | + |
| 65 | +* Never unpickle untrusted data, as unpickling can execute arbitrary code. |
| 66 | +* Models saved in one version of scikit-learn might not load in another |
| 67 | + version. |
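
To see why the first rule matters, consider the hedged sketch below. Through ``__reduce__``, a pickle can name any importable callable to be invoked at load time. The class name ``Untrusted`` is hypothetical and the payload is deliberately harmless (it only evaluates an arithmetic expression), but an attacker could substitute any call.

```python
import pickle


class Untrusted:
    """Hypothetical object whose pickle runs code when it is loaded."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of the object
        # state; the call happens inside pickle.loads. The expression is
        # harmless here, but it could be any Python code.
        return (eval, ("sum(range(5))",))


payload = pickle.dumps(Untrusted())

# Loading does not rebuild an Untrusted instance: it runs eval(...) and
# returns whatever that call produced.
result = pickle.loads(payload)
```

Nothing in ``payload`` warns the loader that a function call is about to happen, which is why pickles should only be loaded from sources you trust.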
| 68 | + |
| 69 | +In order to rebuild a similar model with future versions of scikit-learn, |
| 70 | +additional metadata should be saved alongside the pickled model: |
| 71 | + |
| 72 | +* The training data, e.g. a reference to an immutable snapshot |
| 73 | +* The Python source code used to generate the model |
| 74 | +* The versions of scikit-learn and its dependencies |
| 75 | +* The cross-validation score obtained on the training data |
| 76 | + |
| 77 | +This should make it possible to check that the cross-validation score is in the |
| 78 | +same range as before. |
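
One possible way to keep this metadata next to the pickle is a small JSON file, sketched below. This is an illustration under stated assumptions, not a scikit-learn API: the helper names ``save_with_metadata`` and ``load_metadata``, the file names, and every metadata value are hypothetical placeholders, and a dictionary stands in for the fitted estimator so that the code runs without scikit-learn installed.

```python
import json
import os
import pickle
import platform
import tempfile


def save_with_metadata(model, directory, metadata):
    """Pickle `model` and write `metadata` next to it as JSON."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    with open(os.path.join(directory, "metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)


def load_metadata(directory):
    """Read the JSON metadata without unpickling the model."""
    with open(os.path.join(directory, "metadata.json")) as f:
        return json.load(f)


# All values below are placeholders for illustration, not real results.
metadata = {
    "training_data": "snapshots/iris-v1.csv",  # reference to an immutable snapshot
    "source_code": "train.py",                 # script that produced the model
    "versions": {
        "python": platform.python_version(),
        "scikit-learn": "<version used for training>",
    },
    "cv_score": 0.5,                           # placeholder score
}

model = {"stand-in": "fitted estimator"}       # any picklable object
directory = tempfile.mkdtemp()
save_with_metadata(model, directory, metadata)
stored = load_metadata(directory)
```

Because the metadata is plain JSON, it stays readable even when the pickle itself no longer loads. In a real project the ``scikit-learn`` entry could be filled from ``sklearn.__version__``.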
| 79 | + |
| 80 | +If you want to know more about these issues and explore other possible |
| 81 | +serialization methods, please refer to this |
| 82 | +`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_. |