From 8489c33010f35302b515b0859db485b209dc8b15 Mon Sep 17 00:00:00 2001 From: Raul Garreta Date: Thu, 17 Apr 2014 16:30:05 -0400 Subject: [PATCH 1/6] added a new section on model persistence --- doc/model_persistence.rst | 85 +++++++++++++++++++++++++++++++++ doc/tutorial/basic/tutorial.rst | 4 ++ doc/user_guide.rst | 1 + 3 files changed, 90 insertions(+) create mode 100644 doc/model_persistence.rst diff --git a/doc/model_persistence.rst b/doc/model_persistence.rst new file mode 100644 index 0000000000000..7aabc4c4cf487 --- /dev/null +++ b/doc/model_persistence.rst @@ -0,0 +1,85 @@ +.. _model_persistence: + +================= +Model persistence +================= + +After training a scikit-learn model, it is desirable to have a way to persist +the model for future use without having to retrain. The following section gives +you an example of how to persist a model with pickle. We'll also review a few +security and maintainability issues when working with pickle serialization. + + +Persistence example +------------------- + +It is possible to save a model in the scikit by using Python's built-in +persistence model, namely `pickle `_:: + + >>> from sklearn import svm + >>> from sklearn import datasets + >>> clf = svm.SVC() + >>> iris = datasets.load_iris() + >>> X, y = iris.data, iris.target + >>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE + SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, + kernel='rbf', max_iter=-1, probability=False, random_state=None, + shrinking=True, tol=0.001, verbose=False) + + >>> import pickle + >>> s = pickle.dumps(clf) + >>> clf2 = pickle.loads(s) + >>> clf2.predict(X[0]) + array([0]) + >>> y[0] + 0 + +In the specific case of the scikit, it may be more interesting to use +joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``), +which is more efficient on big data, but can only pickle to the disk +and not to a string:: + + >>> from sklearn.externals import joblib + >>> joblib.dump(clf, 
'filename.pkl') # doctest: +SKIP + + +Security & maintainability limitations +-------------------------------------- + +You must be aware that pickle has some issues regarding maintainability and +security. From the **maintainability** point of view, you should take care the +issues that may arise if you upgrade your sklearn library while still loading a +model that was trained with a previous version, the model may have a code +structure that could not be compatible with newer versions and thus, don't work. +The same issue could also happen if you upgrade numpy or scipy versions. + +A good practice is to save the scikit-learn, numpy and scipy versions to know +exactly what versions have been used to generate the model. You can do that, for +example, by executing a ``pip freeze`` command and saving the output to a text +file which should be stored together with your pickles. +Also, save a snapshot of your data to make it possible to retrain the model +if incompatibility issues arise when upgrading the libraries. + +Regarding **security** issues, you may know that pickle is implemented with a +stack machine that executes instructions. As a difference with other +serialization methods like JSON, BSON, YAML, etc, which are all data oriented, +pickle is instruction oriented. Pickle serializes objects by persisting a set of +instructions that will be then executed at deserialization time in order to +reconstruct your objects. In fact, as part of the deserialization process, +pickle could call any arbitrary function, which opens up security +vulnerabilities against any malicious data or exploits. + +Here is the warning from the official pickle documentation: + +.. warning:: + + The pickle module is not intended to be secure against erroneous or + maliciously constructed data. Never unpickle data received from an untrusted + or unauthenticated source. 
+ +If you want to know more about these issues and explore other possible +serialization methods, please refer to this +`talk by Alex Gaynor `_. + + + \ No newline at end of file diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst index 14630ff837a9d..784bf6f2911bd 100644 --- a/doc/tutorial/basic/tutorial.rst +++ b/doc/tutorial/basic/tutorial.rst @@ -234,3 +234,7 @@ and not to a string:: >>> from sklearn.externals import joblib >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP +It's important for you to know that pickle has some security and maintainability +issues. Please refer to section :ref:`model_persistence` for more detailed +information about model persistence with scikit-learn. + diff --git a/doc/user_guide.rst b/doc/user_guide.rst index 83f749c1981b1..0e66747232b8e 100644 --- a/doc/user_guide.rst +++ b/doc/user_guide.rst @@ -22,3 +22,4 @@ Dataset loading utilities modules/scaling_strategies.rst modules/computational_performance.rst + model_persistence.rst From df27e261566384898291de23a99faaa6a9546946 Mon Sep 17 00:00:00 2001 From: Raul Garreta Date: Thu, 17 Apr 2014 16:59:37 -0400 Subject: [PATCH 2/6] model persistence doc, added improvements from ogrisel comments --- doc/model_persistence.rst | 17 +++++++++++++++-- doc/tutorial/basic/tutorial.rst | 20 ++++++++++++++++---- 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/doc/model_persistence.rst b/doc/model_persistence.rst index 7aabc4c4cf487..f45e06d000144 100644 --- a/doc/model_persistence.rst +++ b/doc/model_persistence.rst @@ -36,11 +36,24 @@ persistence model, namely `pickle `_ In the specific case of the scikit, it may be more interesting to use joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``), -which is more efficient on big data, but can only pickle to the disk -and not to a string:: +which is more efficient on objects that carry large numpy arrays internally as +is often the case for fitted scikit-learn estimators, but can only 
pickle to the +disk and not to a string:: >>> from sklearn.externals import joblib >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP + +Later you can load back the pickled model (possibly in another Python process) +with:: + + >>> clf = joblib.load('filename.pkl') # doctest:+SKIP + +.. note:: + + joblib.dump returns a list of filenames. Each individual numpy array + contained in the `clf` object is serialized as a separate file on the + filesystem. All files are required in the same folder when reloading the + model with joblib.load. Security & maintainability limitations diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst index 784bf6f2911bd..10685485bbc91 100644 --- a/doc/tutorial/basic/tutorial.rst +++ b/doc/tutorial/basic/tutorial.rst @@ -233,8 +233,20 @@ and not to a string:: >>> from sklearn.externals import joblib >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP - -It's important for you to know that pickle has some security and maintainability -issues. Please refer to section :ref:`model_persistence` for more detailed -information about model persistence with scikit-learn. + +Later you can load back the pickled model (possibly in another Python process) +with:: + + >>> clf = joblib.load('filename.pkl') # doctest:+SKIP + +.. note:: + + joblib.dump returns a list of filenames. Each individual numpy array + contained in the `clf` object is serialized as a separate file on the + filesystem. All files are required in the same folder when reloading the + model with joblib.load. + +Note that pickle has some security and maintainability issues. Please refer to +section :ref:`model_persistence` for more detailed information about model +persistence with scikit-learn. 
From be315ed0a6f6c4e51e525ff0caae869619b9faea Mon Sep 17 00:00:00 2001 From: Ignacio Rossi Date: Wed, 25 Jun 2014 21:22:03 -0300 Subject: [PATCH 3/6] Move model persistence doc inside model selection section --- doc/model_selection.rst | 1 + doc/{ => modules}/model_persistence.rst | 0 doc/user_guide.rst | 1 - 3 files changed, 1 insertion(+), 1 deletion(-) rename doc/{ => modules}/model_persistence.rst (100%) diff --git a/doc/model_selection.rst b/doc/model_selection.rst index f54e0303d85a4..0e1d6e8ade04c 100644 --- a/doc/model_selection.rst +++ b/doc/model_selection.rst @@ -11,4 +11,5 @@ Model selection and evaluation modules/grid_search modules/pipeline modules/model_evaluation + modules/model_persistence modules/learning_curve diff --git a/doc/model_persistence.rst b/doc/modules/model_persistence.rst similarity index 100% rename from doc/model_persistence.rst rename to doc/modules/model_persistence.rst diff --git a/doc/user_guide.rst b/doc/user_guide.rst index 0e66747232b8e..83f749c1981b1 100644 --- a/doc/user_guide.rst +++ b/doc/user_guide.rst @@ -22,4 +22,3 @@ Dataset loading utilities modules/scaling_strategies.rst modules/computational_performance.rst - model_persistence.rst From 4c51809103c0b16825cfe22e27b90b69d58e9433 Mon Sep 17 00:00:00 2001 From: Ignacio Rossi Date: Wed, 25 Jun 2014 22:32:50 -0300 Subject: [PATCH 4/6] Remove trailing whitespace --- doc/modules/model_persistence.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index f45e06d000144..83b7a38cb295d 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -42,10 +42,10 @@ disk and not to a string:: >>> from sklearn.externals import joblib >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP - + Later you can load back the pickled model (possibly in another Python process) with:: - + >>> clf = joblib.load('filename.pkl') # doctest:+SKIP .. 
note:: @@ -89,10 +89,10 @@ Here is the warning from the official pickle documentation: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source. - + If you want to know more about these issues and explore other possible serialization methods, please refer to this -`talk by Alex Gaynor `_. - - - \ No newline at end of file +`talk by Alex Gaynor `_. + + + From 78c0f61060047886ed1ed4fbf23b4398c79201da Mon Sep 17 00:00:00 2001 From: Ignacio Rossi Date: Wed, 25 Jun 2014 23:35:48 -0300 Subject: [PATCH 5/6] Simplified security and maintenance section --- doc/modules/model_persistence.rst | 39 +++++-------------------------- 1 file changed, 6 insertions(+), 33 deletions(-) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index 83b7a38cb295d..09522292db4c5 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -59,40 +59,13 @@ with:: Security & maintainability limitations -------------------------------------- -You must be aware that pickle has some issues regarding maintainability and -security. From the **maintainability** point of view, you should take care the -issues that may arise if you upgrade your sklearn library while still loading a -model that was trained with a previous version, the model may have a code -structure that could not be compatible with newer versions and thus, don't work. -The same issue could also happen if you upgrade numpy or scipy versions. - -A good practice is to save the scikit-learn, numpy and scipy versions to know -exactly what versions have been used to generate the model. You can do that, for -example, by executing a ``pip freeze`` command and saving the output to a text -file which should be stored together with your pickles. 
-Also, save a snapshot of your data to make it possible to retrain the model -if incompatibility issues arise when upgrading the libraries. - -Regarding **security** issues, you may know that pickle is implemented with a -stack machine that executes instructions. As a difference with other -serialization methods like JSON, BSON, YAML, etc, which are all data oriented, -pickle is instruction oriented. Pickle serializes objects by persisting a set of -instructions that will be then executed at deserialization time in order to -reconstruct your objects. In fact, as part of the deserialization process, -pickle could call any arbitrary function, which opens up security -vulnerabilities against any malicious data or exploits. - -Here is the warning from the official pickle documentation: - -.. warning:: - - The pickle module is not intended to be secure against erroneous or - maliciously constructed data. Never unpickle data received from an untrusted - or unauthenticated source. +pickle (and joblib by extension) has some issues regarding maintainability +and security. Because of this, + +* Never unpickle untrusted data +* Models saved in one version of scikit-learn might not load in another + version. If you want to know more about these issues and explore other possible serialization methods, please refer to this `talk by Alex Gaynor `_. - - - From 4b79f220a405378f27f8d0fc6102388ba7420152 Mon Sep 17 00:00:00 2001 From: Ignacio Rossi Date: Thu, 26 Jun 2014 12:31:24 -0300 Subject: [PATCH 6/6] Metadata information for unpickling models in future versions --- doc/modules/model_persistence.rst | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst index 09522292db4c5..629df68cca9c0 100644 --- a/doc/modules/model_persistence.rst +++ b/doc/modules/model_persistence.rst @@ -66,6 +66,17 @@ and security. Because of this, * Models saved in one version of scikit-learn might not load in another version. 
+In order to rebuild a similar model with future versions of scikit-learn, +additional metadata should be saved alongside the pickled model: + +* The training data, e.g. a reference to an immutable snapshot +* The Python source code used to generate the model +* The versions of scikit-learn and its dependencies +* The cross-validation score obtained on the training data + +This should make it possible to check that the cross-validation score is in the +same range as before. + If you want to know more about these issues and explore other possible serialization methods, please refer to this `talk by Alex Gaynor `_.
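
The metadata list in patch 6/6 could be captured with a small helper that writes a JSON file next to the pickle. The following is only a sketch under stated assumptions, not part of the patch or of scikit-learn: the names ``save_with_metadata`` and ``installed_version``, the ``.meta.json`` suffix, and the example snapshot and source identifiers are all hypothetical, and a plain dict stands in for a fitted estimator so that the example runs without scikit-learn installed.

```python
import json
import pickle
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def save_with_metadata(model, path, cv_score, data_snapshot, source_ref):
    """Pickle `model` to `path` and write a JSON sidecar file describing it.

    Hypothetical helper: the sidecar name and its keys are illustrative only.
    """
    path = Path(path)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    info = {
        "python": sys.version.split()[0],
        # Versions of scikit-learn and its main dependencies (None if missing).
        "versions": {name: installed_version(name)
                     for name in ("scikit-learn", "numpy", "scipy")},
        "cv_score": cv_score,            # cross-validation score at training time
        "data_snapshot": data_snapshot,  # reference to an immutable data snapshot
        "source_ref": source_ref,        # e.g. a commit id of the training script
    }
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(info, indent=2))
    return sidecar


# Stand-in for a fitted estimator (assumption: any picklable object works here).
model = {"coef": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    sidecar = save_with_metadata(
        model,
        model_path,
        cv_score=0.96,
        data_snapshot="iris-snapshot-001",  # hypothetical identifier
        source_ref="train.py@abc123",       # hypothetical identifier
    )
    restored_info = json.loads(sidecar.read_text())
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)
```

Keeping the metadata in a separate JSON file, rather than inside the pickle, means it can still be read when the pickle itself no longer loads under a newer library version, which is the situation in which the recorded versions and score are needed.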