From 8489c33010f35302b515b0859db485b209dc8b15 Mon Sep 17 00:00:00 2001
From: Raul Garreta <raul@tryolabs.com>
Date: Thu, 17 Apr 2014 16:30:05 -0400
Subject: [PATCH 1/6] added a new section on model persistence

---
 doc/model_persistence.rst       | 85 +++++++++++++++++++++++++++++++++
 doc/tutorial/basic/tutorial.rst |  4 ++
 doc/user_guide.rst              |  1 +
 3 files changed, 90 insertions(+)
 create mode 100644 doc/model_persistence.rst

diff --git a/doc/model_persistence.rst b/doc/model_persistence.rst
new file mode 100644
index 0000000000000..7aabc4c4cf487
--- /dev/null
+++ b/doc/model_persistence.rst
@@ -0,0 +1,85 @@
+.. _model_persistence:
+
+=================
+Model persistence
+=================
+
+After training a scikit-learn model, it is desirable to have a way to persist
+the model for future use without having to retrain. The following section gives
+you an example of how to persist a model with pickle. We'll also review a few
+security and maintainability issues when working with pickle serialization.
+
+
+Persistence example
+-------------------
+
+It is possible to save a scikit-learn model by using Python's built-in
+persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_::
+
+  >>> from sklearn import svm
+  >>> from sklearn import datasets
+  >>> clf = svm.SVC()
+  >>> iris = datasets.load_iris()
+  >>> X, y = iris.data, iris.target
+  >>> clf.fit(X, y)  # doctest: +NORMALIZE_WHITESPACE
+  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
+    kernel='rbf', max_iter=-1, probability=False, random_state=None,
+    shrinking=True, tol=0.001, verbose=False)
+
+  >>> import pickle
+  >>> s = pickle.dumps(clf)
+  >>> clf2 = pickle.loads(s)
+  >>> clf2.predict(X[0:1])
+  array([0])
+  >>> y[0]
+  0
+
+In the specific case of the scikit, it may be more interesting to use
+joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
+which is more efficient on big data, but can only pickle to the disk
+and not to a string::
+
+  >>> from sklearn.externals import joblib
+  >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
+
+
+Security & maintainability limitations
+--------------------------------------
+
+You must be aware that pickle has some issues regarding maintainability and
+security. From the **maintainability** point of view, you should take care the
+issues that may arise if you upgrade your sklearn library while still loading a
+model that was trained with a previous version, the model may have a code
+structure that could not be compatible with newer versions and thus, don't work.
+The same issue could also happen if you upgrade numpy or scipy versions.
+
+A good practice is to save the scikit-learn, numpy and scipy versions to know
+exactly what versions have been used to generate the model. You can do that, for
+example, by executing a ``pip freeze`` command and saving the output to a text
+file which should be stored together with your pickles.
+Also, save a snapshot of your data to make it possible to retrain the model
+if incompatibility issues arise when upgrading the libraries.
+
+Regarding **security** issues, you may know that pickle is implemented with a
+stack machine that executes instructions. As a difference with other
+serialization methods like JSON, BSON, YAML, etc, which are all data oriented,
+pickle is instruction oriented. Pickle serializes objects by persisting a set of
+instructions that will be then executed at deserialization time in order to
+reconstruct your objects. In fact, as part of the deserialization process,
+pickle could call any arbitrary function, which opens up security
+vulnerabilities against any malicious data or exploits.
+
+Here is the warning from the official pickle documentation:
+
+.. warning::
+
+    The pickle module is not intended to be secure against erroneous or
+    maliciously constructed data.  Never unpickle data received from an untrusted
+    or unauthenticated source.
+    
+If you want to know more about these issues and explore other possible
+serialization methods, please refer to this
+`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.  
+  
+  
+  
\ No newline at end of file
diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst
index 14630ff837a9d..784bf6f2911bd 100644
--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -234,3 +234,7 @@ and not to a string::
   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
 
+It's important for you to know that pickle has some security and maintainability
+issues. Please refer to section :ref:`model_persistence` for more detailed
+information about model persistence with scikit-learn.
+
diff --git a/doc/user_guide.rst b/doc/user_guide.rst
index 83f749c1981b1..0e66747232b8e 100644
--- a/doc/user_guide.rst
+++ b/doc/user_guide.rst
@@ -22,3 +22,4 @@
    Dataset loading utilities <datasets/index.rst>
    modules/scaling_strategies.rst
    modules/computational_performance.rst
+   model_persistence.rst

From df27e261566384898291de23a99faaa6a9546946 Mon Sep 17 00:00:00 2001
From: Raul Garreta <raul@tryolabs.com>
Date: Thu, 17 Apr 2014 16:59:37 -0400
Subject: [PATCH 2/6] model persistence doc, added improvements from ogrisel
 comments

---
 doc/model_persistence.rst       | 17 +++++++++++++++--
 doc/tutorial/basic/tutorial.rst | 20 ++++++++++++++++----
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/doc/model_persistence.rst b/doc/model_persistence.rst
index 7aabc4c4cf487..f45e06d000144 100644
--- a/doc/model_persistence.rst
+++ b/doc/model_persistence.rst
@@ -36,11 +36,24 @@ persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_
 
 In the specific case of the scikit, it may be more interesting to use
 joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
-which is more efficient on big data, but can only pickle to the disk
-and not to a string::
+which is more efficient on objects that carry large numpy arrays internally as
+is often the case for fitted scikit-learn estimators, but can only pickle to the
+disk and not to a string::
 
   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
+  
+Later you can load back the pickled model (possibly in another Python process)
+with::
+  
+  >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
+
+.. note::
+
+   ``joblib.dump`` returns a list of filenames. Each individual numpy array
+   contained in the ``clf`` object is serialized as a separate file on the
+   filesystem. All files are required in the same folder when reloading the
+   model with ``joblib.load``.
 
 
 Security & maintainability limitations
diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst
index 784bf6f2911bd..10685485bbc91 100644
--- a/doc/tutorial/basic/tutorial.rst
+++ b/doc/tutorial/basic/tutorial.rst
@@ -233,8 +233,20 @@ and not to a string::
 
   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
-
-It's important for you to know that pickle has some security and maintainability
-issues. Please refer to section :ref:`model_persistence` for more detailed
-information about model persistence with scikit-learn.
+  
+Later you can load back the pickled model (possibly in another Python process)
+with::
+  
+  >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
+
+.. note::
+
+   ``joblib.dump`` returns a list of filenames. Each individual numpy array
+   contained in the ``clf`` object is serialized as a separate file on the
+   filesystem. All files are required in the same folder when reloading the
+   model with ``joblib.load``.
+
+Note that pickle has some security and maintainability issues. Please refer to
+section :ref:`model_persistence` for more detailed information about model
+persistence with scikit-learn.
 

From be315ed0a6f6c4e51e525ff0caae869619b9faea Mon Sep 17 00:00:00 2001
From: Ignacio Rossi <rossi.ignacio@gmail.com>
Date: Wed, 25 Jun 2014 21:22:03 -0300
Subject: [PATCH 3/6] Move model persistence doc inside model selection section

---
 doc/model_selection.rst                 | 1 +
 doc/{ => modules}/model_persistence.rst | 0
 doc/user_guide.rst                      | 1 -
 3 files changed, 1 insertion(+), 1 deletion(-)
 rename doc/{ => modules}/model_persistence.rst (100%)

diff --git a/doc/model_selection.rst b/doc/model_selection.rst
index f54e0303d85a4..0e1d6e8ade04c 100644
--- a/doc/model_selection.rst
+++ b/doc/model_selection.rst
@@ -11,4 +11,5 @@ Model selection and evaluation
     modules/grid_search
     modules/pipeline
     modules/model_evaluation
+    modules/model_persistence
     modules/learning_curve
diff --git a/doc/model_persistence.rst b/doc/modules/model_persistence.rst
similarity index 100%
rename from doc/model_persistence.rst
rename to doc/modules/model_persistence.rst
diff --git a/doc/user_guide.rst b/doc/user_guide.rst
index 0e66747232b8e..83f749c1981b1 100644
--- a/doc/user_guide.rst
+++ b/doc/user_guide.rst
@@ -22,4 +22,3 @@
    Dataset loading utilities <datasets/index.rst>
    modules/scaling_strategies.rst
    modules/computational_performance.rst
-   model_persistence.rst

From 4c51809103c0b16825cfe22e27b90b69d58e9433 Mon Sep 17 00:00:00 2001
From: Ignacio Rossi <rossi.ignacio@gmail.com>
Date: Wed, 25 Jun 2014 22:32:50 -0300
Subject: [PATCH 4/6] Remove trailing whitespace

---
 doc/modules/model_persistence.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst
index f45e06d000144..83b7a38cb295d 100644
--- a/doc/modules/model_persistence.rst
+++ b/doc/modules/model_persistence.rst
@@ -42,10 +42,10 @@ disk and not to a string::
 
   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
-  
+
 Later you can load back the pickled model (possibly in another Python process)
 with::
-  
+
   >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
 
 .. note::
@@ -89,10 +89,10 @@ Here is the warning from the official pickle documentation:
     The pickle module is not intended to be secure against erroneous or
     maliciously constructed data.  Never unpickle data received from an untrusted
     or unauthenticated source.
-    
+
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
-`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.  
-  
-  
-  
\ No newline at end of file
+`talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
+
+
+

From 78c0f61060047886ed1ed4fbf23b4398c79201da Mon Sep 17 00:00:00 2001
From: Ignacio Rossi <rossi.ignacio@gmail.com>
Date: Wed, 25 Jun 2014 23:35:48 -0300
Subject: [PATCH 5/6] Simplified security and maintenance section

---
 doc/modules/model_persistence.rst | 39 +++++--------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)
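
Note for reviewers: the shortened "Never unpickle untrusted data" bullet
rests on the fact that unpickling can call arbitrary callables. A minimal
sketch of that behaviour, assuming only the standard library; the `Payload`
class is hypothetical and the evaluated expression is deliberately harmless:

```python
import pickle


class Payload:
    """Hypothetical class whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of the object.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())
# Loading runs eval("1 + 1"); the result replaces the expected object.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Here `pickle.loads` hands back the integer 2 rather than a `Payload`
instance. A crafted file could name any other importable callable in the same
way, which is why only pickles from trusted sources should be loaded.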

diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst
index 83b7a38cb295d..09522292db4c5 100644
--- a/doc/modules/model_persistence.rst
+++ b/doc/modules/model_persistence.rst
@@ -59,40 +59,13 @@ with::
 Security & maintainability limitations
 --------------------------------------
 
-You must be aware that pickle has some issues regarding maintainability and
-security. From the **maintainability** point of view, you should take care the
-issues that may arise if you upgrade your sklearn library while still loading a
-model that was trained with a previous version, the model may have a code
-structure that could not be compatible with newer versions and thus, don't work.
-The same issue could also happen if you upgrade numpy or scipy versions.
-
-A good practice is to save the scikit-learn, numpy and scipy versions to know
-exactly what versions have been used to generate the model. You can do that, for
-example, by executing a ``pip freeze`` command and saving the output to a text
-file which should be stored together with your pickles.
-Also, save a snapshot of your data to make it possible to retrain the model
-if incompatibility issues arise when upgrading the libraries.
-
-Regarding **security** issues, you may know that pickle is implemented with a
-stack machine that executes instructions. As a difference with other
-serialization methods like JSON, BSON, YAML, etc, which are all data oriented,
-pickle is instruction oriented. Pickle serializes objects by persisting a set of
-instructions that will be then executed at deserialization time in order to
-reconstruct your objects. In fact, as part of the deserialization process,
-pickle could call any arbitrary function, which opens up security
-vulnerabilities against any malicious data or exploits.
-
-Here is the warning from the official pickle documentation:
-
-.. warning::
-
-    The pickle module is not intended to be secure against erroneous or
-    maliciously constructed data.  Never unpickle data received from an untrusted
-    or unauthenticated source.
+pickle (and joblib by extension) has some issues regarding maintainability
+and security. Because of this,
+
+* Never unpickle untrusted data.
+* Models saved in one version of scikit-learn might not load in another
+  version.
 
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
 `talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.
-
-
-

From 4b79f220a405378f27f8d0fc6102388ba7420152 Mon Sep 17 00:00:00 2001
From: Ignacio Rossi <rossi.ignacio@gmail.com>
Date: Thu, 26 Jun 2014 12:31:24 -0300
Subject: [PATCH 6/6] Metadata information for unpickling models in future
 versions

---
 doc/modules/model_persistence.rst | 11 +++++++++++
 1 file changed, 11 insertions(+)
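
Note for reviewers: one possible way to keep the metadata listed in this
patch next to the pickle is a JSON sidecar file. This is only a sketch under
stated assumptions: `save_with_metadata`, the file names, and every metadata
value are hypothetical placeholders, and a plain dict stands in for a fitted
estimator so the snippet runs without scikit-learn installed. In practice the
version strings would be read from the installed packages.

```python
import json
import pickle
import platform
import tempfile
from pathlib import Path


def save_with_metadata(model, directory, metadata):
    """Pickle `model` and write `metadata` as JSON in the same directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / "model.pkl"        # hypothetical file name
    meta_path = directory / "model.meta.json"   # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # Record the interpreter version alongside the caller-supplied fields.
    record = dict(metadata, python_version=platform.python_version())
    with open(meta_path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return model_path, meta_path


# Stand-in for a fitted estimator; all values below are placeholders.
model = {"coef": [0.1, 0.2]}
metadata = {
    "training_data": "path/to/immutable/snapshot",
    "source_code": "path/to/training/script",
    "versions": {"scikit-learn": "x.y", "numpy": "x.y", "scipy": "x.y"},
    "cv_score": 0.97,
}

with tempfile.TemporaryDirectory() as tmp:
    model_path, meta_path = save_with_metadata(model, tmp, metadata)
    saved = json.loads(meta_path.read_text())
    restored = pickle.loads(model_path.read_bytes())
```

After retraining with a newer scikit-learn, the stored `cv_score` can be
compared with the new cross-validation score to check that it is in the same
range as before.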

diff --git a/doc/modules/model_persistence.rst b/doc/modules/model_persistence.rst
index 09522292db4c5..629df68cca9c0 100644
--- a/doc/modules/model_persistence.rst
+++ b/doc/modules/model_persistence.rst
@@ -66,6 +66,17 @@ and security. Because of this,
 * Models saved in one version of scikit-learn might not load in another
   version.
 
+In order to rebuild a similar model with future versions of scikit-learn,
+additional metadata should be saved along with the pickled model:
+
+* The training data, e.g. a reference to an immutable snapshot
+* The Python source code used to generate the model
+* The versions of scikit-learn and its dependencies
+* The cross-validation score obtained on the training data
+
+This should make it possible to check that the cross-validation score is in the
+same range as before.
+
 If you want to know more about these issues and explore other possible
 serialization methods, please refer to this
 `talk by Alex Gaynor <http://pyvideo.org/video/2566/pickles-are-for-delis-not-software>`_.