More improvements to the documentation on model persistence by ogrisel · Pull Request #29011 · scikit-learn/scikit-learn
Merged 2 commits on May 14, 2024
72 changes: 48 additions & 24 deletions doc/model_persistence.rst
@@ -80,7 +80,9 @@ persist and plan to serve the model:
- :ref:`ONNX <onnx_persistence>`: You need an `ONNX` runtime and an environment
with appropriate dependencies installed to load the model and use the runtime
to get predictions. This environment can be minimal and does not necessarily
even require `python` to be installed.
even require Python to be installed to load the model and compute
predictions. Also note that `onnxruntime` typically requires much less RAM
than Python to compute predictions from small models.

- :mod:`skops.io`, :mod:`pickle`, :mod:`joblib`, `cloudpickle`_: You need a
Python environment with the appropriate dependencies installed to load the
@@ -208,13 +210,20 @@ persist and load your scikit-learn model, and they all follow the same API::

# Here you can replace pickle with joblib or cloudpickle
from pickle import dump
with open('filename.pkl', 'wb') as f: dump(clf, f)
with open("filename.pkl", "wb") as f:
dump(clf, f, protocol=5)

Using `protocol=5` is recommended to reduce memory usage and make it faster to
store and load any large NumPy array stored as a fitted attribute in the model.
You can alternatively pass `protocol=pickle.HIGHEST_PROTOCOL` which is
equivalent to `protocol=5` in Python 3.8 and later (at the time of writing).
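
As an illustrative, self-contained sketch of the round trip described above (using a plain dict as a stand-in for a fitted estimator such as `clf`, and a temporary directory instead of a real deployment path)::

```python
import os
import pickle
import tempfile

# Stand-in payload; in practice this would be a fitted estimator such as `clf`.
model = {"coef": [0.5, -1.2, 3.4], "intercept": 0.1}

path = os.path.join(tempfile.mkdtemp(), "filename.pkl")

# pickle.HIGHEST_PROTOCOL is 5 on Python 3.8 and later; protocol 5 adds
# out-of-band buffer support, which avoids extra copies of large binary
# payloads such as fitted NumPy arrays.
with open(path, "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == model
```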

And later when needed, you can load the same object from the persisted file::

# Here you can replace pickle with joblib or cloudpickle
from pickle import load
with open('filename.pkl', 'rb') as f: clf = load(f)
with open("filename.pkl", "rb") as f:
clf = load(f)

|details-end|

@@ -224,12 +233,14 @@ Security & Maintainability Limitations
--------------------------------------

:mod:`pickle` (and :mod:`joblib` and `cloudpickle`_ by extension) has
many documented security vulnerabilities and should only be used if the
artifact, i.e. the pickle-file, is coming from a trusted and verified source.
many documented security vulnerabilities by design and should only be used if
the artifact, i.e. the pickle-file, is coming from a trusted and verified
source. You should never load a pickle file from an untrusted source, similarly
to how you should never execute code from an untrusted source.
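
The claim that loading a pickle amounts to executing code can be demonstrated with the standard library alone. The class below is a hypothetical illustration, not something scikit-learn produces: `__reduce__` lets a pickled payload instruct the unpickler to call an arbitrary callable::

```python
import pickle

class NotAModel:
    # __reduce__ tells pickle how to reconstruct the object. Here it makes
    # the unpickler call print(...) with an attacker-chosen argument; any
    # callable, e.g. os.system, could be substituted.
    def __reduce__(self):
        return (print, ("arbitrary code executed during unpickling",))

payload = pickle.dumps(NotAModel())

# Merely loading the bytes runs the embedded call: loading == executing.
pickle.loads(payload)
```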

Also note that arbitrary computations can be represented using the `ONNX`
format, and therefore a sandbox used to serve models using `ONNX` also needs to
safeguard against computational and memory exploits.
format, and it is therefore recommended to serve models using `ONNX` in a
sandboxed environment to safeguard against computational and memory exploits.
ogrisel (Member, Author) commented:
@adrinjalali do you have a good reference for the threat model of loading ONNX files from untrusted origins?

It seems that it has some filesystem access (via External Data Files) but overall it seems quite safe to load and run inference on untrusted ONNX files with onnxruntime (assuming no security bugs in onnxruntime itself).

Member replied:
I agree with the assessment. It's much better than loading pickle, and the potential issues can be handled if one limits the resources available to the onnxruntime via sandboxing.


Also note that there are no supported ways to load a model trained with a
different version of scikit-learn. While using :mod:`skops.io`, :mod:`joblib`,
@@ -298,7 +309,8 @@ can be caught to obtain the original version the estimator was pickled with::
warnings.simplefilter("error", InconsistentVersionWarning)

try:
est = pickle.loads("model_from_prevision_version.pickle")
with open("model_from_previous_version.pickle", "rb") as f:
    est = pickle.load(f)
except InconsistentVersionWarning as w:
print(w.original_sklearn_version)
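
The escalate-a-warning-to-an-error pattern used above can be sketched with the standard library alone. `StaleArtifactWarning` and `load_artifact` below are hypothetical stand-ins for `InconsistentVersionWarning` and the pickle loading step::

```python
import warnings

class StaleArtifactWarning(UserWarning):
    """Hypothetical stand-in for sklearn.exceptions.InconsistentVersionWarning."""
    def __init__(self, message, original_version):
        super().__init__(message)
        self.original_version = original_version

def load_artifact():
    # Simulates a loader that warns (rather than raises) when the artifact
    # was written by an older library version.
    warnings.warn(StaleArtifactWarning("version mismatch", original_version="1.3.2"))
    return {"loaded": True}

with warnings.catch_warnings():
    # Escalate only this warning class to an error so it can be caught;
    # when a Warning *instance* is passed to warnings.warn, that very
    # instance is raised, so its attributes survive.
    warnings.simplefilter("error", StaleArtifactWarning)
    try:
        artifact = load_artifact()
    except StaleArtifactWarning as w:
        print(w.original_version)  # -> 1.3.2
```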

@@ -328,22 +340,34 @@ each approach can be summarized as follows:
* :mod:`skops.io`: Trained scikit-learn models can be easily shared and put
into production using :mod:`skops.io`. It is more secure compared to
alternate approaches based on :mod:`pickle` because it does not load
arbitrary code unless explicitly asked for by the user.
arbitrary code unless explicitly asked for by the user. Such code needs to be
packaged and importable in the target Python environment.
* :mod:`joblib`: Efficient memory mapping techniques make it faster when using
the same persisted model in multiple Python processes. It also gives easy
shortcuts to compress and decompress the persisted object without the need
for extra code. However, it may trigger the execution of malicious code while
loading untrusted data, as any other pickle-based persistence mechanism.
* :mod:`pickle`: It is native to Python and any Python object can be serialized
and deserialized using :mod:`pickle`, including custom Python classes and
objects. While :mod:`pickle` can be used to easily save and load scikit-learn
models, it may trigger the execution of malicious code while loading
untrusted data.
* `cloudpickle`_: It is slower than :mod:`pickle` and :mod:`joblib`, and is
more insecure than :mod:`pickle` and :mod:`joblib` since it can serialize
arbitrary code. However, in certain cases it might be a last resort to
persist certain models. Note that this is discouraged by `cloudpickle`_
itself since there are no forward compatibility guarantees and you might need
the same version of `cloudpickle`_ to load the persisted model.
the same persisted model in multiple Python processes when using
`mmap_mode="r"`. It also gives easy shortcuts to compress and decompress the
persisted object without the need for extra code. However, it may trigger the
execution of malicious code when loading a model from an untrusted source as
any other pickle-based persistence mechanism.
* :mod:`pickle`: It is native to Python and most Python objects can be
serialized and deserialized using :mod:`pickle`, including custom Python
classes and functions as long as they are defined in a package that can be
imported in the target environment. While :mod:`pickle` can be used to easily
save and load scikit-learn models, it may trigger the execution of malicious
code while loading a model from an untrusted source. :mod:`pickle` can also
be very memory-efficient if the model was persisted with `protocol=5` but
it does not support memory mapping.
* `cloudpickle`_: It has comparable loading efficiency as :mod:`pickle` and
:mod:`joblib` (without memory mapping), but offers additional flexibility to
serialize custom Python code such as lambda expressions and interactively
defined functions and classes. It might be a last resort to persist pipelines
with custom Python components such as a
:class:`sklearn.preprocessing.FunctionTransformer` that wraps a function
defined in the training script itself or more generally outside of any
importable Python package. Note that `cloudpickle`_ offers no forward
compatibility guarantees and you might need the same version of
`cloudpickle`_ to load the persisted model along with the same version of all
the libraries used to define the model. As with the other pickle-based persistence
mechanisms, it may trigger the execution of malicious code while loading
a model from an untrusted source.
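
The limitation that `cloudpickle`_ works around can be shown with the standard library alone: plain :mod:`pickle` serializes functions by reference (module plus qualified name), so a function with no importable name, such as a lambda defined in the training script, cannot be pickled. The `log_transform` name below is a hypothetical example::

```python
import pickle

# A function defined "in the script itself": a lambda has no importable
# qualified name, so pickle cannot serialize it by reference.
log_transform = lambda x: x  # stand-in for e.g. a FunctionTransformer func

try:
    pickle.dumps(log_transform)
except (pickle.PicklingError, AttributeError) as exc:
    print(f"pickle refused: {exc}")

# cloudpickle instead serializes such functions by value:
#   import cloudpickle
#   payload = cloudpickle.dumps(log_transform)  # succeeds
```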

.. _cloudpickle: https://github.com/cloudpipe/cloudpickle