@@ -80,7 +80,9 @@ persist and plan to serve the model:
 - :ref:`ONNX <onnx_persistence>`: You need an `ONNX` runtime and an environment
   with appropriate dependencies installed to load the model and use the runtime
   to get predictions. This environment can be minimal and does not necessarily
-  even require `python` to be installed.
+  even require Python to be installed to load the model and compute
+  predictions. Also note that `onnxruntime` typically requires much less RAM
+  than Python to compute predictions from small models.
 
 - :mod:`skops.io`, :mod:`pickle`, :mod:`joblib`, `cloudpickle`_: You need a
   Python environment with the appropriate dependencies installed to load the
@@ -208,13 +210,20 @@ persist and load your scikit-learn model, and they all follow the same API::
 
     # Here you can replace pickle with joblib or cloudpickle
     from pickle import dump
-    with open('filename.pkl', 'wb') as f: dump(clf, f)
+    with open("filename.pkl", "wb") as f:
+        dump(clf, f, protocol=5)
+
+Using `protocol=5` is recommended to reduce memory usage and make it faster to
+store and load any large NumPy array stored as a fitted attribute in the model.
+You can alternatively pass `protocol=pickle.HIGHEST_PROTOCOL` which is
+equivalent to `protocol=5` in Python 3.8 and later (at the time of writing).
 
 And later when needed, you can load the same object from the persisted file::
 
     # Here you can replace pickle with joblib or cloudpickle
     from pickle import load
-    with open('filename.pkl', 'rb') as f: clf = load(f)
+    with open("filename.pkl", "rb") as f:
+        clf = load(f)
 
 |details-end|
 
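The round trip introduced in the hunk above can be sketched with the standard library alone. This is a minimal illustration, not part of the patch: the dictionary is a hypothetical stand-in for a fitted estimator, and the temporary directory is an assumption made so the sketch leaves no files behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator (a real model would be a
# scikit-learn object; a plain dict keeps the sketch dependency-free).
clf = {"coef_": [0.5, -1.25, 3.0], "intercept_": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.pkl")

    # Persist with an explicit protocol, as the new documentation recommends.
    with open(path, "wb") as f:
        pickle.dump(clf, f, protocol=5)

    # Load the same object back from the persisted file.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Protocol 5 exists in Python 3.8 and later, so HIGHEST_PROTOCOL is >= 5 there.
highest_is_at_least_5 = pickle.HIGHEST_PROTOCOL >= 5
```

The comparison at the end only checks that protocol 5 is available; it does not claim the two values stay equal in future Python releases.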
@@ -224,12 +233,14 @@ Security & Maintainability Limitations
 --------------------------------------
 
 :mod:`pickle` (and :mod:`joblib` and :mod:`cloudpickle` by extension), has
-many documented security vulnerabilities and should only be used if the
-artifact, i.e. the pickle-file, is coming from a trusted and verified source.
+many documented security vulnerabilities by design and should only be used if
+the artifact, i.e. the pickle file, comes from a trusted and verified
+source. You should never load a pickle file from an untrusted source, just as
+you should never execute code from an untrusted source.
 
 Also note that arbitrary computations can be represented using the `ONNX`
-format, and therefore a sandbox used to serve models using `ONNX` also needs to
-safeguard against computational and memory exploits.
+format, and it is therefore recommended to serve models using `ONNX` in a
+sandboxed environment to safeguard against computational and memory exploits.
 
 Also note that there are no supported ways to load a model trained with a
 different version of scikit-learn. While using :mod:`skops.io`, :mod:`joblib`,
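The warning about untrusted pickle files in the hunk above follows from how unpickling works: the byte stream can name any importable callable and the arguments to call it with. A minimal, harmless sketch of that mechanism, using only the standard library (the `Payload` class and the arithmetic expression are hypothetical and chosen purely for illustration):

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs a callable of its choosing on load."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it when loading.
        # A harmless expression is used here; an attacker could name any
        # importable callable instead.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading does not rebuild a Payload instance: it calls eval("6 * 7").
result = pickle.loads(data)
```

Nothing in `pickle.loads` inspects what the callable does, which is why loading is equivalent to executing code supplied by whoever produced the file.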
@@ -298,7 +309,8 @@ can be caught to obtain the original version the estimator was pickled with::
     warnings.simplefilter("error", InconsistentVersionWarning)
 
     try:
-        est = pickle.loads("model_from_prevision_version.pickle")
+        with open("model_from_previous_version.pickle", "rb") as f:
+            est = pickle.load(f)
     except InconsistentVersionWarning as w:
         print(w.original_sklearn_version)
 
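The fix in the hunk above matters because `pickle.loads` expects the pickled bytes, not a file name. A small standard-library sketch of the difference; the dictionary and the file name are placeholders, not scikit-learn objects:

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    with open(path, "wb") as f:
        pickle.dump({"coef_": [1.0, 2.0]}, f)

    # Passing the path string to pickle.loads fails: it is not bytes-like.
    try:
        pickle.loads(path)
        loads_error = None
    except TypeError as exc:
        loads_error = exc

    # Opening the file in binary mode and calling pickle.load works.
    with open(path, "rb") as f:
        est = pickle.load(f)
```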
@@ -328,22 +340,34 @@ each approach can be summarized as follows:
 * :mod:`skops.io`: Trained scikit-learn models can be easily shared and put
   into production using :mod:`skops.io`. It is more secure compared to
   alternate approaches based on :mod:`pickle` because it does not load
-  arbitrary code unless explicitly asked for by the user.
+  arbitrary code unless explicitly asked for by the user. Such code needs to be
+  packaged and importable in the target Python environment.
 * :mod:`joblib`: Efficient memory mapping techniques make it faster when using
-  the same persisted model in multiple Python processes. It also gives easy
-  shortcuts to compress and decompress the persisted object without the need
-  for extra code. However, it may trigger the execution of malicious code while
-  untrusted data as any other pickle-based persistence mechanism.
-* :mod:`pickle`: It is native to Python and any Python object can be serialized
-  and deserialized using :mod:`pickle`, including custom Python classes and
-  objects. While :mod:`pickle` can be used to easily save and load scikit-learn
-  models, it may trigger the execution of malicious code while loading
-  untrusted data.
-* `cloudpickle`_: It is slower than :mod:`pickle` and :mod:`joblib`, and is
-  more insecure than :mod:`pickle` and :mod:`joblib` since it can serialize
-  arbitrary code. However, in certain cases it might be a last resort to
-  persist certain models. Note that this is discouraged by `cloudpickle`_
-  itself since there are no forward compatibility guarantees and you might need
-  the same version of `cloudpickle`_ to load the persisted model.
+  the same persisted model in multiple Python processes when using
+  `mmap_mode="r"`. It also gives easy shortcuts to compress and decompress the
+  persisted object without the need for extra code. However, it may trigger the
+  execution of malicious code when loading a model from an untrusted source,
+  like any other pickle-based persistence mechanism.
+* :mod:`pickle`: It is native to Python and most Python objects can be
+  serialized and deserialized using :mod:`pickle`, including custom Python
+  classes and functions as long as they are defined in a package that can be
+  imported in the target environment. While :mod:`pickle` can be used to easily
+  save and load scikit-learn models, it may trigger the execution of malicious
+  code while loading a model from an untrusted source. :mod:`pickle` can also
+  be very memory efficient if the model was persisted with `protocol=5`, but
+  it does not support memory mapping.
+* `cloudpickle`_: Its loading efficiency is comparable to :mod:`pickle` and
+  :mod:`joblib` (without memory mapping), but offers additional flexibility to
+  serialize custom Python code such as lambda expressions and interactively
+  defined functions and classes. It might be a last resort to persist pipelines
+  with custom Python components such as a
+  :class:`sklearn.preprocessing.FunctionTransformer` that wraps a function
+  defined in the training script itself or more generally outside of any
+  importable Python package. Note that `cloudpickle`_ offers no forward
+  compatibility guarantees and you might need the same version of
+  `cloudpickle`_ to load the persisted model along with the same version of all
+  the libraries used to define the model. Like other pickle-based persistence
+  mechanisms, it may trigger the execution of malicious code while loading
+  a model from an untrusted source.
 
 .. _cloudpickle: https://github.com/cloudpipe/cloudpickle
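The limitation that motivates `cloudpickle` in the summary above can be seen with the standard library alone: :mod:`pickle` stores functions by reference to an importable name, so a lambda expression cannot be serialized. A minimal sketch (the helper `try_pickle` is a hypothetical name introduced only for this illustration):

```python
import pickle


def try_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Depending on the Python version, either exception may be raised
        # when the object cannot be looked up by an importable name.
        return False


# A lambda has no importable name, so pickle cannot store a reference to it.
lambda_ok = try_pickle(lambda x: x ** 2)

# A builtin function is importable by name and pickles fine.
builtin_ok = try_pickle(len)
```

Serializing the lambda itself would require a library such as `cloudpickle`, which is not used here so that the sketch stays dependency-free.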