Commit 77031b7
PEP 784: Adding Zstandard to the standard library (#4348)
* Add initial PEP text
* Make the deprecation stronger, add a bunch more details
* Add more about the compression namespace
* Move compression section earlier
* Respond to review by Rogdham
* Update PEP number to 784
* Re-target to Python 3.15
* Add note about zipfile integration
* Rewrite the early motivation section based on Greg's advice
* Re-target to Python 3.14, optimistically
* Rewrite the deprecation/removal timeline
* Add Greg to CODEOWNERS for the PEP
* Remove extraneous apostrophe
* Remove open issues section

---------

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
1 parent a06ac7f commit 77031b7

2 files changed

+281
-0
lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -661,6 +661,7 @@ peps/pep-0779.rst @Yhg1s @colesbury @mpage
661661
peps/pep-0780.rst @lysnikolaou
662662
peps/pep-0781.rst @methane
663663
peps/pep-0782.rst @vstinner
664+
peps/pep-0784.rst @gpshead
664665
# ...
665666
peps/pep-0789.rst @njsmith
666667
# ...

peps/pep-0784.rst

Lines changed: 280 additions & 0 deletions
@@ -0,0 +1,280 @@
1+
PEP: 784
2+
Title: Adding Zstandard to the standard library
3+
Author: Emma Harper Smith <emma@python.org>
4+
Sponsor: Gregory P. Smith <greg@krypto.org>
5+
Status: Draft
6+
Type: Standards Track
7+
Created: 06-Apr-2025
8+
Python-Version: 3.14
9+
10+
Abstract
11+
========
12+
13+
`Zstandard <https://facebook.github.io/zstd/>`_ is a widely adopted, mature,
14+
and highly efficient compression standard. This PEP proposes adding a new
15+
module to the Python standard library containing a Python wrapper around Meta's
16+
``zstd`` library, the default implementation. Additionally, to avoid name
17+
collisions with packages on PyPI and to present a unified interface to Python
18+
users, compression modules in the standard library will be moved under a
19+
``compression.*`` namespace package.
20+
21+
Motivation
22+
==========
23+
24+
CPython has modules for several different compression formats, such as `zlib
25+
(DEFLATE) <https://docs.python.org/3/library/zlib.html>`_,
26+
`bzip2 <https://docs.python.org/3/library/bz2.html>`_,
27+
and `lzma <https://docs.python.org/3/library/lzma.html>`_, each widely used.
28+
Including popular compression algorithms matches Python's "batteries included"
29+
philosophy of incorporating widely useful standards and utilities. The last
30+
compression module added to the language was ``lzma``, added in Python 3.3.
31+
32+
Since then, Zstandard has become the de facto preferred compression library,
33+
offering high-performance compression and decompression while attaining high
34+
compression ratios at reasonable CPU and memory cost. Zstandard achieves a much
35+
higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing
36+
significantly faster than LZMA.
37+
38+
Zstandard has seen `widespread adoption in many different areas of computing
39+
<https://facebook.github.io/zstd/#references>`_. The numerous hardware
40+
implementations demonstrate long-term commitment to Zstandard and an
41+
expectation that Zstandard will stay the de facto choice for compression for
42+
years to come. Zstandard compression is also implemented in both the ZFS and
43+
Btrfs filesystems.
44+
45+
Zstandard has supplanted other modern
46+
compression formats, such as brotli, lzo, and ucl, due to its highly efficient
47+
compression. While `LZ4 <https://lz4.org/>`_ is still used in very high
48+
throughput scenarios, Zstandard can also be used in some of these contexts.
49+
While inclusion of LZ4 is out of scope, it would be a compelling future
50+
addition to the ``compression`` namespace introduced by this PEP.
51+
52+
There are several bindings to Zstandard for Python available on PyPI, each with
53+
different APIs and choices of how to bind the ``zstd`` library. One goal with
54+
introducing an official module in the standard library is to reduce confusion
55+
for Python users who want simple compression/decompression APIs for Zstandard.
56+
The existing packages can continue providing extended APIs and bindings for
57+
other Python implementations such as PyPy or integrate features from newer
58+
Zstandard versions.
59+
60+
Another reason to add Zstandard support to the standard library is to resolve
61+
a long-standing `open issue <https://github.com/python/cpython/issues/81276>`_
62+
requesting Zstandard support in the ``tarfile`` module. This issue has the 5th
63+
most "thumbs up" of open issues on the CPython tracker, and has garnered a
64+
significant amount of discussion and interest. Additionally, the `ZIP format
65+
standardizes a Zstandard compression format ID
66+
<https://pkwaredownloads.blob.core.windows.net/pkware-general/Documentation/APPNOTE-6.3.8.TXT>`_,
67+
and integration with ``zipfile`` would allow opening ZIP archives using
68+
Zstandard compression. The reference implementation for this PEP contains
69+
integration with the ``zipfile``, ``tarfile``, and ``shutil`` modules.
70+
71+
Zstandard compression could also be used to make Python wheel packages smaller
72+
and significantly faster to install. Anaconda found a sizeable speedup when
73+
adopting Zstandard for the conda package format:
74+
75+
.. epigraph::
76+
77+
   Conda's download sizes are reduced ~30-40%, and extraction is dramatically faster.
78+
   [...]
79+
   We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format.
80+
81+
   -- `Anaconda blog on Zstandard adoption <https://www.anaconda.com/blog/how-we-made-conda-faster-4-7>`_
82+
83+
`According to lzbench <https://github.com/inikep/lzbench?tab=readme-ov-file#benchmarks>`_,
84+
a comprehensive benchmark of many different compression libraries and formats,
85+
Zstandard has a significantly higher compression ratio compared to wheel's
86+
existing zlib-based compression. While this PEP does *not* prescribe any
87+
changes to the wheel format or other packaging standards, having Zstandard
88+
bindings in the standard library would enable a future PEP to improve the user
89+
experience for Python wheel packages.
90+
91+
Rationale
92+
=========
93+
94+
Introduction of a ``compression`` namespace
95+
-------------------------------------------
96+
97+
Both the ``zstd`` and ``zstandard`` import names are claimed by projects on
98+
PyPI. To avoid breaking users of one of the existing bindings, this PEP
99+
proposes introducing a new namespace for compression libraries,
100+
``compression``. This name is already reserved on PyPI for use in the
101+
standard library. The new Zstandard module will be ``compression.zstd``.
102+
Other compression modules will be re-exported to the ``compression`` namespace
103+
and their current import names will be deprecated.
104+
105+
Providing a common namespace for compression modules has several advantages.
106+
First, it reduces user confusion about where to find compression modules.
107+
Second, the top level ``compression`` module could provide information on which
108+
compression formats are available, similar to ``hashlib``'s
109+
``algorithms_available``. If :pep:`775` is accepted, a
110+
``compression.algorithms_guaranteed`` could be provided as well, listing
111+
``zlib``. Finally, a ``compression`` namespace prevents future issues with
112+
merging other compression formats into the standard library. New compression
113+
formats will likely be published to PyPI prior to integration into
114+
CPython. Therefore, any new compression format import name will likely already
115+
be claimed by the time a module would be considered for inclusion in CPython.
116+
Putting compression modules under a package prefix prevents issues with
117+
potential future name clashes.
118+
119+
Code that would like to remain compatible across Python versions may use the
120+
following pattern to ensure compatibility::
121+
122+
    try:
123+
        from compression.lzma import LZMAFile
124+
    except ImportError:
125+
        from lzma import LZMAFile
126+
127+
This will use the newer import name when available and fall back to the old
128+
name otherwise.
129+
130+
Implementation based on ``pyzstd``
131+
----------------------------------
132+
133+
The implementation for this PEP is based on the `pyzstd project <https://github.com/Rogdham/pyzstd>`_.
134+
This project was chosen as the code was `originally written to be upstreamed <https://github.com/python/cpython/issues/81276#issuecomment-1093824963>`_
135+
to CPython by Ma Lin, who also wrote the `output buffer implementation used in
136+
the standard library today <https://github.com/python/cpython/commit/f9bedb630e8a0b7d94e1c7e609b20dfaa2b22231>`_.
137+
The project has since been taken over by Rogdham and is published to PyPI. The
138+
APIs in ``pyzstd`` are similar to the APIs for other compression modules in the
139+
standard library such as ``bz2`` and ``lzma``.
140+
141+
Minimum supported Zstandard version
142+
-----------------------------------
143+
144+
The minimum supported Zstandard version was chosen as v1.4.5, released in May 2020.
145+
This version was chosen as a minimum based on reviewing the versions of
146+
Zstandard available in a number of Linux distribution package repositories,
147+
including LTS releases. This version choice is rather conservative to maximize
148+
compatibility with existing LTS Linux distributions, but a newer Zstandard
149+
version could likely be chosen given that newer Python releases are generally
150+
packaged as part of newer distribution releases.
151+
152+
Specification
153+
=============
154+
155+
The ``compression`` namespace
156+
-----------------------------
157+
158+
A new namespace package for compression modules will be introduced named
159+
``compression``. The top-level module for this package will be empty to begin
160+
with, but a standard API for interacting with compression routines may be
161+
added to the top level in the future.
162+
163+
The ``compression.zstd`` module
164+
-------------------------------
165+
166+
A new module, ``compression.zstd``, will be introduced with Zstandard
167+
compression APIs that match other compression modules in the standard library,
168+
namely:
169+
170+
* ``compress`` / ``decompress`` - APIs for one-shot compression/decompression
171+
* ``ZstdFile`` / ``open`` - APIs for interacting with streams and file-like
172+
  objects
173+
* ``ZstdCompressor`` / ``ZstdDecompressor`` - APIs for incremental compression/
174+
  decompression
175+
176+
It will also contain some Zstandard-specific functionality:
177+
178+
* ``ZstdDict`` / ``train_dict`` / ``finalize_dict`` - APIs for interacting with
179+
  Zstandard dictionaries, which are useful for compressing many small chunks of
180+
  similar data
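
To make the shared API shape concrete, here is a minimal sketch of the one-shot and file-based calls. It assumes ``compression.zstd`` exposes ``compress``, ``decompress``, and ``open`` as listed above. On interpreters without the module, it falls back to ``lzma`` purely as a stand-in with the same call shape. The file name ``payload.bin`` is arbitrary.

```python
import os
import tempfile

try:
    # Proposed import name; only present where this PEP is implemented.
    from compression import zstd as codec
except ImportError:
    # Stand-in with the same API shape, used here only for illustration.
    import lzma as codec

payload = b"batteries included " * 1000

# One-shot API: compress() / decompress() operate on whole byte strings.
blob = codec.compress(payload)
restored = codec.decompress(blob)

# File API: open() returns a file-like object that (de)compresses on the fly.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.bin")
    with codec.open(path, "wb") as f:
        f.write(payload)
    with codec.open(path, "rb") as f:
        from_file = f.read()
```

Because the call shapes match, code written against ``bz2`` or ``lzma`` should need little more than a changed import to try Zstandard.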
181+
182+
``libzstd`` optional dependency
183+
-------------------------------
184+
185+
The ``libzstd`` library will become an optional dependency of CPython. If the
186+
library is not available, the ``compression.zstd`` module will be unavailable.
187+
This is handled automatically on Unix platforms as part of the normal build
188+
environment detection.
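
Since the module may be missing on some builds, portable code probably wants a fallback. The sketch below is one way to handle that. The ``pack`` and ``unpack`` helpers and the format tags are hypothetical, and ``zlib`` is chosen as the fallback only for illustration.

```python
import zlib

try:
    # Proposed module; unavailable if CPython was built without libzstd.
    from compression import zstd
except ImportError:
    zstd = None


def pack(data: bytes) -> tuple[str, bytes]:
    """Compress data, returning (format_tag, compressed_bytes)."""
    if zstd is not None:
        return "zstd", zstd.compress(data)
    return "zlib", zlib.compress(data)


def unpack(fmt: str, blob: bytes) -> bytes:
    """Decompress a blob produced by pack(), based on its format tag."""
    if fmt == "zstd":
        if zstd is None:
            raise RuntimeError("this Python was built without Zstandard support")
        return zstd.decompress(blob)
    if fmt == "zlib":
        return zlib.decompress(blob)
    raise ValueError(f"unknown format: {fmt!r}")
```

Storing the format tag next to the data lets a reader on a build without ``libzstd`` fail with a clear error instead of misinterpreting the bytes.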
189+
190+
On Windows, ``libzstd`` will be added to
191+
`the source dependencies <https://github.com/python/cpython-source-deps>`_
192+
used to build libraries CPython depends on for Windows.
193+
194+
Other compression modules
195+
-------------------------
196+
197+
New import names ``compression.lzma``, ``compression.bz2``, and
198+
``compression.zlib`` will be introduced in Python 3.14 re-exporting the
199+
contents of the existing ``lzma``, ``bz2``, and ``zlib`` modules respectively.
200+
201+
The ``_compression`` module, given that it is marked private, will be
202+
immediately renamed to ``compression._common.streams``. The new name was
203+
selected due to the current contents of the module being I/O related helpers
204+
for stream APIs (e.g. ``LZMAFile``) in standard library compression modules.
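
The PEP does not spell out how the re-exports are implemented. As a sketch of the idea only, the snippet below builds an in-memory module that re-exports the public names of ``zlib``. The package name ``compat_compression`` and the helper ``make_reexport`` are hypothetical, chosen so the example cannot shadow a real ``compression`` package.

```python
import sys
import types
import zlib


def make_reexport(new_name, original):
    """Create a module object exposing the public names of `original`."""
    shim = types.ModuleType(new_name, original.__doc__)
    # Prefer an explicit __all__; otherwise fall back to non-underscore names.
    public = getattr(original, "__all__", None) or [
        name for name in dir(original) if not name.startswith("_")
    ]
    for name in public:
        setattr(shim, name, getattr(original, name))
    shim.__all__ = list(public)
    return shim


# Register the hypothetical package and its re-exporting submodule.
sys.modules["compat_compression"] = types.ModuleType("compat_compression")
sys.modules["compat_compression.zlib"] = make_reexport(
    "compat_compression.zlib", zlib
)

from compat_compression.zlib import crc32  # resolves to zlib.crc32
```

Both names refer to the same objects, so ``isinstance`` checks and exception handling keep working regardless of which import path a caller used.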
205+
206+
Compression module migration timeline
207+
-------------------------------------
208+
209+
Existing modules will emit a ``DeprecationWarning`` in the Python
210+
release following the last Python without the ``compression`` module leaving
211+
support. For example, if the ``compression`` namespace is introduced in 3.14,
212+
then the ``DeprecationWarning`` would first be emitted in 3.19, the next release
213+
after 3.13 reaches end of life. Following the standard deprecation timeline
214+
specified in :pep:`387`, in Python 3.24 the existing modules will be removed
215+
and code must use the ``compression`` sub-modules. The documentation for these
216+
modules will be updated to discuss the planned deprecation and removal
217+
timelines.
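
Projects that want to find old import names before removal can surface ``DeprecationWarning`` in their test suites. The sketch below simulates the warning with a hypothetical helper, ``simulated_old_import``. The message text is invented, and the PEP does not specify how or where the real warning would be raised.

```python
import warnings


def simulated_old_import():
    # Hypothetical stand-in for what a deprecated import name might do.
    warnings.warn(
        "'lzma' is deprecated; import 'compression.lzma' instead",
        DeprecationWarning,
        stacklevel=2,
    )


# Record every warning instead of letting the default filters hide it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    simulated_old_import()

messages = [
    str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
]
```

Running a test suite with ``python -W error::DeprecationWarning`` has a similar effect, turning such warnings into exceptions.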
218+
219+
220+
Backwards Compatibility
221+
=======================
222+
223+
The main compatibility concern is usage of existing standard library
224+
compression APIs with the existing import names. These names will be
225+
deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of
226+
the modules and a 5-year deprecation period, most users will likely migrate to
227+
the new import names well before then. Additionally, a libCST codemod can be
228+
provided to automatically rewrite imports, easing the migration.
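
To give a feel for what such a rewrite involves, here is a rough sketch that uses the standard library ``ast`` module instead of libCST. Unlike a libCST codemod, ``ast.unparse`` discards comments and original formatting, so this only illustrates the transformation. The class and function names are hypothetical.

```python
import ast

# Import names that gain a ``compression.*`` equivalent under this PEP.
OLD_NAMES = {"lzma", "bz2", "zlib"}


class CompressionImportRewriter(ast.NodeTransformer):
    def visit_Import(self, node):
        # ``import lzma`` becomes ``from compression import lzma``;
        # unrelated names on the same statement are kept as a plain import.
        kept = [a for a in node.names if a.name not in OLD_NAMES]
        moved = [a for a in node.names if a.name in OLD_NAMES]
        new_nodes = []
        if kept:
            new_nodes.append(ast.Import(names=kept))
        for alias in moved:
            new_nodes.append(
                ast.ImportFrom(
                    module="compression",
                    names=[ast.alias(name=alias.name, asname=alias.asname)],
                    level=0,
                )
            )
        return new_nodes

    def visit_ImportFrom(self, node):
        # ``from lzma import X`` becomes ``from compression.lzma import X``.
        if node.level == 0 and node.module in OLD_NAMES:
            node.module = f"compression.{node.module}"
        return node


def rewrite_imports(source):
    """Return `source` with old compression imports rewritten."""
    tree = CompressionImportRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

For example, ``rewrite_imports("import lzma")`` yields ``from compression import lzma``, which binds the same local name as before.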
229+
230+
Security Implications
231+
=====================
232+
233+
As with any new C code, especially code operating on potentially untrusted user
234+
input, there are risks of memory safety issues. The author plans on
235+
contributing integration with libfuzzer to enable fuzzing the ``_zstd`` code
236+
and ensure it is robust. Furthermore, there are a number of tests that exercise
237+
the compression and decompression routines. These tests pass without error when
238+
compiled with AddressSanitizer.
239+
240+
Taking on a new dependency also always has security risks, but the ``zstd``
241+
library is mature, fuzzed on each commit, and `participates in Meta's bug bounty
242+
program <https://github.com/facebook/zstd/blob/dev/SECURITY.md>`_.
243+
244+
How to Teach This
245+
=================
246+
247+
Documentation for the new module is in the reference implementation branch. The
248+
documentation for other modules will be updated to discuss the deprecation of
249+
their existing import names, and how to migrate.
250+
251+
Reference Implementation
252+
========================
253+
254+
The `reference implementation <https://github.com/emmatyping/cpython/tree/zstd>`_
255+
contains the ``_zstd`` C code, the ``compression.zstd`` code, modifications to
256+
``tarfile``, ``shutil``, and ``zipfile``, and tests for each new API and
257+
integration added. It also contains the re-exports of other compression
258+
modules. Deprecations for the existing import names will be added once a
259+
decision is reached regarding the open issues.
260+
261+
Rejected Ideas
262+
==============
263+
264+
Name the module ``libzstd`` and do not make a new ``compression`` namespace
265+
---------------------------------------------------------------------------
266+
267+
One option instead of making a new ``compression`` namespace would be to find
268+
a different name, such as ``libzstd``, as the import name. However, the issue
269+
of existing import names is likely to persist for future compression formats
270+
added to the standard library. LZ4, a common high speed compression format,
271+
has `a package on PyPI <https://pypi.org/project/lz4/>`_, ``lz4``, with the
272+
import name ``lz4``. Instead of solving this issue for each compression format,
273+
it is better to solve it once and for all by using the already-claimed
274+
``compression`` namespace.
275+
276+
Copyright
277+
=========
278+
279+
This document is placed in the public domain or under the
280+
CC0-1.0-Universal license, whichever is more permissive.
