| 1 | +PEP: 784 |
| 2 | +Title: Adding Zstandard to the standard library |
| 3 | +Author: Emma Harper Smith <emma@python.org> |
| 4 | +Sponsor: Gregory P. Smith <greg@krypto.org> |
| 5 | +Status: Draft |
| 6 | +Type: Standards Track |
| 7 | +Created: 06-Apr-2025 |
| 8 | +Python-Version: 3.14 |
| 9 | + |
| 10 | +Abstract |
| 11 | +======== |
| 12 | + |
| 13 | +`Zstandard <https://facebook.github.io/zstd/>`_ is a widely adopted, mature, |
| 14 | +and highly efficient compression standard. This PEP proposes adding a new |
| 15 | +module to the Python standard library containing a Python wrapper around Meta's |
| 16 | +``zstd`` library, the default implementation. Additionally, to avoid name |
| 17 | +collisions with packages on PyPI and to present a unified interface to Python |
| 18 | +users, compression modules in the standard library will be moved under a |
| 19 | +``compression.*`` namespace package. |
| 20 | + |
| 21 | +Motivation |
| 22 | +========== |
| 23 | + |
| 24 | +CPython has modules for several different compression formats, such as `zlib |
| 25 | +(DEFLATE) <https://docs.python.org/3/library/zlib.html>`_, |
| 26 | +`bzip2 <https://docs.python.org/3/library/bz2.html>`_, |
| 27 | +and `lzma <https://docs.python.org/3/library/lzma.html>`_, each widely used. |
| 28 | +Including popular compression algorithms matches Python's "batteries included" |
| 29 | +philosophy of incorporating widely useful standards and utilities. The last |
| 30 | +compression module added to the language was ``lzma``, added in Python 3.3. |
| 31 | + |
| 32 | +Since then, Zstandard has become the de facto preferred compression library |
| 33 | +when both fast compression and decompression and high compression ratios are |
| 34 | +needed at reasonable CPU and memory cost. Zstandard achieves a much |
| 35 | +higher compression ratio than bzip2 or zlib (DEFLATE) while decompressing |
| 36 | +significantly faster than LZMA. |
| 37 | + |
| 38 | +Zstandard has seen `widespread adoption in many different areas of computing |
| 39 | +<https://facebook.github.io/zstd/#references>`_. The numerous hardware |
| 40 | +implementations demonstrate long-term commitment to Zstandard and an |
| 41 | +expectation that Zstandard will stay the de facto choice for compression for |
| 42 | +years to come. Zstandard compression is also implemented in both the ZFS and |
| 43 | +Btrfs filesystems. |
| 44 | + |
| 45 | +Zstandard has supplanted other modern compression formats, such as brotli, |
| 46 | +lzo, and ucl, due to its highly efficient compression. While |
| 47 | +`LZ4 <https://lz4.org/>`_ is still used in very high |
| 48 | +throughput scenarios, Zstandard can also be used in some of these contexts. |
| 49 | +While inclusion of LZ4 is out of scope, it would be a compelling future |
| 50 | +addition to the ``compression`` namespace introduced by this PEP. |
| 51 | + |
| 52 | +There are several bindings to Zstandard for Python available on PyPI, each with |
| 53 | +different APIs and choices of how to bind the ``zstd`` library. One goal with |
| 54 | +introducing an official module in the standard library is to reduce confusion |
| 55 | +for Python users who want simple compression/decompression APIs for Zstandard. |
| 56 | +The existing packages can continue providing extended APIs and bindings for |
| 57 | +other Python implementations such as PyPy or integrate features from newer |
| 58 | +Zstandard versions. |
| 59 | + |
| 60 | +Another reason to add Zstandard support to the standard library is to resolve |
| 61 | +a long-standing `open issue <https://github.com/python/cpython/issues/81276>`_ |
| 62 | +requesting Zstandard support in the ``tarfile`` module. This issue has the 5th |
| 63 | +most "thumbs up" of open issues on the CPython tracker, and has garnered a |
| 64 | +significant amount of discussion and interest. Additionally, the `ZIP format |
| 65 | +standardizes a Zstandard compression format ID |
| 66 | +<https://pkwaredownloads.blob.core.windows.net/pkware-general/Documentation/APPNOTE-6.3.8.TXT>`_, |
| 67 | +and integration with ``zipfile`` would allow opening ZIP archives using |
| 68 | +Zstandard compression. The reference implementation for this PEP contains |
| 69 | +integration with the ``zipfile``, ``tarfile``, and ``shutil`` modules. |
| 70 | + |
| 71 | +Zstandard compression could also be used to make Python wheel packages smaller |
| 72 | +and significantly faster to install. Anaconda found a sizeable speedup when |
| 73 | +adopting Zstandard for the conda package format: |
| 74 | + |
| 75 | +.. epigraph:: |
| 76 | + |
| 77 | +   Conda's download sizes are reduced ~30-40%, and extraction is dramatically faster. |
| 78 | +   [...] |
| 79 | +   We see approximately a 2.5x overall speedup, almost all thanks to the dramatically faster extraction speed of the zstd compression used in the new file format. |
| 80 | + |
| 81 | +   -- `Anaconda blog on Zstandard adoption <https://www.anaconda.com/blog/how-we-made-conda-faster-4-7>`_ |
| 82 | + |
| 83 | +`According to lzbench <https://github.com/inikep/lzbench?tab=readme-ov-file#benchmarks>`_, |
| 84 | +a comprehensive benchmark of many different compression libraries and formats, |
| 85 | +Zstandard has a significantly higher compression ratio compared to wheel's |
| 86 | +existing zlib-based compression. While this PEP does *not* prescribe any |
| 87 | +changes to the wheel format or other packaging standards, having Zstandard |
| 88 | +bindings in the standard library would enable a future PEP to improve the user |
| 89 | +experience for Python wheel packages. |
| 90 | + |
| 91 | +Rationale |
| 92 | +========= |
| 93 | + |
| 94 | +Introduction of a ``compression`` namespace |
| 95 | +------------------------------------------- |
| 96 | + |
| 97 | +Both the ``zstd`` and ``zstandard`` import names are claimed by projects on |
| 98 | +PyPI. To avoid breaking users of one of the existing bindings, this PEP |
| 99 | +proposes introducing a new namespace for compression libraries, |
| 100 | +``compression``. This name is already reserved on PyPI for use in the |
| 101 | +standard library. The new Zstandard module will be ``compression.zstd``. |
| 102 | +Other compression modules will be re-exported to the ``compression`` namespace |
| 103 | +and their current import names will be deprecated. |
| 104 | + |
| 105 | +Providing a common namespace for compression modules has several advantages. |
| 106 | +First, it reduces user confusion about where to find compression modules. |
| 107 | +Second, the top level ``compression`` module could provide information on which |
| 108 | +compression formats are available, similar to ``hashlib``'s |
| 109 | +``algorithms_available``. If :pep:`775` is accepted, a |
| 110 | +``compression.algorithms_guaranteed`` could be provided as well, listing |
| 111 | +``zlib``. Finally, a ``compression`` namespace prevents future issues with |
| 112 | +merging other compression formats into the standard library. New compression |
| 113 | +formats will likely be published to PyPI prior to integration into |
| 114 | +CPython. Therefore, any new compression format import name will likely already |
| 115 | +be claimed by the time a module would be considered for inclusion in CPython. |
| 116 | +Putting compression modules under a package prefix prevents issues with |
| 117 | +potential future name clashes. |
| 118 | + |
| 119 | +Code that would like to remain compatible across Python versions may use the |
| 120 | +following pattern to ensure compatibility:: |
| 121 | + |
| 122 | +    try: |
| 123 | +        from compression.lzma import LZMAFile |
| 124 | +    except ImportError: |
| 125 | +        from lzma import LZMAFile |
| 126 | + |
| 127 | +This will use the newer import name when available and fall back to the old |
| 128 | +name otherwise. |
| 129 | + |
| 130 | +Implementation based on ``pyzstd`` |
| 131 | +---------------------------------- |
| 132 | + |
| 133 | +The implementation for this PEP is based on the `pyzstd project <https://github.com/Rogdham/pyzstd>`_. |
| 134 | +This project was chosen as the code was `originally written to be upstreamed <https://github.com/python/cpython/issues/81276#issuecomment-1093824963>`_ |
| 135 | +to CPython by Ma Lin, who also wrote the `output buffer implementation used in |
| 136 | +the standard library today <https://github.com/python/cpython/commit/f9bedb630e8a0b7d94e1c7e609b20dfaa2b22231>`_. |
| 137 | +The project has since been taken over by Rogdham and is published to PyPI. The |
| 138 | +APIs in ``pyzstd`` are similar to the APIs for other compression modules in the |
| 139 | +standard library such as ``bz2`` and ``lzma``. |
| 140 | + |
| 141 | +Minimum supported Zstandard version |
| 142 | +----------------------------------- |
| 143 | + |
| 144 | +The minimum supported Zstandard was chosen as v1.4.5, released in May of 2020. |
| 145 | +This version was chosen as a minimum based on reviewing the versions of |
| 146 | +Zstandard available in a number of Linux distribution package repositories, |
| 147 | +including LTS releases. This version choice is rather conservative to maximize |
| 148 | +compatibility with existing LTS Linux distributions, but a newer Zstandard |
| 149 | +version could likely be chosen given that newer Python releases are generally |
| 150 | +packaged as part of newer distribution releases. |
| 151 | + |
| 152 | +Specification |
| 153 | +============= |
| 154 | + |
| 155 | +The ``compression`` namespace |
| 156 | +----------------------------- |
| 157 | + |
| 158 | +A new namespace package for compression modules will be introduced named |
| 159 | +``compression``. The top-level module for this package will be empty to begin |
| 160 | +with, but a standard API for interacting with compression routines may be |
| 161 | +added to the top level in the future. |
| 162 | + |
| 163 | +The ``compression.zstd`` module |
| 164 | +------------------------------- |
| 165 | + |
| 166 | +A new module, ``compression.zstd``, will be introduced with Zstandard |
| 167 | +compression APIs that match other compression modules in the standard library, |
| 168 | +namely: |
| 169 | + |
| 170 | +* ``compress`` / ``decompress`` - APIs for one-shot compression/decompression |
| 171 | +* ``ZstdFile`` / ``open`` - APIs for interacting with streams and file-like |
| 172 | + objects |
| 173 | +* ``ZstdCompressor`` / ``ZstdDecompressor`` - APIs for incremental compression/ |
| 174 | + decompression |
| 175 | + |
| 176 | +It will also contain some Zstandard-specific functionality: |
| 177 | + |
| 178 | +* ``ZstdDict`` / ``train_dict`` / ``finalize_dict`` - APIs for interacting with |
| 179 | + Zstandard dictionaries, which are useful for compressing many small chunks of |
| 180 | + similar data |
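
A minimal sketch of how the one-shot and incremental APIs listed above might be used. The function and class names come from the list; the exact call signatures (a no-argument ``ZstdCompressor()``, and ``flush()`` finishing the frame) are assumptions modelled on the existing ``bz2`` and ``lzma`` modules. The import is guarded so the snippet degrades gracefully where the module is unavailable.

```python
try:
    from compression import zstd  # import name proposed by this PEP
except ImportError:
    zstd = None  # older Python, or a build without libzstd


def roundtrip(data):
    """One-shot compress then decompress; returns None if zstd is missing."""
    if zstd is None:
        return None
    return zstd.decompress(zstd.compress(data))


def compress_chunks(chunks):
    """Incrementally compress an iterable of bytes chunks into one frame.

    Assumes ZstdCompressor mirrors the compress()/flush() shape of
    bz2.BZ2Compressor and lzma.LZMACompressor.
    """
    if zstd is None:
        return None
    compressor = zstd.ZstdCompressor()
    parts = [compressor.compress(chunk) for chunk in chunks]
    parts.append(compressor.flush())  # emit any buffered data and end the frame
    return b"".join(parts)
```

The incremental form is useful when the input arrives in pieces, for example when reading from a socket, and the full payload should not be held in memory at once.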
| 181 | + |
| 182 | +``libzstd`` optional dependency |
| 183 | +------------------------------- |
| 184 | + |
| 185 | +The ``libzstd`` library will become an optional dependency of CPython. If the |
| 186 | +library is not available, the ``compression.zstd`` module will be unavailable. |
| 187 | +This is handled automatically on Unix platforms as part of the normal build |
| 188 | +environment detection. |
| 189 | + |
| 190 | +On Windows, ``libzstd`` will be added to |
| 191 | +`the source dependencies <https://github.com/python/cpython-source-deps>`_ |
| 192 | +used to build libraries CPython depends on for Windows. |
| 193 | + |
| 194 | +Other compression modules |
| 195 | +------------------------- |
| 196 | + |
| 197 | +New import names ``compression.lzma``, ``compression.bz2``, and |
| 198 | +``compression.zlib`` will be introduced in Python 3.14 re-exporting the |
| 199 | +contents of the existing ``lzma``, ``bz2``, and ``zlib`` modules respectively. |
| 200 | + |
| 201 | +The ``_compression`` module, given that it is marked private, will be |
| 202 | +immediately renamed to ``compression._common.streams``. The new name was |
| 203 | +selected due to the current contents of the module being I/O related helpers |
| 204 | +for stream APIs (e.g. ``LZMAFile``) in standard library compression modules. |
| 205 | + |
| 206 | +Compression module migration timeline |
| 207 | +------------------------------------- |
| 208 | + |
| 209 | +Existing modules will emit a ``DeprecationWarning`` starting with the first |
| 210 | +Python release after the last Python version without the ``compression`` |
| 211 | +module reaches end of life. For example, if the ``compression`` namespace is |
| 212 | +introduced in 3.14, the warning would first be emitted in 3.19, the next release |
| 213 | +after 3.13 reaches end of life. Following the standard deprecation timeline |
| 214 | +specified in :pep:`387`, in Python 3.24 the existing modules will be removed |
| 215 | +and code must use the ``compression`` sub-modules. The documentation for these |
| 216 | +modules will be updated to discuss the planned deprecation and removal |
| 217 | +timelines. |
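
For illustration only, the warning described above could be raised with the standard ``warnings`` machinery along these lines. The helper name ``warn_legacy_import`` and the message wording are hypothetical; the PEP does not specify how the existing modules will emit the warning.

```python
import warnings


def warn_legacy_import(old_name, removal="3.24"):
    """Emit a DeprecationWarning steering users to the new import name.

    Hypothetical helper: ``old_name`` is a legacy top-level name such as
    "lzma"; ``removal`` is the release in which it is slated to disappear.
    """
    new_name = f"compression.{old_name}"
    warnings.warn(
        f"the {old_name!r} import name is deprecated and slated for removal "
        f"in Python {removal}; use {new_name!r} instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Code that must run on both old and new Python versions can avoid the warning entirely by trying the ``compression.*`` import name first and falling back to the legacy name on ``ImportError``.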
| 218 | + |
| 219 | + |
| 220 | +Backwards Compatibility |
| 221 | +======================= |
| 222 | + |
| 223 | +The main compatibility concern is usage of existing standard library |
| 224 | +compression APIs with the existing import names. These names will be |
| 225 | +deprecated in 3.19 and will be removed in 3.24. Given the long coexistence of |
| 226 | +the modules and a 5-year deprecation period, most users will likely migrate to |
| 227 | +the new import names well before then. Additionally, a libCST codemod can be |
| 228 | +provided to automatically rewrite imports, easing the migration. |
| 229 | + |
| 230 | +Security Implications |
| 231 | +===================== |
| 232 | + |
| 233 | +As with any new C code, especially code operating on potentially untrusted user |
| 234 | +input, there are risks of memory safety issues. The author plans on |
| 235 | +contributing integration with libfuzzer to enable fuzzing the ``_zstd`` code |
| 236 | +and ensure it is robust. Furthermore, there are a number of tests that exercise |
| 237 | +the compression and decompression routines. These tests pass without error when |
| 238 | +compiled with AddressSanitizer. |
| 239 | + |
| 240 | +Taking on a new dependency also always has security risks, but the ``zstd`` |
| 241 | +library is mature, fuzzed on each commit, and `participates in Meta's bug bounty |
| 242 | +program <https://github.com/facebook/zstd/blob/dev/SECURITY.md>`_. |
| 243 | + |
| 244 | +How to Teach This |
| 245 | +================= |
| 246 | + |
| 247 | +Documentation for the new module is in the reference implementation branch. The |
| 248 | +documentation for other modules will be updated to discuss the deprecation of |
| 249 | +their existing import names, and how to migrate. |
| 250 | + |
| 251 | +Reference Implementation |
| 252 | +======================== |
| 253 | + |
| 254 | +The `reference implementation <https://github.com/emmatyping/cpython/tree/zstd>`_ |
| 255 | +contains the ``_zstd`` C code, the ``compression.zstd`` code, modifications to |
| 256 | +``tarfile``, ``shutil``, and ``zipfile``, and tests for each new API and |
| 257 | +integration added. It also contains the re-exports of other compression |
| 258 | +modules. Deprecations for the existing import names will be added once a |
| 259 | +decision is reached regarding the open issues. |
| 260 | + |
| 261 | +Rejected Ideas |
| 262 | +============== |
| 263 | + |
| 264 | +Name the module ``libzstd`` and do not make a new ``compression`` namespace |
| 265 | +--------------------------------------------------------------------------- |
| 266 | + |
| 267 | +One option instead of making a new ``compression`` namespace would be to find |
| 268 | +a different name, such as ``libzstd``, as the import name. However, the issue |
| 269 | +of existing import names is likely to persist for future compression formats |
| 270 | +added to the standard library. LZ4, a common high speed compression format, |
| 271 | +has `a package on PyPI <https://pypi.org/project/lz4/>`_, ``lz4``, with the |
| 272 | +import name ``lz4``. Instead of solving this issue for each compression format, |
| 273 | +it is better to solve it once and for all by using the already-claimed |
| 274 | +``compression`` namespace. |
| 275 | + |
| 276 | +Copyright |
| 277 | +========= |
| 278 | + |
| 279 | +This document is placed in the public domain or under the |
| 280 | +CC0-1.0-Universal license, whichever is more permissive. |