You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
zipfile.is_zipfile has false positives way too easily.
I just have seen it in practive when a MoinMoin wiki site with a lot of pdf attachments crashed with 500. This was caused by a valid PDF that just happened to contain PK\005\006 somewhere in the middle - this was enough to satisfy is_zipfile() and triggered further processing as a zipfile, which then crashed with IOError (which was not catched in our code, yet).
I have looked into zipfile code: if the usual EOCD structure (with empty comment) is not at EOF, it is suspected that there might be a non-empty comment and ~64K before EOF are searched for the PK\005\006 magic. If it is somewhere there, it is assumed that the file is a zip, without any further validity check.
Attached is a failure demo that works with at least 2.7 and 3.5.
It either does not contain a comment (this is what the existing code checks first) or it contains a comment of the length that is specified in the structure.
The patch checks consistency specified length vs. real length (end of fixed part of structure up to EOF). If this does not match, it is likely not a zip file, but just a file that happens to have the magic 4 bytes somewhere in its last 64kB.
The problem is that the zipfile module supports even not well-formed archives, with a data appended past a comment, and with truncated comment. There are special tests for this, and the proposed patch breaks these tests: test_comments, test_ignores_newline_at_end, test_ignores_stuff_appended_past_comments. See bpo-10694 and bpo-1622.
No, checking the first bytes of the file is not appropriate option. zipfile should support the Python zip application format [1].
I see two options:
Make is_zipfile() more strict that the ZipFile constructor. The later supports ZIP files with a data past the comment or with truncated comments, but the former should reject them.
Make both is_zipfile() and the ZipFile constructor more robust. They should check not just the EOCD signature, but check the Zip64 end of central directory record (if exists) and the first central file header signature (if the ZIP file is not empty).
It may be that PDF files contain PK\005\006 not accidentally, but because they contain embedded ZIP files (I don't know if this is a case). In that circumstances is_zipfile() returning True is correct.
Fix submitted that evaluates the ECD structure and validates the first CD entry. The fix also handles empty zipfiles.
IMO the purpose of this API is to *quickly* verify that the file is a valid zipfile. With this fix, the API only reads another 46 bytes of data (after a seek, of course). This should still qualify as "quick", especially after having potentially read 64k of data.
Perhaps a full zip validator would be appropriate in addition to is_zipfile. That would be more appropriate as a full feature rather than in this bugfix.
it's a bugfix, it seems reasonable for 3.7 to me. I agree that the previous is_zipfile check is too lenient. I'll follow up on jjolly's PR for any specific concerns I have with the implementation.
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2502, in test_execute_zip2
output = subprocess.check_output([self.exe_zip, sys.executable])
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 411, in check_outputreturn run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 489, in runwith Popen(*popenargs, **kwargs) as process:
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 845, in __init__self._execute_child(args, executable, preexec_fn, close_fds,
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 1689, in _execute_childraise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ziptestdata/exe_with_zip'
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2509, in test_execute_zip64
output = subprocess.check_output([self.exe_zip64, sys.executable])
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 411, in check_outputreturn run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 489, in runwith Popen(*popenargs, **kwargs) as process:
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 845, in __init__self._execute_child(args, executable, preexec_fn, close_fds,
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/subprocess.py", line 1689, in _execute_childraise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ziptestdata/exe_with_z64'
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2496, in test_read_zip64_with_exe_prependedself._test_zip_works(self.exe_zip64)
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2484, in _test_zip_worksself.assertTrue(zipfile.is_zipfile(name),
AssertionError: False is not true : is_zipfile failed on ziptestdata/exe_with_z64
Traceback (most recent call last):
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2493, in test_read_zip_with_exe_prependedself._test_zip_works(self.exe_zip)
File "/buildbot/buildarea/cpython/3.x.ware-gentoo-x86.installed/build/target/lib/python3.9/test/test_zipfile.py", line 2484, in _test_zip_worksself.assertTrue(zipfile.is_zipfile(name),
AssertionError: False is not true : is_zipfile failed on ziptestdata/exe_with_zip
The new ziptestdata/ subdir appears to not be part of the install that make install does. :/
It seems like Lib/test/eintrdata/ (for example) is installed using LIBSUBDIRS variable in Makefile.pre.in.
Note: The Windows installer copies recursively Lib/test/ and subdirectories: see <InstallFiles Include="$(PySourcePath)Lib\test\**\*" ...> in Tools/msi/test/test.wixproj.
Any chance #5053 could be merged? A user of ours has a TIFF file, for which is_zipfile returns True, but calling the ZipFile constructor results in BadZipFile. I probably can not give out the file, but here's an example run:
Python 3.12.3 (main, Apr 17 2024, 00:00:00) [GCC 14.0.1 20240411 (Red Hat 14.0.1-0)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zipfile.is_zipfile("testfile.tif")
True
>>> zipfile.ZipFile("testfile.tif")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.12/zipfile/__init__.py", line 1349, in __init__
self._RealGetContents()
File "/usr/lib64/python3.12/zipfile/__init__.py", line 1435, in _RealGetContents
raise BadZipFile("Bad offset for central directory")
zipfile.BadZipFile: Bad offset for central directory
@vpv-csc: test_ignores_newline_at_end() and test_ignores_stuff_appended_past_comments() of test_zipfile fail with attached isz_fail_fix.diff. Someone should design a fix which doesn't break the test suite, or the test suite should be updated as well.
)
bpo-28494: Improve zipfile.is_zipfile reliability
The zipfile.is_zipfile function would only search for the EndOfZipfile
section header. This failed to correctly identify non-zipfiles that
contained this header. Now the zipfile.is_zipfile function verifies
the first central directory entry.
Changes:
* Extended zipfile.is_zipfile to verify zipfile catalog
* Added tests to validate failure of binary non-zipfiles
* Reuse 'concat' handling for is_zipfile
Co-authored-by: John Jolly <john.jolly@gmail.com>
…ythonGH-134250)
bpo-28494: Improve zipfile.is_zipfile reliability
The zipfile.is_zipfile function would only search for the EndOfZipfile
section header. This failed to correctly identify non-zipfiles that
contained this header. Now the zipfile.is_zipfile function verifies
the first central directory entry.
Changes:
* Extended zipfile.is_zipfile to verify zipfile catalog
* Added tests to validate failure of binary non-zipfiles
* Reuse 'concat' handling for is_zipfile
(cherry picked from commit 1298511)
Co-authored-by: Tim Hatch <timhatch@netflix.com>
Co-authored-by: John Jolly <john.jolly@gmail.com>
…H-134250) (#134401)
gh-72680: Fix false positives when using zipfile.is_zipfile() (GH-134250)
bpo-28494: Improve zipfile.is_zipfile reliability
The zipfile.is_zipfile function would only search for the EndOfZipfile
section header. This failed to correctly identify non-zipfiles that
contained this header. Now the zipfile.is_zipfile function verifies
the first central directory entry.
Changes:
* Extended zipfile.is_zipfile to verify zipfile catalog
* Added tests to validate failure of binary non-zipfiles
* Reuse 'concat' handling for is_zipfile
(cherry picked from commit 1298511)
Co-authored-by: Tim Hatch <timhatch@netflix.com>
Co-authored-by: John Jolly <john.jolly@gmail.com>
lkollar
pushed a commit
to lkollar/cpython
that referenced
this issue
May 26, 2025
…ythonGH-134250)
bpo-28494: Improve zipfile.is_zipfile reliability
The zipfile.is_zipfile function would only search for the EndOfZipfile
section header. This failed to correctly identify non-zipfiles that
contained this header. Now the zipfile.is_zipfile function verifies
the first central directory entry.
Changes:
* Extended zipfile.is_zipfile to verify zipfile catalog
* Added tests to validate failure of binary non-zipfiles
* Reuse 'concat' handling for is_zipfile
Co-authored-by: John Jolly <john.jolly@gmail.com>
Uh oh!
There was an error while loading. Please reload this page.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered:
zipfile.is_zipfile has false positives way too easily.
I just have seen it in practive when a MoinMoin wiki site with a lot of pdf attachments crashed with 500. This was caused by a valid PDF that just happened to contain PK\005\006 somewhere in the middle - this was enough to satisfy is_zipfile() and triggered further processing as a zipfile, which then crashed with IOError (which was not catched in our code, yet).
I have looked into zipfile code: if the usual EOCD structure (with empty comment) is not at EOF, it is suspected that there might be a non-empty comment and ~64K before EOF are searched for the PK\005\006 magic. If it is somewhere there, it is assumed that the file is a zip, without any further validity check.
Attached is a failure demo that works with at least 2.7 and 3.5.
https://en.wikipedia.org/wiki/Zip_(file_format)
patch for py2.7
The EOCD structure is at EOF.
It either does not contain a comment (this is what the existing code checks first) or it contains a comment of the length that is specified in the structure.
The patch checks consistency specified length vs. real length (end of fixed part of structure up to EOF). If this does not match, it is likely not a zip file, but just a file that happens to have the magic 4 bytes somewhere in its last 64kB.
Note: checking the first bytes of the file (PK..) might be another option.
But this has the "problem" that a self-extracting zip starts with an executable that has different first bytes.
So whether this is an option or not depends on whether is_zipfile() should return truish for self-extracting ZIP files.
The problem is that the zipfile module supports even not well-formed archives, with a data appended past a comment, and with truncated comment. There are special tests for this, and the proposed patch breaks these tests: test_comments, test_ignores_newline_at_end, test_ignores_stuff_appended_past_comments. See bpo-10694 and bpo-1622.
Well, if you have a better idea how to fix is_zipfile, go on.
I even suggested an alternative, how about that?
It is a miserable state when the is_zipfile function in the stdlib detects random crap as a zip file.
No, checking the first bytes of the file is not appropriate option. zipfile should support the Python zip application format [1].
I see two options:
Make is_zipfile() more strict that the ZipFile constructor. The later supports ZIP files with a data past the comment or with truncated comments, but the former should reject them.
Make both is_zipfile() and the ZipFile constructor more robust. They should check not just the EOCD signature, but check the Zip64 end of central directory record (if exists) and the first central file header signature (if the ZIP file is not empty).
It may be that PDF files contain PK\005\006 not accidentally, but because they contain embedded ZIP files (I don't know if this is a case). In that circumstances is_zipfile() returning True is correct.
[1] https://docs.python.org/3/library/zipapp.html
Fix submitted that evaluates the ECD structure and validates the first CD entry. The fix also handles empty zipfiles.
IMO the purpose of this API is to *quickly* verify that the file is a valid zipfile. With this fix, the API only reads another 46 bytes of data (after a seek, of course). This should still qualify as "quick", especially after having potentially read 64k of data.
Perhaps a full zip validator would be appropriate in addition to is_zipfile. That would be more appropriate as a full feature rather than in this bugfix.
Is there any chance that this will make it into 3.7 or is my reminder too late?
it's a bugfix, it seems reasonable for 3.7 to me. I agree that the previous is_zipfile check is too lenient. I'll follow up on jjolly's PR for any specific concerns I have with the implementation.
New changeset 3f4db4a by T. Wouters (Gregory P. Smith) in branch 'master':
bpo-28494: Test existing zipfile working behavior. (GH-15853)
3f4db4a
x86 Gentoo Installed with X 3.x buildbot is unhappy:
https://buildbot.python.org/all/#/builders/103/builds/3051
======================================================================
ERROR: test_execute_zip2 (test.test_zipfile.TestExecutablePrependedZip)
----------------------------------------------------------------------
======================================================================
ERROR: test_execute_zip64 (test.test_zipfile.TestExecutablePrependedZip)
----------------------------------------------------------------------
======================================================================
FAIL: test_read_zip64_with_exe_prepended (test.test_zipfile.TestExecutablePrependedZip)
----------------------------------------------------------------------
======================================================================
FAIL: test_read_zip_with_exe_prepended (test.test_zipfile.TestExecutablePrependedZip)
----------------------------------------------------------------------
The new ziptestdata/ subdir appears to not be part of the install that make install does. :/
It seems like Lib/test/eintrdata/ (for example) is installed using LIBSUBDIRS variable in Makefile.pre.in.
Note: The Windows installer copies recursively Lib/test/ and subdirectories: see <InstallFiles Include="$(PySourcePath)Lib\test\**\*" ...> in Tools/msi/test/test.wixproj.
New changeset 74b0291 by Miss Islington (bot) in branch '3.8':
bpo-28494: Test existing zipfile working behavior. (GH-15853)
74b0291
New changeset c374474 by Gregory P. Smith in branch 'master':
bpo-28494: install ziptestdata to fix install bot (GH-15902)
c374474
New changeset 7acb22e by Miss Islington (bot) in branch '3.8':
bpo-28494: install ziptestdata to fix install bot (GH-15902)
7acb22e
Any chance #5053 could be merged? A user of ours has a TIFF file, for which
is_zipfile
returnsTrue
, but calling theZipFile
constructor results inBadZipFile
. I probably can not give out the file, but here's an example run:@vpv-csc: test_ignores_newline_at_end() and test_ignores_stuff_appended_past_comments() of test_zipfile fail with attached isz_fail_fix.diff. Someone should design a fix which doesn't break the test suite, or the test suite should be updated as well.
TPASPKT-1337 Add is_zipfile function to utils
94bc6af
gh-72680: Fix false positives when using zipfile.is_zipfile() (GH-134250
1298511
pythongh-72680: Fix false positives when using zipfile.is_zipfile() (p…
4f037a2
Thanks for the PR work @jjolly and @thatch! Finally merged.
[3.14] gh-72680: Fix false positives when using zipfile.is_zipfile() (G…
af428a3
pythongh-72680: Fix false positives when using zipfile.is_zipfile() (p…
c10c7fd