gh-86768: check if fd is seekable in os.lseek on Windows by aisk · Pull Request #133137 · python/cpython · GitHub
gh-86768: check if fd is seekable in os.lseek on Windows #133137

Open
wants to merge 8 commits into
base: main
from

Conversation

aisk
Contributor
@aisk aisk commented Apr 29, 2025

This change will introduce a performance regression:

import os
import timeit
f = open("a.py")
print(timeit.timeit("os.lseek(f.fileno(), 1, os.SEEK_CUR)", number=100000000, globals=globals()))
f.close()

Before the change:

PS C:\Users\xxxxx\Source\cpython> .\python.bat a.py
Running Release|x64 interpreter...
93.8409629999951

After the change:

PS C:\Users\xxxxx\Source\cpython> .\python.bat a.py
Running Release|x64 interpreter...
123.18093929998577

However, I think it's acceptable because we added a check in the implementation, and os.lseek usually won't be called that many times in the real world.

@aisk aisk marked this pull request as ready for review April 29, 2025 16:01
Comment on lines 11446 to 11454
if (result >= 0) {
    LARGE_INTEGER distance, newdistance;
    distance.QuadPart = position;
    if (SetFilePointerEx(h, distance, &newdistance, how)) {
        result = newdistance.QuadPart;
    } else {
        result = -1;
    }
}
Member

Why not simply use _lseeki64() after checking GetFileType()? It looks much simpler.

Contributor Author

I thought the Windows CRT calls _get_osfhandle internally to convert the POSIX fd to a Windows handle, which we have already done, so using SetFilePointerEx with the handle directly would give some performance benefit. But I ran the small benchmark script mentioned at the top of this PR and didn't see any noticeable performance difference, so I updated the code to use _lseeki64 directly.

Comment on lines 11462 to 11464
if (errno == 0) {
    errno = winerror_to_errno(GetLastError());
}
Member

Why not use PyErr_SetFromWindowsErr(0)?

Contributor Author

SetFilePointerEx reports errors through GetLastError rather than errno, so we had to check it. But after changing the code to call _lseeki64 directly, this line is no longer needed.

@aisk aisk changed the title gh-86768: implement os.lseek with SetFilePointer on Windows gh-86768: check if fd is seekable in os.lseek on Windows May 10, 2025
Member
@serhiy-storchaka serhiy-storchaka left a comment

It may be slightly simpler if result is set to -1 initially.

@@ -219,6 +219,7 @@
# if defined(MS_WINDOWS_DESKTOP) || defined(MS_WINDOWS_SYSTEM)
# define HAVE_SYMLINK
# endif /* MS_WINDOWS_DESKTOP | MS_WINDOWS_SYSTEM */
extern int winerror_to_errno(int);
Member

No longer used.

Member
@serhiy-storchaka serhiy-storchaka left a comment

LGTM. 👍

@morotti
Contributor
morotti commented Jun 27, 2025

Hello,

Is this the function that is called when doing file.seek() on a file opened via with open("myfile.zip", "r") as f?

If yes, I am somewhat concerned about the performance impact of adding a syscall on every call. Is it really needed?
The PR doesn't explain why the check is needed. What happened before, when seekability was not explicitly checked and the file was not seekable?

seek is very heavily used by applications and it is performance sensitive.
For example, when you install a package with pip install, the Python package is merely a zip file to extract.
The zip file can contain thousands of files, and the extraction does multiple seek+read operations for each file to extract, to locate (multiple) headers and the content.
I wonder if these calls to seek go to os.lseek?
https://github.com/python/cpython/blob/3.13/Lib/zipfile/__init__.py
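
One way to see what the seekability check is about is to probe a pipe and a regular file with a no-op os.lseek call. The sketch below is only illustrative: fd_is_seekable is a hypothetical helper, not part of the PR. On POSIX the pipe probe fails with an OSError; on a Windows build without this change the pipe may be reported as seekable, which is the behavior this PR sets out to fix.

```python
import os
import tempfile


def fd_is_seekable(fd):
    """Hypothetical helper: probe seekability with a seek that moves nothing."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return False
    return True


# A regular temporary file is expected to be seekable.
with tempfile.TemporaryFile() as regular:
    file_result = fd_is_seekable(regular.fileno())

# A pipe is not seekable; POSIX reports this as an OSError from lseek.
read_end, write_end = os.pipe()
try:
    pipe_result = fd_is_seekable(read_end)
finally:
    os.close(read_end)
    os.close(write_end)

print("regular file seekable:", file_result)
print("pipe seekable:", pipe_result)
```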

@serhiy-storchaka
Member

For example, you can write a ZIP file to a non-seekable file. If the output file is seekable, ZipFile seeks back to the local file header to write the compressed and uncompressed file sizes after writing the file data. If the output file is not seekable, ZipFile marks them as "undefined" in the local file header and uses another way to write this metainformation. If seekable() falsely returns True and seek() has no effect on a pipe, ZipFile will produce a corrupted ZIP file.
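
The two code paths described above can be observed from Python. The following is a minimal sketch, assuming a hypothetical NonSeekableWriter class that stands in for a pipe. It writes the same member to a seekable and a non-seekable target and then inspects bit 3 of the general-purpose flags in the first local file header, which the ZIP format uses to signal that sizes are stored after the data instead of in the header.

```python
import io
import zipfile


class NonSeekableWriter(io.RawIOBase):
    """Hypothetical pipe-like stream: writable, but seek() and tell() fail."""

    def __init__(self):
        super().__init__()
        self.data = bytearray()

    def writable(self):
        return True

    def seekable(self):
        return False

    def write(self, chunk):
        self.data += chunk
        return len(chunk)


def first_local_header_flags(zip_bytes):
    # Bytes 6-7 of a local file header hold the general-purpose bit flags.
    return int.from_bytes(zip_bytes[6:8], "little")


seekable_target = io.BytesIO()
with zipfile.ZipFile(seekable_target, "w") as zf:
    zf.writestr("a.txt", b"hello")

pipe_like_target = NonSeekableWriter()
with zipfile.ZipFile(pipe_like_target, "w") as zf:
    zf.writestr("a.txt", b"hello")

DATA_DESCRIPTOR_BIT = 0x08  # bit 3: sizes follow the file data
seekable_flags = first_local_header_flags(seekable_target.getvalue())
pipe_like_flags = first_local_header_flags(bytes(pipe_like_target.data))

# The archive written without seeking should still be readable.
round_trip = zipfile.ZipFile(io.BytesIO(bytes(pipe_like_target.data))).read("a.txt")
```

If the target claimed to be seekable while its seek() silently did nothing, ZipFile would take the first path and the sizes in the local header would never be filled in, which is the corruption scenario described above.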

It is unfortunate that this change will add 0.3 microseconds to each seek() (that is 0.3 milliseconds for a thousand files), but we have no other way.
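
The 0.3 microsecond figure is consistent with the benchmark timings reported at the top of this PR. As a rough check, the sketch below recomputes the per-call overhead from those two numbers; the variable names are illustrative, and the result applies only to that one machine and run.

```python
# Timings copied from the benchmark output at the top of this PR.
before_seconds = 93.8409629999951
after_seconds = 123.18093929998577
calls = 100_000_000  # the number= argument passed to timeit

# Extra time per os.lseek call, in microseconds (roughly 0.29).
overhead_us = (after_seconds - before_seconds) / calls * 1e6

# Extra time for a thousand calls, in milliseconds.
overhead_ms_per_thousand = overhead_us * 1000 / 1000

# Relative slowdown of the whole benchmark loop, in percent.
slowdown_percent = (after_seconds / before_seconds - 1) * 100

print(f"{overhead_us:.2f} microseconds per call")
print(f"{overhead_ms_per_thousand:.2f} milliseconds per thousand calls")
print(f"{slowdown_percent:.0f}% slower in this microbenchmark")
```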

3 participants