8000 bpo-37935: Improve performance of pathlib.scandir() by ShaiAvr · Pull Request #15331 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

bpo-37935: Improve performance of pathlib.scandir() #15331

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

bpo-37935: Improve performance of pathlib.scandir() #15331

wants to merge 1 commit into from

Conversation

ShaiAvr
Copy link
@ShaiAvr ShaiAvr commented Aug 19, 2019

I recently have taken a look in the source code of the pathlib module, and I saw something weird there:

when the module used the scandir function, it first converted its iterator into a list and then used it in a for loop. The list wasn't used anywhere else, so I think the conversion to list is just a waste of performance.

In addition, I noticed that the scandir iterator is never closed (it's not used in a with statement and its close method isn't called). I know that the iterator is closed automatically when it's garbaged collected, but according to the docs, it's advisable to close it explicitly.

So, I changed the code so that, scandir is used in a with statement and its iterator is used directly in the for loop instead of being converted to list first, which should enhance the performance.

https://bugs.python.org/issue37935

When the scandir function is used in the pathlib module, it's first converted into a list and then
used in a for loop. The conversion to list is unnecessary since the list isn't used except for the iteration
in the loop, so it's a waste of performance. I got rid of the conversions to list and used the scandir iterator
directly in the loop. In addition, I wrapped the use of scandir in the with statement to close its resources properly.
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

Copy link
Contributor
@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @ShaiAvr and welcome!

I have a few recommendations for you:

  1. Create an account on https://bugs.python.org/ and sign the CLA. Usually it takes around 24 hours from the time it's submitted, and it may have to be checked manually for the status to be properly updated on GitHub. This must be done for any contribution to Python for legal purposes.

  2. Since this is a non-trivial change, open an issue on https://bugs.python.org/. This acts as an effective way to bring attention to the topic, and for core developers that are knowledgeable with pathlib to provide feedback. After the issue is created, you can attach the PR to it by changing the PR title to bpo-<issue-number>: <PR title>.

  3. Adjust the PR title to Improve performance of pathlib.scandir(). This helps the relevant core developers find the PR more easily and quickly identify what it's addressing.

  4. A news entry can also be submitted via blurb-it. This is usually a 1-2 sentence summary of what the PR changes. Optionally at the end, you can include "Patch by <name>". It's especially recommended for any PRs involving code changes (or adding new sections to docs).

Let me know if you have any questions, and I'll do my best to address them. (:

@aeros aeros added the performance Performance or resource usage label Aug 21, 2019
@ShaiAvr
Copy link
Author
ShaiAvr commented Aug 23, 2019

@aeros167. Thanks for your comment. It's the first time I do something like this and I appreciate your help.

I have created an account and signed the CLA like you said and I'm now working on creating the issue.
I have a question about the issue: which version of python I should choose when I create the issue?

This issue isn't specific for any version of python (any version that supports pathlib, at least) so should I use 3.7 because it's the latest release, 3.9 or something else?

Thank you for your help. I really appreciate it.

@aeros
Copy link
Contributor
aeros commented Aug 24, 2019

I have a question about the issue: which version of python I should choose when I create the issue?

Since this is a performance enhancement, I would start with version 3.9 and see what the others think about backporting the PR the previous branches. Most changes are applied to the latest development version (rather than the latest release). If the PR is approved and there's no significant conflicts, there's a decent chance that the backport to 3.8 will be approved.

However, it will likely not be backported to 3.7, unless the release manager decides otherwise. For that branch, it's pretty much only documentation corrections and security fixes.

Thank you for your help. I really appreciate it.

No problem, let me know if you have any other questions. (:

We'll have to wait for a core developer to review the PR before we can move forward with deciding whether or not it should be merged, but I can help with most general questions or anything process related. Also, the "Lifecycle of a Pull Pequest" section of the devguide might be able to provide some useful information.

@ShaiAvr ShaiAvr changed the title Improve performance of scandir by not converting it to list bpo-37935: Improve performance of pathlib.scandir() Aug 24, 2019
@gpshead
Copy link
Member
gpshead commented Sep 10, 2019

The user has signed the CLA: But they need to update their bugs.python.org "Your Details" page to list their ShaiAvr github user name for our bot to understand that. https://bugs.python.org/user32163

@gpshead
Copy link
Member
gpshead commented Sep 10, 2019

I'm considering this a performance bugfix, thus both tags. it is always an antipattern to call list(scandir(...)) as it defeats one of the main purposes of scandir: limiting memory use when reading huge directories.

@gpshead gpshead self-assigned this Sep 10, 2019
@gpshead
Copy link
Member
gpshead commented Sep 12, 2019

Can you update your branch with the latest changes from master?

@gpshead gpshead removed needs backport to 3.8 type-bug An unexpected behavior, bug, or error labels Sep 12, 2019
@csabella
Copy link
Contributor
< 8000 /div>

@ShaiAvr, please address the review comments. Thank you!

@csabella
Copy link
Contributor

I'm going to close this as the pull request appears to have been abandoned. It can be reopened or a new pull request can be opened to replace it. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0