MNT Enforce ruff/Perflint rules (PERF) #30693
Conversation
Force-pushed 84fde55 → 82751c8
Thanks for the PR @DimitriPapadopoulos. These rules essentially make us use a list comprehension whenever possible, and use dict.keys() and dict.values() when we don't need both. I'm +1 to adopt it. I can imagine that in some contexts where performance doesn't matter, using a list comprehension to build a list may hurt readability, but looking at the diff here I didn't find it much less readable anywhere.

There is a CI failure though :/

Yes, it's still a draft.

I'm not super excited about the rewriting-as-list-comprehension part. Does anyone have a benchmark (even a microbenchmark) that shows how much this improves things? For me, the first few changes in this PR (I only looked at a few) read easier if you see the
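A quick microbenchmark along the lines being asked for (an illustrative sketch only; the function names and sizes are made up, not taken from the PR) could compare an append loop against the equivalent comprehension:

```python
import timeit

def build_with_append(n):
    # The pattern PERF401 flags: grow a list with repeated append calls.
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def build_with_comprehension(n):
    # The suggested rewrite: a single list comprehension.
    return [i * i for i in range(n)]

if __name__ == "__main__":
    n = 10_000
    print("append loop:  ", timeit.timeit(lambda: build_with_append(n), number=200))
    print("comprehension:", timeit.timeit(lambda: build_with_comprehension(n), number=200))
```

Both functions return the same list; any speed difference comes from avoiding the repeated `out.append` attribute lookup and method call per element.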
I understand you have doubts about these rules: the speed gains are reported to be ~10% in the documentation. I must admit I haven't seen the benchmarks, but then I have no reason to be suspicious: another rule, one that does have an adverse effect on performance, has been identified and

However, these rules are not only about speed. If you want, I can get rid of
Force-pushed cd423c7 → ef59a57
Seems like they're not all properly fixed:

```
output = [
>           super().transform(X[batch].toarray())
            for batch in gen_batches(
                n_samples, self.batch_size_, min_batch_size=self.n_components or 0
            )
        ]
E       TypeError: super(type, obj): obj must be an instance or subtype of type

.0 = <generator object gen_batches at 0x7fa3752b57b0>
X = <150x4 sparse array of type '<class 'numpy.float64'>'
    with 600 stored elements in List of Lists format>
__class__ = <class 'sklearn.decomposition._incremental_pca.IncrementalPCA'>
batch = slice(0, 50, None)

../1/s/sklearn/decomposition/_incremental_pca.py:414: TypeError
```
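For context (my reading of the traceback, not something stated in the thread): zero-argument `super()` inside a comprehension or generator expression runs in an implicit function whose first argument is the iterator (`.0`), so `super()` resolves the iterator, not `self`, as the instance — hence the "obj must be an instance or subtype of type" error on Python versions where comprehensions get their own frame. A minimal sketch of the usual workaround, with hypothetical class names:

```python
class Base:
    def transform(self, x):
        return x * 2

class Child(Base):
    def transform_batches(self, batches):
        # Workaround: bind the parent method once in the method body,
        # where self is the first argument, instead of calling
        # zero-argument super() inside the comprehension's implicit scope.
        parent_transform = super().transform
        return [parent_transform(b) for b in batches]
```

Binding `super().transform` outside the comprehension keeps the zero-argument form in a scope where it can find `self`.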
Force-pushed ef59a57 → d866609
The mypy error is unrelated as far as I can understand.

But it seems to be somehow triggered by the diff in this PR.
Wasn't it triggered before? In any case, I think I would need to silence mypy with a

Perhaps it's time to upgrade mypy from 1.9.0 to the current 1.15.0 (.pre-commit-config.yaml, line 19 at 239112a), but then I suspect the Azure pipelines already run the latest mypy (line 89 at 239112a).
Force-pushed 30f4d9b → 1d39546
I'll come back to fixing the skipped PERF401 fix later on, once ruff, black and mypy have been updated.

There are quite a lot of PRs in flight now that all change formatting and linting rules. How are we keeping track of which commits need to be excluded from

It seems like this is snowballing: what started as a few PRs adding more linting rules is now a whole zoo of PRs changing formatters, linting rules, versions of tools, etc., with diffs that are massive. However, it is unclear what the actual problem is that we are trying to solve. Having more linting rules enabled is not a goal; there has to be a reason for doing it. Similarly, someone somewhere on the internet saying they prefer list comprehensions everywhere is not a reason for changing it in scikit-learn.

Just as a point to consider: most of the code in scikit-learn was written and reviewed by probably half a century of collective Python programming experience. That is a lot of experience and knowledge.

In conclusion, I am not against having tools help keep our code nice (I love black and how consistently code is formatted, or how ruff takes care of sorting import statements), but I think we need to understand what the problem is that we are trying to solve. That is what is unclear to me right now.

I was planning on updating

EDIT: See #31026 for a draft PR, ignoring commits that have already been merged.
As for the problems my current PRs are trying to solve:

I guess item
Also, most of my current PRs don't do that. They fix the inconsistencies between pre-commit and all the different CI pipelines, fix leftovers from recent PRs such as #30895, etc. I feel it's unfair to categorize my recent PRs as "change formatting and linting rules". I suspect you haven't actually looked into them, but instead made your opinion based on a single PR.

I share the feeling that the value added by this particular PR is little. I don't think any of those changes can be considered a meaningful performance improvement (e.g. more than a 1% improvement of the time it takes to call the top-level public API), as a list iteration vs list comprehension is probably insignificant compared to the actual computation done in the numpy/scipy/cython function calls of those classes and functions. But as Loïc said, looking at the diff, I could not find a case where the new code is significantly less readable. Still, the new code sometimes feels a bit more complicated than the old code for no valid reason (e.g. the

So -0 on this particular PR. But I appreciate the efforts on keeping the tooling modern (e.g. to benefit from speed improvements of ruff vs black), more consistent between CI runners and local setups, and simpler to configure.
As I wrote, these are micro-optimisations. They don't have much impact compared to the actual computation.

In theory, formatters and linters are supposed to result in globally faster or more readable and maintainable code in the long term, despite the short-term annoyance of having to adapt to new rules and language features. That's of course not always the case in practice; see for example astral-sh/ruff#7871 about UP038, which eventually got deprecated. Also, depending on the software, performance might be improved only very marginally. Nevertheless, formatters and linters are supposed to automatically keep the codebase globally more predictable, current and readable in the long term. Perhaps it's best to embrace that trend, let them handle low-level stuff, and apply our creativity to higher-level tasks.

EDIT: Feel free to close this PR if needed 😄
Looking at the diff, I like what this rule is doing. So I'm +1 on this. And it doesn't really affect our daily work, and we have a contributor who's doing the work introducing them. So I don't see why we'd resist inclusion of these.
```python
dfs.extend(
    pd.DataFrame(data, columns=columns_names, copy=False)[columns_to_keep]
    for data in chunk_generator(arff_container["data"], chunksize)
)
```
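The same `list.extend`-with-a-generator shape can be sketched standalone (a hedged toy example, pandas-free; `chunk_generator` here is a simplified stand-in for the helper in the diff):

```python
def chunk_generator(seq, chunksize):
    # Yield successive slices of seq, chunksize items at a time.
    for i in range(0, len(seq), chunksize):
        yield seq[i:i + chunksize]

frames = []
# One extend call consumes the generator lazily: no intermediate list
# is materialized, and there are no per-item append calls.
frames.extend(sum(chunk) for chunk in chunk_generator(list(range(10)), 3))
```

This is the `list.extend` variant of PERF401: the transformed elements are produced by a generator expression and appended in a single call.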
Most of the changes in this PR are insignificant, but in places like this one it can actually add quite a bit of value.

Therefore I don't mind having these rules enabled. In reality, once this PR is merged, we won't have much of an issue with the rule itself, since it affects a tiny portion of the code we write, and very often for the better.
```diff
-for name, TreeEstimator in ALL_TREES.items():
+for TreeEstimator in ALL_TREES.values():
```
These are also a nice catch.
- PERF102: When using only the values of a dict, use the `values()` method
- PERF401: Use a list comprehension to create a transformed list / Use `list.extend` to create a transformed list
- PERF403: Use a dictionary comprehension instead of a for-loop

Rebased to resolve conflicts.
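Illustrative before/after sketches of the three rules above (toy data, not code from the PR):

```python
data = {"a": 1, "b": 2}

# PERF102 -- only the values are used, so iterate over values()
# instead of unpacking items() and discarding the key:
totals = [v + 1 for v in data.values()]

# PERF401 -- build a transformed list with a comprehension rather than
# an empty list plus repeated append calls:
squares = [v * v for v in data.values()]

# PERF403 -- fill a dict with a comprehension instead of a for-loop:
inverted = {v: k for k, v in data.items()}
```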
Force-pushed 1d39546 → 4872b71
Then do I close this PR?

Personally I would be in favor of closing this PR, indeed.

OK, let's close this one then.
What does this implement/fix? Explain your changes.
Enforce Perflint (PERF) rules.
Any other comments?
While these are micro-optimisations, the whole process is automated by the linter.
[doc skip]