8000 MNT Enforce ruff/Perflint rules (PERF) by DimitriPapadopoulos · Pull Request #30693 · scikit-learn/scikit-learn · GitHub

MNT Enforce ruff/Perflint rules (PERF) #30693


Closed

Conversation

DimitriPapadopoulos
Contributor
@DimitriPapadopoulos DimitriPapadopoulos commented Jan 21, 2025

What does this implement/fix? Explain your changes.

Enforce Perflint (PERF) rules.

Any other comments?

While these are micro-optimisations, the whole process is automated by the linter.

[doc skip]

github-actions bot commented Jan 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4872b71. Link to the linter CI: here

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 2 times, most recently from 84fde55 to 82751c8 on January 21, 2025 23:08
@DimitriPapadopoulos DimitriPapadopoulos marked this pull request as draft January 21, 2025 23:44
@DimitriPapadopoulos DimitriPapadopoulos changed the title from MAINT Enforce ruff/Perflint rules (PERF) to MNT Enforce ruff/Perflint rules (PERF) on Jan 21, 2025
@jeremiedbb
Member

Thanks for the PR @DimitriPapadopoulos. These rules essentially make us use list comprehensions whenever possible, and dict.keys() or dict.values() when we don't need both. I'm +1 on adopting them.

I can imagine that in some contexts where performance doesn't matter, using a list comprehension to build a list may hurt readability, but looking at the diff here I didn't find it much less readable anywhere.
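As an illustrative sketch (not code from this PR), the dict-iteration pattern the Perflint rules flag looks roughly like this:

```python
prices = {"apple": 3, "pear": 2}

# Flagged: .items() is used but the key is discarded.
total = 0
for _name, price in prices.items():
    total += price

# Preferred: iterate only over the values that are actually needed.
total = sum(prices.values())
```

The symmetric case (unpacking .items() but using only the keys) is rewritten to iterate over .keys() or the dict itself.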

@jeremiedbb
Member

There is a CI failure though :/

@DimitriPapadopoulos
Contributor Author

Yes, it's still a draft.

@betatim
Member
betatim commented Jan 29, 2025

I'm not super excited about the rewriting-as-list-comprehension part. Does anyone have a benchmark (even a microbenchmark) that shows how much this improves things? For me, the first few changes in this PR (I only looked at a few) read more easily as plain loops: seeing the for blah in foo first gives a bit of context before reading the loop body.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 11, 2025

I understand you have doubts about these rules:

The speed gains are reported to be ~10% in the documentation. I must admit I haven't seen the benchmarks, but then I have no reason to be suspicious: another rule, one that does have an adverse effect on performance, has been identified and has since been deprecated:

However, these rules are not only about speed:
pypa/setuptools#4449 (comment)

If you want, I can get rid of PERF401 and PERF403 for now, and keep only:
incorrect-dict-iterator (PERF102)

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 2 times, most recently from cd423c7 to ef59a57 on March 11, 2025 17:14
@DimitriPapadopoulos DimitriPapadopoulos marked this pull request as ready for review March 11, 2025 17:14
@adrinjalali
Member

Seems like they're not all properly fixed:

    output = [
>       super().transform(X[batch].toarray())
        for batch in gen_batches(
            n_samples, self.batch_size_, min_batch_size=self.n_components or 0
        )
    ]
E   TypeError: super(type, obj): obj must be an instance or subtype of type

.0         = <generator object gen_batches at 0x7fa3752b57b0>
X          = <150x4 sparse array of type '<class 'numpy.float64'>'
	with 600 stored elements in List of Lists format>
__class__  = <class 'sklearn.decomposition._incremental_pca.IncrementalPCA'>
batch      = slice(0, 50, None)

../1/s/sklearn/decomposition/_incremental_pca.py:414: TypeError
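For context, a minimal sketch (hypothetical Base/Child classes, not scikit-learn code) of why zero-argument super() can misbehave inside a comprehension, together with a workaround. The exact behavior varies with the Python version (list comprehensions were inlined into the enclosing scope in 3.12), so the sketch only exercises the safe pattern: bind the parent method once, in method scope, before looping.

```python
class Base:
    def transform(self, x):
        return x * 2


class Child(Base):
    def transform_batches(self, batches):
        # On some Python versions, calling super().transform(b) directly
        # inside the comprehension raises "TypeError: super(type, obj):
        # obj must be an instance or subtype of type", because the
        # comprehension runs in its own scope whose first argument is the
        # iterator (".0") rather than self.
        parent_transform = super().transform  # bind once, in method scope
        return [parent_transform(b) for b in batches]


print(Child().transform_batches([1, 2, 3]))  # [2, 4, 6]
```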

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 18, 2025

The mypy error is unrelated, as far as I can tell.

@adrinjalali
Member

But it's somehow triggered by the diff in this PR it seems.

@DimitriPapadopoulos
Contributor Author

Wasn't it triggered before? In any case, I think I would need to silence mypy with a # type: ignore here.

@DimitriPapadopoulos
Contributor Author

Perhaps it's time to upgrade mypy from 1.9.0 to the current 1.15.0:


but then I suspect the Azure pipelines already run the latest mypy:
"mypy>=1.9",

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 3 times, most recently from 30f4d9b to 1d39546 on March 18, 2025 17:51
@DimitriPapadopoulos
Contributor Author

I'll come back to the skipped PERF401 fix later on, once ruff, black and mypy have been updated.

@betatim
Member
betatim commented Mar 19, 2025

There are quite a lot of PRs in flight now that all change formatting and linting rules. How are we keeping track of which commits need to be excluded from git blame? I saw Adrin commented on a previous PR that this needs doing, but I've not yet seen that follow up PR (or missed it).

It seems like this is snowballing, what started as a few PRs adding more linting rules is now a whole zoo of PRs changing formatters, linting rules, versions of tools, etc with diffs that are massive. However, it is unclear what the actual problem is that we are trying to solve. Having more linting rules enabled is not a goal, there has to be a reason for doing it. Similarly, someone somewhere on the internet saying they prefer list comprehensions everywhere is not a reason for changing it in scikit-learn. Just as a point to consider, most code that is in scikit-learn was written and reviewed by probably half a century of collective Python programming experience. That is a lot of experience and knowledge.

In conclusion, I am not against having tools help keep our code nice (I love black and how consistent code is formatted or how ruff takes care of sorting import statements) but I think we need to understand what the problem is that we are trying to solve. That is what is unclear to me right now.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

I was planning on updating .git-blame-ignore-revs once the dust has settled, to avoid too many PRs; there are already enough of them as it is. This was meant as a courtesy; it's what the maintainers of other projects prefer. Happy to adapt to local preferences.

EDIT: See #31026 for a draft PR, ignoring commits that have already been merged.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

As for the problems my current PRs are trying to solve:

  1. Updating the tools to recent versions from time to time is a good thing. You cannot stay forever on old versions that are going to become deprecated or are already slightly obsolete. That's a short-term goal.
  2. What's wrong with adding a few additional rule sets? Following your logic, "adding a linter is not a goal" and a linter is not needed anyway because of that "half a century of collective Python programming experience". New rules do find real issues from time to time (such as 5c69fac) and globally improve the code. I thought this could be seen as a short-term goal, but I might be wrong and can agree with that.
    I think one real problem is that there's currently no way to just add a linter rule (a small change that can be easily reviewed) and have a trusted process run the linter to automatically apply fixes, thus lowering the probability of malicious manual code changes hidden in what appear to be automatic changes. The few remaining manual changes could then be reviewed more easily. A workaround could be to have a trusted maintainer, if such a thing exists, apply the automatic linter fixes; linter errors remain possible, but at least the probability of malicious changes is considerably lower.
  3. I have been explicitly asked to change from black to ruff format.
  4. Some aspects of CI are currently a bit of a mess:
    • Tools such as black, ruff and mypy run in multiple pipelines. This just consumes CPU for no valid reason.
    • The selection of the version of the above tools is opaque and depends on each pipeline. This results in errors detected in some CI pipelines but not locally, or the reverse, and more generally inconsistencies. Using a single recent version of black, ruff and mypy is sufficient.
    • A typical example of the above is that the pre-commit configuration differs in terms of excluded files and versions of black, ruff and mypy. One possible solution would be to run these tools using pre-commit locally and pre-commit.ci in CI, but I am of course open to other ideas.
  5. I have been explicitly asked to fix pre-commit, which currently fails because of the above inconsistencies.
  6. ...

I guess item 4 warrants a dedicated issue or even a discussion, but I am not there yet. The other items look like short-term goals, and I was explicitly asked to fix some of them.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

There are quite a lot of PRs in flight now that all change formatting and linting rules.

Also, most of my current PRs don't do that. They fix the inconsistencies between pre-commit and all the different CI pipelines, fix leftovers from recent PRs such as #30895, etc. I feel it's unfair to categorize my recent PRs as "changing formatting and linting rules". I suspect you haven't actually looked into them, but instead formed your opinion based on a single PR.

@ogrisel
Member
ogrisel commented Mar 21, 2025

I share the feeling that the value added by this particular PR is small. I don't think any of these changes can be considered a meaningful performance improvement (e.g. more than a 1% improvement in the time it takes to call the top-level public API), as list iteration vs list comprehension is probably insignificant compared to the actual computation done in the numpy/scipy/Cython function calls of those classes and functions.

But as Loïc said, looking at the diff, I could not find a case where the new code is significantly less readable. Still, the new code sometimes feels a bit more complicated than the old code for no valid reason (e.g. the .update with a dict comprehension). In other cases, it does provide a slight readability improvement, but this is very subjective.

So -0 on this particular PR.

But I appreciate the efforts on keeping the tooling modern (e.g. to benefit from speed improvements of ruff vs black), more consistent between CI runners and local setup, and simpler to configure.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 21, 2025

As I wrote, these are micro-optimisations. They don't have much impact compared to the actual computation.

In theory, formatters and linters are supposed to result in globally faster or more readable and maintainable code in the long term, despite the short-term annoyance of having to adapt to new rules and language features. That's of course not always the case in practice, see for example astral-sh/ruff#7871 about UP038 which eventually got deprecated. Also, depending on the software, performance might be improved very marginally. Nevertheless, formatters and linters are supposed to automatically keep the codebase globally more predictable, current and readable in the long term. Perhaps it's best to embrace that trend, let them handle low-level stuff, and apply our creativity to higher-level tasks.

EDIT: Feel free to close this PR if needed 😄

Member
@adrinjalali adrinjalali left a comment


Looking at the diff, I like what this rule is doing. So I'm +1 on this. And it doesn't really affect our daily work, and we have a contributor who's doing the work introducing them. So I don't see why we'd resist inclusion of these.

Comment on lines +199 to +202
dfs.extend(
pd.DataFrame(data, columns=columns_names, copy=False)[columns_to_keep]
for data in chunk_generator(arff_container["data"], chunksize)
)
Member


most of the changes in this PR are insignificant. But in places like this one, it can actually add quite a bit of value.

Therefore I don't mind having these rules enabled. In reality, once this PR is merged, we won't have much of an issue with the rule itself, since it affects a tiny portion of the code we write, and very often for the better.

- for name, TreeEstimator in ALL_TREES.items():
+ for TreeEstimator in ALL_TREES.values():
Member


these are also a nice catch.

PERF102 When using only the values of a dict use the `values()` method
PERF401 Use a list comprehension to create a transformed list
PERF401 Use `list.extend` to create a transformed list
PERF403 Use a dictionary comprehension instead of a for-loop
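As rough before/after sketches (illustrative only, not lines from this diff) of the PERF401 and PERF403 rewrites listed above:

```python
data = {"a": 1, "b": 2, "c": 3}

# PERF401: building a new list with append in a loop ...
doubled = []
for value in data.values():
    doubled.append(value * 2)
# ... becomes a list comprehension:
doubled = [value * 2 for value in data.values()]

# PERF401 (extend variant): appending to an existing list in a loop
# becomes a single extend() with a generator expression:
results = [0]
results.extend(value * 2 for value in data.values())

# PERF403: filling a dict in a loop ...
inverted = {}
for key, value in data.items():
    inverted[value] = key
# ... becomes a dict comprehension:
inverted = {value: key for key, value in data.items()}
```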
@DimitriPapadopoulos
Contributor Author

Rebased to resolve conflicts.

@lesteve
Member
lesteve commented Mar 24, 2025

Same feeling on this one as @betatim and @ogrisel. IMO, this is typically the kind of PR that unfortunately doesn't bring enough value to justify the review time.

@DimitriPapadopoulos
Contributor Author

Then do I close this PR?

@lesteve
Member
lesteve commented Mar 24, 2025

Then do I close this PR?

Personally I would be in favor of closing this PR indeed.

@adrinjalali
Member

Ok, let's close this one then.
