8000 MNT Enforce ruff/Perflint rules (PERF) by DimitriPapadopoulos · Pull Request #30693 · scikit-learn/scikit-learn · GitHub

MNT Enforce ruff/Perflint rules (PERF) #30693


Closed

Conversation

DimitriPapadopoulos
Contributor
@DimitriPapadopoulos DimitriPapadopoulos commented Jan 21, 2025

What does this implement/fix? Explain your changes.

Enforce Perflint (PERF) rules.

Any other comments?

While these are micro-optimisations, the whole process is automated by the linter.

[doc skip]

github-actions bot commented Jan 21, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4872b71. Link to the linter CI: here

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 2 times, most recently from 84fde55 to 82751c8 on January 21, 2025 23:08
@DimitriPapadopoulos DimitriPapadopoulos marked this pull request as draft January 21, 2025 23:44
@DimitriPapadopoulos DimitriPapadopoulos changed the title from MAINT Enforce ruff/Perflint rules (PERF) to MNT Enforce ruff/Perflint rules (PERF) on Jan 21, 2025
@jeremiedbb
Member

Thanks for the PR @DimitriPapadopoulos. These rules essentially make us use list comprehensions whenever possible, and dict.keys() or dict.values() when we don't need both. I'm +1 on adopting them.

I can imagine that in some contexts where performance doesn't matter, using a list comprehension to build a list may hurt readability, but looking at the diff here I didn't find it much less readable anywhere.
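As an illustrative sketch (not code from this PR), the dict-iteration pattern the Perflint rules flag looks roughly like this:

```python
prices = {"apple": 3, "pear": 2}

# Flagged: .items() is used but the key is discarded.
total = 0
for _name, price in prices.items():
    total += price

# Preferred: iterate only over the values that are actually needed.
total = sum(prices.values())
```

The symmetric case (unpacking .items() but using only the keys) is rewritten to iterate over .keys() or the dict itself.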

@jeremiedbb
Member

There is a CI failure though :/

@DimitriPapadopoulos
Contributor Author

Yes, it's still a draft.

@betatim
Member
betatim commented Jan 29, 2025

I'm not super excited about the rewriting-as-list-comprehension part. Does anyone have a benchmark (even a microbenchmark) that shows how much this improves things? For me, the first few changes in this PR (I only looked at a few) read more easily as plain loops: seeing the for blah in foo first gives a bit of context before reading the loop body.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 11, 2025

I understand you have doubts about these rules:

The speed gains are reported to be ~10% in the documentation. I must admit I haven't seen the benchmarks, but then I have no reason to be suspicious: another rule, one that does have an adverse effect on performance, has been identified and has since been deprecated:

However, these rules are not only about speed:
pypa/setuptools#4449 (comment)

If you want, I can get rid of PERF401 and PERF403 for now, and keep only:
incorrect-dict-iterator (PERF102)

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 2 times, most recently from cd423c7 to ef59a57 on March 11, 2025 17:14
@DimitriPapadopoulos DimitriPapadopoulos marked this pull request as ready for review March 11, 2025 17:14
@adrinjalali
Member

Seems like they're not all properly fixed:

    output = [
>       super().transform(X[batch].toarray())
        for batch in gen_batches(
            n_samples, self.batch_size_, min_batch_size=self.n_components or 0
        )
    ]
E   TypeError: super(type, obj): obj must be an instance or subtype of type

.0         = <generator object gen_batches at 0x7fa3752b57b0>
X          = <150x4 sparse array of type '<class 'numpy.float64'>'
	with 600 stored elements in List of Lists format>
__class__  = <class 'sklearn.decomposition._incremental_pca.IncrementalPCA'>
batch      = slice(0, 50, None)

../1/s/sklearn/decomposition/_incremental_pca.py:414: TypeError
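For context, a minimal sketch (hypothetical Base/Child classes, not scikit-learn code) of why zero-argument super() can misbehave inside a comprehension, together with a workaround. The exact behavior varies with the Python version (list comprehensions were inlined into the enclosing scope in 3.12), so the sketch only exercises the safe pattern: bind the parent method once, in method scope, before looping.

```python
class Base:
    def transform(self, x):
        return x * 2


class Child(Base):
    def transform_batches(self, batches):
        # On some Python versions, calling super().transform(b) directly
        # inside the comprehension raises "TypeError: super(type, obj):
        # obj must be an instance or subtype of type", because the
        # comprehension runs in its own scope whose first argument is the
        # iterator (".0") rather than self.
        parent_transform = super().transform  # bind once, in method scope
        return [parent_transform(b) for b in batches]


print(Child().transform_batches([1, 2, 3]))  # [2, 4, 6]
```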

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 18, 2025

The mypy error is unrelated, as far as I can tell.

@adrinjalali
Member

But it's somehow triggered by the diff in this PR it seems.

@DimitriPapadopoulos
Contributor Author

Wasn't it triggered before? In any case, I think I would need to silence mypy with a # type: ignore here.

@DimitriPapadopoulos
Contributor Author

Perhaps it's time to upgrade mypy from 1.9.0 to the current 1.15.0:


but then I suspect the Azure pipelines already run the latest mypy:
"mypy>=1.9",

@DimitriPapadopoulos DimitriPapadopoulos force-pushed the PERF branch 3 times, most recently from 30f4d9b to 1d39546 on March 18, 2025 17:51
@DimitriPapadopoulos
Contributor Author

I'll come back to the skipped PERF401 fix later on, once ruff, black and mypy have been updated.

@betatim
Member
betatim commented Mar 19, 2025

There are quite a lot of PRs in flight now that all change formatting and linting rules. How are we keeping track of which commits need to be excluded from git blame? I saw Adrin commented on a previous PR that this needs doing, but I've not yet seen that follow up PR (or missed it).

It seems like this is snowballing, what started as a few PRs adding more linting rules is now a whole zoo of PRs changing formatters, linting rules, versions of tools, etc with diffs that are massive. However, it is unclear what the actual problem is that we are trying to solve. Having more linting rules enabled is not a goal, there has to be a reason for doing it. Similarly, someone somewhere on the internet saying they prefer list comprehensions everywhere is not a reason for changing it in scikit-learn. Just as a point to consider, most code that is in scikit-learn was written and reviewed by probably half a century of collective Python programming experience. That is a lot of experience and knowledge.

In conclusion, I am not against having tools help keep our code nice (I love black and how consistent code is formatted or how ruff takes care of sorting import statements) but I think we need to understand what the problem is that we are trying to solve. That is what is unclear to me right now.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

I was planning on updating .git-blame-ignore-revs once the dust has settled, to avoid too many PRs; there are already enough of them as it is. This was meant as a courtesy; it's what the maintainers of other projects prefer. Happy to adapt to local preferences.

EDIT: See #31026 for a draft PR, ignoring commits that have already been merged.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

As for the problems my current PRs are trying to solve:

  1. Updating the tools to recent versions from time to time is a good thing. You cannot stay forever on old versions that are going to become deprecated or are already slightly obsolete. That's a short-term goal.
  2. What's wrong with adding a few additional rule sets? Following your logic, "adding a linter is not a goal" and a linter is not needed anyway because of that "half a century of collective Python programming experience". New rules do find real issues from time to time (such as 5c69fac) and globally improve the code. I thought this could be seen as a short-term goal, but I might be wrong and can agree with that.
    I think one real problem is that there's currently no way to just add a linter rule (a small change that can be easily reviewed) and have a trusted process run the linter to automatically apply fixes, thus lowering the probability of malicious manual code changes hidden in what appear to be automatic changes. The few remaining manual changes could then be reviewed more easily. A workaround could be to have a trusted maintainer, if such a thing exists, apply the automatic linter fixes; linter errors remain possible, but at least the probability of malicious changes is considerably lower.
  3. I have been explicitly asked to change from black to ruff format.
  4. Some aspects of CI are currently a bit of a mess:
    • Tools such as black, ruff and mypy run in multiple pipelines. This just consumes CPU for no valid reason.
    • The selection of the version of the above tools is opaque and depends on each pipeline. This results in errors detected in some CI pipelines but not locally, or the reverse, and more generally inconsistencies. Using a single recent version of black, ruff and mypy is sufficient.
    • A typical example of the above is that the pre-commit configuration differs in terms of excluded files and versions of black, ruff and mypy. One possible solution would be to run these tools using pre-commit locally and pre-commit.ci in CI, but I am of course open to other ideas.
  5. I have been explicitly asked to fix pre-commit, which currently fails because of the above inconsistencies.
  6. ...

I guess item 4 warrants a dedicated issue or even a discussion, but I am not there yet. The other items look like short-term goals, and I was explicitly asked to fix some of them.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 19, 2025

There are quite a lot of PRs in flight now that all change formatting and linting rules.

Also, most of my current PRs don't do that. They fix the inconsistencies between pre-commit and all the different CI pipelines, fix leftovers from recent PRs such as #30895, etc. I feel it's unfair to categorize my recent PRs as "changing formatting and linting rules". I suspect you haven't actually looked into them, but instead formed your opinion based on a single PR.

@ogrisel
Member
ogrisel commented Mar 21, 2025

I share the feeling that the value added by this particular PR is small. I don't think any of these changes can be considered a meaningful performance improvement (e.g. more than a 1% improvement in the time it takes to call the top-level public API), as list iteration vs list comprehension is probably insignificant compared to the actual computation done in the numpy/scipy/Cython function calls of those classes and functions.

But as Loïc said, looking at the diff, I could not find a case where the new code is significantly less readable. Still, the new code sometimes feels a bit more complicated than the old code for no valid reason (e.g. the .update with a dict comprehension). In other cases, it does provide a slight readability improvement, but this is very subjective.

So -0 on this particular PR.

But I appreciate the efforts on keeping the tooling modern (e.g. to benefit from speed improvements of ruff vs black), more consistent between CI runners and local setup, and simpler to configure.

@DimitriPapadopoulos
Contributor Author
DimitriPapadopoulos commented Mar 21, 2025

As I wrote, these are micro-optimisations. They don't have much impact compared to the actual computation.

In theory, formatters and linters are supposed to result in globally faster or more readable and maintainable code in the long term, despite the short-term annoyance of having to adapt to new rules and language features. That's of course not always the case in practice, see for example astral-sh/ruff#7871 about UP038 which eventually got deprecated. Also, depending on the software, performance might be improved very marginally. Nevertheless, formatters and linters are supposed to automatically keep the codebase globally more predictable, current and readable in the long term. Perhaps it's best to embrace that trend, let them handle low-level stuff, and apply our creativity to higher-level tasks.

EDIT: Feel free to close this PR if needed 😄

Member
@adrinjalali adrinjalali left a comment


Looking at the diff, I like what this rule is doing. So I'm +1 on this. And it doesn't really affect our daily work, and we have a contributor who's doing the work introducing them. So I don't see why we'd resist inclusion of these.

Comment on lines +199 to +202
dfs.extend(
pd.DataFrame(data, columns=columns_names, copy=False)[columns_to_keep]
for data in chunk_generator(arff_container["data"], chunksize)
)
Member


most of the changes in this PR are insignificant. But in places like this one, it can actually add quite a bit of value.

Therefore I don't mind having these rules enabled. In reality, once this PR is merged, we won't have much of an issue with the rule itself, since it affects a tiny portion of the code we write, and very often for the better.

- for name, TreeEstimator in ALL_TREES.items():
+ for TreeEstimator in ALL_TREES.values():
Member


these are also a nice catch.

PERF102 When using only the values of a dict use the `values()` method
PERF401 Use a list comprehension to create a transformed list
PERF401 Use `list.extend` to create a transformed list
PERF403 Use a dictionary comprehension instead of a for-loop
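As rough before/after sketches (illustrative only, not lines from this diff) of the PERF401 and PERF403 rewrites listed above:

```python
data = {"a": 1, "b": 2, "c": 3}

# PERF401: building a new list with append in a loop ...
doubled = []
for value in data.values():
    doubled.append(value * 2)
# ... becomes a list comprehension:
doubled = [value * 2 for value in data.values()]

# PERF401 (extend variant): appending to an existing list in a loop
# becomes a single extend() with a generator expression:
results = [0]
results.extend(value * 2 for value in data.values())

# PERF403: filling a dict in a loop ...
inverted = {}
for key, value in data.items():
    inverted[value] = key
# ... becomes a dict comprehension:
inverted = {value: key for key, value in data.items()}
```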
@DimitriPapadopoulos
Contributor Author

Rebased to resolve conflicts.

@lesteve
Member
lesteve commented Mar 24, 2025

Same feeling on this one as @betatim and @ogrisel. IMO, this is typically the kind of PR that unfortunately doesn't bring enough value to justify the review time.

@DimitriPapadopoulos
Contributor Author

Then do I close this PR?

@lesteve
Member
lesteve commented Mar 24, 2025

Then do I close this PR?

Personally I would be in favor of closing this PR indeed.

@adrinjalali
Member

Ok, let's close this one then.
