8000 [core] Make renderers output files in deterministic order even when multithreaded by oowekyala · Pull Request #5593 · pmd/pmd · GitHub
[go: up one dir, main page]

Skip to content

[core] Make renderers output files in deterministic order even when multithreaded #5593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 27, 2025

Conversation

oowekyala
Copy link
Member

Describe the PR

When running PMD with multiple threads and using an AbstractIncrementingRenderer (such as text, xml, sarif, etc), the order of the files in the report depends on thread scheduling. This makes reports not identical from one run to the next when using multithreaded processing.

This PR fixes this issue by buffering reports for individual files and outputting them in a predetermined order. The buffer is flushed as soon as possible and not only at the end of the analysis. This makes the buffering pretty lightweight and preserves the "incremental" nature of those renderers, meaning they output the report in real time instead of needing the entire analysis output to start writing. In one run I ran PMD on 14000 classes and collected 700 violations. During the whole run the buffer size did not exceed 150 Report instances which I consider a very good result.

Context: this is I think a prerequisite to enabling multithreading by default in the CLI. Note that it is already enabled by default in PMDConfiguration.

Related issues

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

Copy link
github-actions bot commented Mar 13, 2025

Documentation Preview

Compared to main:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.

Regression Tester Report

Copy link
Contributor
@Pankraz76 Pankraz76 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of work - nicely done.

Comment on lines +48 to +49
private int nextToOutput;
private final Object lock = new Object();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private int nextToOutput;
private final Object lock = new Object();
private final Lock lock = new ReentrantLock();
private int nextToOutput;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. However, we currently don't use Virtual Threads, we use platform threads for executing the analysis tasks (

executor = Executors.newFixedThreadPool(task.getThreadCount(), new PmdThreadFactory());
)

Changing the synchronized blocks to ReentrantLocks wouldn't help and looks like a premature optimization. Switching to virtual threads also requires us to move to Java 21... When we do this (I guess, eventually PMD evolves in that way, but it will take time), we should do this in a separate PR and also do performance measures to actually see whether we would benefit from virtual threads and/or reentrant locks or not. The synchronized block here is not the only one...

Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be helpful to do the migration at a glance, and then only have to communicate and sync it properly. Certainly, the CI is not included here, but code-wise, this should be manageable, right? Or are there a lot of breaking changes that aren't covered by the rewrite?

Useful Resources:

Copy link
Contributor
@Pankraz76 Pankraz76 Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this truly is a Java virtual threads thing, then it should be covered by some kind of recipe, as this seems to be mandatory to migrate to the new concurrency goodness. Java 24 will be a tremendous performance improvement, and there the virtual threads future will work properly—no more thread pinning issues.

Copy link
Member Author
@oowekyala oowekyala Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...

@adangel

I don't think PMD can benefit from virtual threads. Each parallel task (processing of a file) needs too much memory (to build the AST eg), and the IO waiting is mainly waits from disk loads, which is 1. fast, so there is not much time for context switching and 2. limited in bandwidth, as in, if you open too many files for reading concurrently, you will overload the OS and the memory bus of your system. So there is no way around pooling a small number of threads and limiting the number of concurrent tasks, which is not the programming model of virtual threads. If you profile PMD in a multithreaded run, however, you will find that CPU utilization is very high. We already use all the parallelism we can, so there is also nothing we can gain from virtual threads in terms of parallelism.

Virtual threads were just not designed for applications like ours. They were designed for things like web servers which might be waiting a really long time for a response from a subtask and have nothing to do in between, so you might as well process another request in the meantime. And each waiting subtask requires very little memory, so you can spawn millions if you want.

In summary, if we wanted to use the virtual threads programming model we would spawn one vthread per file to analyze. But that leads to terrible peak memory usage (as in, program crashes with OOM error), overloads OS resources, and also does not increase effective parallelism.

(I had actually tried using those with a Loom prototype, and was disappointed, until I figured out we are not the target audience)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what could potentially be beneficial for us is actually splitting away IO operations to a dedicated thread so as to remove all contingency on the disk. Thankfully for us, IO is actually the first step, so we could potentially move to a single thread read a file into memory => submit to a working queue to actually parse and analyze the file.

How much of an impact this would make I'm unsure, specially as SSDs become the de-facto standard.

Copy link
Contributor
@Pankraz76 Pankraz76 Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Virtual threads probably won't help much due to the earlier mentioned memory limitations, so their impact may not be as significant as expected.

The guaranteed benefit is saving OS resources, which is not a top priority but still a win. This is virtual threads' primary promise besides being seamless to integrate, which turned out to be faulty in the first place. Thread-pinning issues are promised to be resolved in the upcoming Java 24 release. This was not a surprise for such a high-tier topic.

Virtual threads help avoid OS thread usage by pooling virtual threads into actual OS threads. In some edge cases on small systems, this might provide speed improvements since, without virtual threads, processes would just wait when no native thread is available. Virtual threads could potentially release resources for other uses, though this is highly speculative.

@adangel adangel self-requested a review March 25, 2025 19:25
@adangel adangel added the an:enhancement An improvement on existing features / rules label Mar 27, 2025
@adangel adangel added this to the 7.12.0 milestone Mar 27, 2025
Copy link
Member
@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +48 to +49
private int nextToOutput;
private final Object lock = new Object();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. However, we currently don't use Virtual Threads, we use platform threads for executing the analysis tasks (

executor = Executors.newFixedThreadPool(task.getThreadCount(), new PmdThreadFactory());
)

Changing the synchronized blocks to ReentrantLocks wouldn't help and looks like a premature optimization. Switching to virtual threads also requires us to move to Java 21... When we do this (I guess, eventually PMD evolves in that way, but it will take time), we should do this in a separate PR and also do performance measures to actually see whether we would benefit from virtual threads and/or reentrant locks or not. The synchronized block here is not the only one...

Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...

tryToFlushBuffer();
} else {
// if we shouldn't output this report yet, insert it in sorted order for later
ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to use a TreeSet here. But not sure about memory impact compared to a linkedlist.
With a treeset, we could simply add here the new reportwrapper.
And later in tryToFlushBuffer() iterate through it.

But I'd rather not change it now, it could be a future improvement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true but I think with a linked list you can efficiently remove elements as you iterate through it with ListIterator. I don't think you can do that with a TreeSet. Not that it's a performance critical section anyway... but as it's written it might be more efficient.

adangel added a commit that referenced this pull request Mar 27, 2025
adangel added a commit that referenced this pull request Mar 27, 2025
…ultithreaded (#5593)

Merge pull request #5593 from oowekyala:make-renderers-deterministic
@adangel adangel merged commit 1f37ab8 into pmd:main Mar 27, 2025
19 of 20 checks passed
@oowekyala oowekyala deleted the make-renderers-deterministic branch March 27, 2025 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
an:enhancement An improvement on existing features / rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0