[core] Make renderers output files in deterministic order even when multithreaded #5593
Conversation
a lot of work - nicely done.
pmd-core/src/main/java/net/sourceforge/pmd/reporting/CloseHookFileListener.java
private int nextToOutput;
private final Object lock = new Object();
Suggested change:
-    private int nextToOutput;
-    private final Object lock = new Object();
+    private final Lock lock = new ReentrantLock();
+    private int nextToOutput;

- maybe use a different lock, as seen here: https://github.com/spring-projects/spring-framework/pull/34538/files (see the sketch below)
- non-final fields last
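For reference, a minimal sketch of what the suggested lock-based variant looks like in use, with the try/finally unlock that ReentrantLock requires. The surrounding class and method are made up for illustration; only the two fields come from the diff above.

    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    class BufferSketch {
        private final Lock lock = new ReentrantLock();
        private int nextToOutput;

        // Equivalent of a synchronized block, but with an explicit lock:
        // the unlock must go in a finally block so it also runs on exceptions.
        void advance() {
            lock.lock();
            try {
                nextToOutput++;
            } finally {
                lock.unlock();
            }
        }
    }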
Thanks for the suggestion. However, we currently don't use virtual threads; we use platform threads for executing the analysis tasks (see pmd/pmd-core/src/main/java/net/sourceforge/pmd/lang/impl/MultiThreadProcessor.java, line 32 in bc5be45):

executor = Executors.newFixedThreadPool(task.getThreadCount(), new PmdThreadFactory());

Changing the synchronized blocks to ReentrantLocks wouldn't help and looks like a premature optimization. Switching to virtual threads would also require us to move to Java 21... When we do this (I guess PMD will eventually evolve in that direction, but it will take time), we should do it in a separate PR and also take performance measurements to actually see whether we would benefit from virtual threads and/or reentrant locks or not. The synchronized block here is not the only one...
Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...
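Purely to illustrate that last remark: under Java 21, the Java-21 entry of a multi-release jar could create the executor with virtual threads instead of a fixed platform-thread pool. This is a hypothetical sketch, not something this PR does or plans; the PmdThreadFactory from the real code is omitted because virtual threads are created per task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ExecutorChoiceSketch {

        // What MultiThreadProcessor does today: a bounded pool of platform threads.
        static ExecutorService platformPool(int threadCount) {
            return Executors.newFixedThreadPool(threadCount);
        }

        // Hypothetical Java 21 variant: one virtual thread per submitted task.
        // Note there is no built-in bound on concurrency here, which is exactly
        // the concern raised further down in this thread.
        static ExecutorService virtualThreadPerTask() {
            return Executors.newVirtualThreadPerTaskExecutor();
        }
    }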
This might be helpful for doing the migration at a glance; then it would only have to be communicated and synced properly. Certainly, the CI is not included here, but code-wise this should be manageable, right? Or are there a lot of breaking changes that aren't covered by the rewrite?
Useful Resources:
If this truly is a Java virtual threads thing, then it should be covered by some kind of recipe, as that seems mandatory for migrating to the new concurrency goodness. Java 24 will be a tremendous performance improvement, and there the virtual threads feature will work properly, with no more thread pinning issues.
> Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...
I don't think PMD can benefit from virtual threads. Each parallel task (processing of a file) needs too much memory (to build the AST, e.g.), and the IO waiting is mainly waiting for disk loads, which is 1. fast, so there is not much time for context switching, and 2. limited in bandwidth, as in, if you open too many files for reading concurrently, you will overload the OS and the memory bus of your system. So there is no way around pooling a small number of threads and limiting the number of concurrent tasks, which is not the programming model of virtual threads. If you profile PMD in a multithreaded run, however, you will find that CPU utilization is very high. We already use all the parallelism we can, so there is also nothing to gain from virtual threads in terms of parallelism.
Virtual threads were just not designed for applications like ours. They were designed for things like web servers which might be waiting a really long time for a response from a subtask and have nothing to do in between, so you might as well process another request in the meantime. And each waiting subtask requires very little memory, so you can spawn millions if you want.
In summary, if we wanted to use the virtual threads programming model we would spawn one vthread per file to analyze. But that leads to terrible peak memory usage (as in, program crashes with OOM error), overloads OS resources, and also does not increase effective parallelism.
(I had actually tried using those with a Loom prototype, and was disappointed, until I figured out we are not the target audience)
What could potentially be beneficial for us is splitting IO operations off to a dedicated thread, so as to remove all contention on the disk. Thankfully for us, IO is actually the first step, so we could potentially move to a single thread reading each file into memory => submitting it to a work queue to actually parse and analyze the file.
How much of an impact this would make I'm unsure, especially as SSDs become the de-facto standard.
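A rough sketch of that idea with entirely made-up names: a single reader thread loads file contents into a bounded queue, and the worker pool only does the CPU-bound parsing and analysis. Error handling is simplified; this is not code from the PR.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class IoSplitSketch {

        static final class LoadedFile {
            final Path path;
            final String content;
            LoadedFile(Path path, String content) { this.path = path; this.content = content; }
        }

        void run(List<Path> files, int threadCount) throws InterruptedException {
            // Bounded queue so the reader cannot load arbitrarily many files ahead of the analyzers.
            BlockingQueue<LoadedFile> queue = new ArrayBlockingQueue<>(2 * threadCount);
            ExecutorService workers = Executors.newFixedThreadPool(threadCount);

            // Single dedicated IO thread: reads file contents into memory, one at a time.
            Thread reader = new Thread(() -> {
                for (Path file : files) {
                    try {
                        queue.put(new LoadedFile(file, Files.readString(file)));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e); // simplified error handling
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }, "io-reader");
            reader.start();

            // CPU-bound workers take loaded files off the queue; they never touch the disk.
            for (int i = 0; i < files.size(); i++) {
                LoadedFile loaded = queue.take();
                workers.submit(() -> analyze(loaded));
            }
            reader.join();
            workers.shutdown();
        }

        private void analyze(LoadedFile file) {
            // stand-in for parsing the source and running the rules
        }
    }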
Virtual threads probably won't help much due to the memory limitations mentioned earlier, so their impact may not be as significant as expected.
The guaranteed benefit is saving OS resources, which is not a top priority but still a win. That is virtual threads' primary promise, besides being seamless to integrate, which turned out not to hold in the first place. Thread-pinning issues are promised to be resolved in the upcoming Java 24 release, which was not a surprise for such a high-profile topic.
Virtual threads avoid heavy OS thread usage by multiplexing virtual threads onto a small pool of actual OS threads. In some edge cases on small systems, this might provide speed improvements since, without virtual threads, tasks would just wait when no native thread is available. Virtual threads could potentially release resources for other uses, though this is highly speculative.
Thanks!
pmd-core/src/main/java/net/sourceforge/pmd/reporting/ListenerInitializer.java
pmd-core/src/main/java/net/sourceforge/pmd/reporting/CloseHookFileListener.java
pmd-core/src/main/java/net/sourceforge/pmd/reporting/DeterministicOutputListenerWrapper.java
    tryToFlushBuffer();
} else {
    // if we shouldn't output this report yet, insert it in sorted order for later
    ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
It might be easier to use a TreeSet here, but I'm not sure about the memory impact compared to a LinkedList.
With a TreeSet, we could simply add the new ReportWrapper here, and later in tryToFlushBuffer() iterate through it.
But I'd rather not change it now; it could be a future improvement.
That's true, but I think with a linked list you can efficiently remove elements as you iterate through it with ListIterator. I don't think you can do that with a TreeSet. Not that it's a performance-critical section anyway... but as it's written it might be more efficient.
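To make the trade-off concrete, here is a stripped-down sketch of the two operations being discussed: sorted insertion with a ListIterator, and a flush that removes elements while iterating. ReportWrapper and its index field are simplified stand-ins, not the PR's actual types.

    import java.util.LinkedList;
    import java.util.ListIterator;

    class SortedBufferSketch {

        static final class ReportWrapper {
            final int index;       // position assigned to the file in submission order
            final Object report;   // stand-in for the buffered report data
            ReportWrapper(int index, Object report) { this.index = index; this.report = report; }
        }

        private final LinkedList<ReportWrapper> reportBuffer = new LinkedList<>();
        private int nextToOutput;

        // Insert the wrapper at its sorted position by walking the list with a ListIterator.
        void bufferReport(ReportWrapper wrapper) {
            ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
            while (iter.hasNext()) {
                if (iter.next().index > wrapper.index) {
                    iter.previous(); // step back so add() inserts before the larger element
                    break;
                }
            }
            iter.add(wrapper);
        }

        // Flush the longest prefix of consecutive indices; ListIterator.remove() lets us
        // drop each flushed element in O(1) while we iterate.
        void tryToFlushBuffer() {
            ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
            while (iter.hasNext()) {
                ReportWrapper next = iter.next();
                if (next.index != nextToOutput) {
                    break;
                }
                output(next);
                iter.remove();
                nextToOutput++;
            }
        }

        private void output(ReportWrapper wrapper) {
            // stand-in for handing the report to the renderer
        }
    }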
Describe the PR
When running PMD with multiple threads and using an AbstractIncrementingRenderer (such as text, xml, sarif, etc.), the order of the files in the report depends on thread scheduling. This makes reports non-identical from one run to the next when using multithreaded processing.
This PR fixes this issue by buffering reports for individual files and outputting them in a predetermined order. The buffer is flushed as soon as possible, not only at the end of the analysis. This keeps the buffering pretty lightweight and preserves the "incremental" nature of those renderers, meaning they output the report in real time instead of needing the entire analysis output before they start writing. In one run I analyzed 14000 classes with PMD and collected 700 violations; during the whole run the buffer size did not exceed 150 Report instances, which I consider a very good result.
Context: I think this is a prerequisite to enabling multithreading by default in the CLI. Note that it is already enabled by default in PMDConfiguration.
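To illustrate the mechanism, here is a toy model of the approach, not the PR's actual DeterministicOutputListenerWrapper (the real implementation keeps only a small buffer of pending reports instead of a slot per file): each file gets its index in submission order, workers finish in arbitrary order, and output happens strictly by index.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    class DeterministicOutputToyModel {

        private final List<String> slots;   // one slot per file, filled as reports arrive
        private int nextToOutput;

        DeterministicOutputToyModel(int fileCount) {
            slots = new ArrayList<>(fileCount);
            for (int i = 0; i < fileCount; i++) {
                slots.add(null);
            }
        }

        // Called from worker threads when a file's report is ready, in arbitrary order.
        synchronized void fileDone(int index, String report) {
            slots.set(index, report);
            // Flush every report that is now ready, strictly in index order.
            while (nextToOutput < slots.size() && slots.get(nextToOutput) != null) {
                System.out.println(slots.get(nextToOutput));
                nextToOutput++;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            int fileCount = 10;
            DeterministicOutputToyModel out = new DeterministicOutputToyModel(fileCount);
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < fileCount; i++) {
                final int index = i; // index is assigned in submission order, before analysis runs
                pool.submit(() -> {
                    try {
                        // simulate analysis taking a variable amount of time per file
                        TimeUnit.MILLISECONDS.sleep(ThreadLocalRandom.current().nextInt(50));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    out.fileDone(index, "report for file " + index);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // prints "report for file 0" through "report for file 9" in order every run
        }
    }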
Related issues
Ready?
./mvnw clean verify passes (checked automatically by GitHub Actions)