[core] Make renderers output files in deterministic order even when multithreaded #5593
Conversation
a lot of work - nicely done.
pmd-core/src/main/java/net/sourceforge/pmd/reporting/CloseHookFileListener.java
private int nextToOutput;
private final Object lock = new Object();
Suggested change:
-    private int nextToOutput;
-    private final Object lock = new Object();
+    private final Lock lock = new ReentrantLock();
+    private int nextToOutput;

- maybe use a different lock, as seen here: https://github.com/spring-projects/spring-framework/pull/34538/files (see the sketch below)
- non-final fields last
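For reference, a minimal sketch of what the suggested lock-based variant looks like in use, with the try/finally unlock that ReentrantLock requires. The surrounding class and method are made up for illustration; only the two fields come from the diff above.

    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    class BufferSketch {
        private final Lock lock = new ReentrantLock();
        private int nextToOutput;

        // Equivalent of a synchronized block, but with an explicit lock:
        // the unlock must go in a finally block so it also runs on exceptions.
        void advance() {
            lock.lock();
            try {
                nextToOutput++;
            } finally {
                lock.unlock();
            }
        }
    }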
Thanks for the suggestion. However, we currently don't use virtual threads; we use platform threads for executing the analysis tasks (see pmd/pmd-core/src/main/java/net/sourceforge/pmd/lang/impl/MultiThreadProcessor.java, line 32 in bc5be45):

executor = Executors.newFixedThreadPool(task.getThreadCount(), new PmdThreadFactory());

Changing the synchronized blocks to ReentrantLocks wouldn't help and looks like a premature optimization. Switching to virtual threads would also require us to move to Java 21... When we do this (I guess PMD will eventually evolve in that direction, but it will take time), we should do it in a separate PR and also take performance measurements to actually see whether we would benefit from virtual threads and/or reentrant locks or not. The synchronized block here is not the only one...
Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...
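Purely to illustrate that last remark: under Java 21, the Java-21 entry of a multi-release jar could create the executor with virtual threads instead of a fixed platform-thread pool. This is a hypothetical sketch, not something this PR does or plans; the PmdThreadFactory from the real code is omitted because virtual threads are created per task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ExecutorChoiceSketch {

        // What MultiThreadProcessor does today: a bounded pool of platform threads.
        static ExecutorService platformPool(int threadCount) {
            return Executors.newFixedThreadPool(threadCount);
        }

        // Hypothetical Java 21 variant: one virtual thread per submitted task.
        // Note there is no built-in bound on concurrency here, which is exactly
        // the concern raised further down in this thread.
        static ExecutorService virtualThreadPerTask() {
            return Executors.newVirtualThreadPerTaskExecutor();
        }
    }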
This might be helpful for doing the migration at a glance; then it would only have to be communicated and synced properly. Certainly, the CI is not included here, but code-wise this should be manageable, right? Or are there a lot of breaking changes that aren't covered by the rewrite?
Useful Resources:
If this truly is a Java virtual threads thing, then it should be covered by some kind of recipe, as that seems mandatory for migrating to the new concurrency goodness. Java 24 will be a tremendous performance improvement, and there the virtual threads feature will work properly, with no more thread pinning issues.
> Btw. switching to Java 21 - we could create a multi-release jar and make use of virtual threads, if PMD is run under Java 21. But again - that's a completely different topic...
I don't think PMD can benefit from virtual threads. Each parallel task (processing of a file) needs too much memory (to build the AST, e.g.), and the IO waiting is mainly waiting for disk loads, which is 1. fast, so there is not much time for context switching, and 2. limited in bandwidth, as in, if you open too many files for reading concurrently, you will overload the OS and the memory bus of your system. So there is no way around pooling a small number of threads and limiting the number of concurrent tasks, which is not the programming model of virtual threads. If you profile PMD in a multithreaded run, however, you will find that CPU utilization is very high. We already use all the parallelism we can, so there is also nothing to gain from virtual threads in terms of parallelism.
Virtual threads were just not designed for applications like ours. They were designed for things like web servers which might be waiting a really long time for a response from a subtask and have nothing to do in between, so you might as well process another request in the meantime. And each waiting subtask requires very little memory, so you can spawn millions if you want.
In summary, if we wanted to use the virtual threads programming model we would spawn one vthread per file to analyze. But that leads to terrible peak memory usage (as in, program crashes with OOM error), overloads OS resources, and also does not increase effective parallelism.
(I had actually tried using those with a Loom prototype, and was disappointed, until I figured out we are not the target audience)
What could potentially be beneficial for us is splitting IO operations off to a dedicated thread, so as to remove all contention on the disk. Thankfully for us, IO is actually the first step, so we could potentially move to a single thread reading each file into memory => submitting it to a work queue to actually parse and analyze the file.
How much of an impact this would make I'm unsure, especially as SSDs become the de-facto standard.
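A rough sketch of that idea with entirely made-up names: a single reader thread loads file contents into a bounded queue, and the worker pool only does the CPU-bound parsing and analysis. Error handling is simplified; this is not code from the PR.

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class IoSplitSketch {

        static final class LoadedFile {
            final Path path;
            final String content;
            LoadedFile(Path path, String content) { this.path = path; this.content = content; }
        }

        void run(List<Path> files, int threadCount) throws InterruptedException {
            // Bounded queue so the reader cannot load arbitrarily many files ahead of the analyzers.
            BlockingQueue<LoadedFile> queue = new ArrayBlockingQueue<>(2 * threadCount);
            ExecutorService workers = Executors.newFixedThreadPool(threadCount);

            // Single dedicated IO thread: reads file contents into memory, one at a time.
            Thread reader = new Thread(() -> {
                for (Path file : files) {
                    try {
                        queue.put(new LoadedFile(file, Files.readString(file)));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e); // simplified error handling
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }, "io-reader");
            reader.start();

            // CPU-bound workers take loaded files off the queue; they never touch the disk.
            for (int i = 0; i < files.size(); i++) {
                LoadedFile loaded = queue.take();
                workers.submit(() -> analyze(loaded));
            }
            reader.join();
            workers.shutdown();
        }

        private void analyze(LoadedFile file) {
            // stand-in for parsing the source and running the rules
        }
    }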
Virtual threads probably won't help much due to the memory limitations mentioned earlier, so their impact may not be as significant as expected.
The guaranteed benefit is saving OS resources, which is not a top priority but still a win. That is virtual threads' primary promise, besides being seamless to integrate, which turned out not to hold in the first place. Thread-pinning issues are promised to be resolved in the upcoming Java 24 release, which was not a surprise for such a high-profile topic.
Virtual threads avoid heavy OS thread usage by multiplexing virtual threads onto a small pool of actual OS threads. In some edge cases on small systems, this might provide speed improvements since, without virtual threads, tasks would just wait when no native thread is available. Virtual threads could potentially release resources for other uses, though this is highly speculative.
Thanks!
pmd-core/src/main/java/net/sourceforge/pmd/reporting/ListenerInitializer.java
pmd-core/src/main/java/net/sourceforge/pmd/reporting/CloseHookFileListener.java
pmd-core/src/main/java/net/sourceforge/pmd/reporting/DeterministicOutputListenerWrapper.java
    tryToFlushBuffer();
} else {
    // if we shouldn't output this report yet, insert it in sorted order for later
    ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
It might be easier to use a TreeSet here, but I'm not sure about the memory impact compared to a LinkedList.
With a TreeSet, we could simply add the new ReportWrapper here, and later in tryToFlushBuffer() iterate through it.
But I'd rather not change it now; it could be a future improvement.
That's true, but I think with a linked list you can efficiently remove elements as you iterate through it with ListIterator. I don't think you can do that with a TreeSet. Not that it's a performance-critical section anyway... but as it's written it might be more efficient.
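To make the trade-off concrete, here is a stripped-down sketch of the two operations being discussed: sorted insertion with a ListIterator, and a flush that removes elements while iterating. ReportWrapper and its index field are simplified stand-ins, not the PR's actual types.

    import java.util.LinkedList;
    import java.util.ListIterator;

    class SortedBufferSketch {

        static final class ReportWrapper {
            final int index;       // position assigned to the file in submission order
            final Object report;   // stand-in for the buffered report data
            ReportWrapper(int index, Object report) { this.index = index; this.report = report; }
        }

        private final LinkedList<ReportWrapper> reportBuffer = new LinkedList<>();
        private int nextToOutput;

        // Insert the wrapper at its sorted position by walking the list with a ListIterator.
        void bufferReport(ReportWrapper wrapper) {
            ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
            while (iter.hasNext()) {
                if (iter.next().index > wrapper.index) {
                    iter.previous(); // step back so add() inserts before the larger element
                    break;
                }
            }
            iter.add(wrapper);
        }

        // Flush the longest prefix of consecutive indices; ListIterator.remove() lets us
        // drop each flushed element in O(1) while we iterate.
        void tryToFlushBuffer() {
            ListIterator<ReportWrapper> iter = reportBuffer.listIterator();
            while (iter.hasNext()) {
                ReportWrapper next = iter.next();
                if (next.index != nextToOutput) {
                    break;
                }
                output(next);
                iter.remove();
                nextToOutput++;
            }
        }

        private void output(ReportWrapper wrapper) {
            // stand-in for handing the report to the renderer
        }
    }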
Describe the PR
When running PMD with multiple threads and using an AbstractIncrementingRenderer (such as text, xml, sarif, etc.), the order of the files in the report depends on thread scheduling. This makes reports non-identical from one run to the next when using multithreaded processing.
This PR fixes this issue by buffering reports for individual files and outputting them in a predetermined order. The buffer is flushed as soon as possible, not only at the end of the analysis. This keeps the buffering pretty lightweight and preserves the "incremental" nature of those renderers, meaning they output the report in real time instead of needing the entire analysis output before they start writing. In one run I analyzed 14000 classes with PMD and collected 700 violations; during the whole run the buffer size did not exceed 150 Report instances, which I consider a very good result.
Context: I think this is a prerequisite to enabling multithreading by default in the CLI. Note that it is already enabled by default in PMDConfiguration.
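To illustrate the mechanism, here is a toy model of the approach, not the PR's actual DeterministicOutputListenerWrapper (the real implementation keeps only a small buffer of pending reports instead of a slot per file): each file gets its index in submission order, workers finish in arbitrary order, and output happens strictly by index.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    class DeterministicOutputToyModel {

        private final List<String> slots;   // one slot per file, filled as reports arrive
        private int nextToOutput;

        DeterministicOutputToyModel(int fileCount) {
            slots = new ArrayList<>(fileCount);
            for (int i = 0; i < fileCount; i++) {
                slots.add(null);
            }
        }

        // Called from worker threads when a file's report is ready, in arbitrary order.
        synchronized void fileDone(int index, String report) {
            slots.set(index, report);
            // Flush every report that is now ready, strictly in index order.
            while (nextToOutput < slots.size() && slots.get(nextToOutput) != null) {
                System.out.println(slots.get(nextToOutput));
                nextToOutput++;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            int fileCount = 10;
            DeterministicOutputToyModel out = new DeterministicOutputToyModel(fileCount);
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < fileCount; i++) {
                final int index = i; // index is assigned in submission order, before analysis runs
                pool.submit(() -> {
                    try {
                        // simulate analysis taking a variable amount of time per file
                        TimeUnit.MILLISECONDS.sleep(ThreadLocalRandom.current().nextInt(50));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    out.fileDone(index, "report for file " + index);
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // prints "report for file 0" through "report for file 9" in order every run
        }
    }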
Related issues
Ready?
./mvnw clean verify passes (checked automatically by GitHub Actions)