Fix ResponseStoreCleaner bugs causing unbounded response store growth a 10BC0 nd cleanup failures#17622
Conversation
d2744d7 to
2689a29
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #17622 +/- ##
============================================
+ Coverage 63.19% 63.23% +0.04%
Complexity 1499 1499
============================================
Files 3174 3174
Lines 190259 190303 +44
Branches 29072 29079 +7
============================================
+ Hits 120234 120346 +112
+ Misses 60690 60621 -69
- Partials 9335 9336 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes critical bugs in the ResponseStoreCleaner that caused unbounded response store growth and cleanup failures on brokers.
Changes:
- Fixed URL accumulation bug by moving
brokerUrlsdeclaration inside the broker loop to prevent cross-broker request contamination - Increased GET timeout from 3s to 60s and added separate DELETE timeout of 10s to handle large response stores
- Made cleanup idempotent by treating 404 responses as success during delete operations
- Improved error isolation with separate try-catch blocks and partial failure tolerance to prevent single broker failures from blocking cleanup of other brokers
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ResponseStoreCleaner.java | Fixed URL scoping bug, split timeouts, added 404 handling, improved error isolation and logging |
| ResponseStoreCleanerTest.java | Added comprehensive test coverage for cleanup scenarios including multi-broker isolation, 404 handling, and partial failures |
pinot-controller/src/main/java/org/apache/pinot/controller/cursors/ResponseStoreCleaner.java
Outdated
Show resolved
Hide resolved
...t-controller/src/test/java/org/apache/pinot/controller/cursors/ResponseStoreCleanerTest.java
Show resolved
Hide resolved
...t-controller/src/test/java/org/apache/pinot/controller/cursors/ResponseStoreCleanerTest.java
Show resolved
Hide resolved
pinot-controller/src/main/java/org/apache/pinot/controller/cursors/ResponseStoreCleaner.java
Outdated
Show resolved
Hide resolved
pinot-controller/src/main/java/org/apache/pinot/controller/cursors/ResponseStoreCleaner.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Overall lgtm, since the delete candidates could be large, we could also try to stream the response list and process/delete them in batch.
557511b to
b634aba
Compare
|
That's a good suggestion but I think that will be a larger refactor than this targeted bugfix, and I've created a separate issue for it as a future enhancement - #17643. |
Summary
Fixes critical bugs in
ResponseStoreCleanerthat caused:Problem
The
ResponseStoreCleanerperiodic task had several issues preventing effective cleanup:URL accumulation bug: The
brokerUrlslist was declared outside the broker iteration loop but never cleared between iterations. This caused:Insufficient timeout: The 3-second timeout was too short for brokers with large response stores, as
getAllStoredResponses()must read and deserialize every response file.404 treated as failure: Delete operations returning 404 were treated as errors, but for idempotent cleanup, a missing response means the goal is already achieved.
Single broker failure blocked all cleanup: An exception from any broker stopped processing of all remaining brokers.
Changes
brokerUrlsdeclaration inside the broker loop so each broker's URLs are processed independentlyTest Plan
CursorIntegrationTestpasses