8000 Fix ResponseStoreCleaner bugs causing unbounded response store growth and cleanup failures by yashmayya · Pull Request #17622 · apache/pinot · GitHub
[go: up one dir, main page]

Skip to content

Fix ResponseStoreCleaner bugs causing unbounded response store growth a 10BC0 nd cleanup failures#17622

Merged
yashmayya merged 2 commits intoapache:masterfrom
yashmayya:fix-response-store-cleaner-bugs
Feb 5, 2026
Merged

Fix ResponseStoreCleaner bugs causing unbounded response store growth and cleanup failures#17622
yashmayya merged 2 commits intoapache:masterfrom
yashmayya:fix-response-store-cleaner-bugs

Conversation

@yashmayya
Copy link
Contributor
@yashmayya yashmayya commented Feb 3, 2026

Summary

Fixes critical bugs in ResponseStoreCleaner that caused:

  • Response store directories on brokers to grow unbounded (100s of GBs) until disk exhaustion
  • Cleanup jobs to fail with timeout errors when response stores are large
  • Cascading 404 errors during delete operations due to duplicate requests

Problem

The ResponseStoreCleaner periodic task had several issues preventing effective cleanup:

  1. URL accumulation bug: The brokerUrls list was declared outside the broker iteration loop but never cleared between iterations. This caused:

    • Delete URLs to accumulate across all brokers
    • Requests being sent to wrong brokers (e.g., trying to delete broker1's responses via broker2)
    • Duplicate delete requests causing 404 errors
  2. Insufficient timeout: The 3-second timeout was too short for brokers with large response stores, as getAllStoredResponses() must read and deserialize every response file.

  3. 404 treated as failure: Delete operations returning 404 were treated as errors, but for idempotent cleanup, a missing response means the goal is already achieved.

  4. Single broker failure blocked all cleanup: An exception from any broker stopped processing of all remaining brokers.

Changes

  • Split timeouts: Increased GET timeout to 60s (from 3s) for listing operations; DELETE timeout set to 10s
  • Fixed URL scoping: Moved brokerUrls declaration inside the broker loop so each broker's URLs are processed independently
  • Idempotent deletes: 404 responses are now treated as success (response already deleted)
  • Partial failure tolerance: Cleanup continues for remaining brokers even if one fails; only throws if ALL brokers fail
  • Better error isolation: Separate try-catch blocks for auth, GET, and per-broker DELETE operations
  • Improved logging: Added structured logging for cleanup progress and partial failures

Test Plan

  • Verify existing CursorIntegrationTest passes
  • Manual testing with multiple brokers to verify per-broker URL isolation
  • Verify 404 responses during delete are logged at DEBUG level and don't cause failures
  • Test partial failure scenario (one broker down) to confirm other brokers are still cleaned

@yashmayya yashmayya force-pushed the fix-response-store-cleaner-bugs branch 3 times, most recently from d2744d7 to 2689a29 Compare February 3, 2026 19:23
@codecov-commenter
Copy link
codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 80.82192% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.23%. Comparing base (7830751) to head (b634aba).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...pinot/controller/cursors/ResponseStoreCleaner.java 80.82% 12 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17622      +/-   ##
============================================
+ Coverage     63.19%   63.23%   +0.04%     
  Complexity     1499     1499              
============================================
  Files          3174     3174              
  Lines        190259   190303      +44     
  Branches      29072    29079       +7     
============================================
+ Hits         120234   120346     +112     
+ Misses        60690    60621      -69     
- Partials       9335     9336       +1     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.19% <80.82%> (+0.03%) ⬆️
java-21 63.20% <72.60%> (+0.05%) ⬆️
temurin 63.23% <80.82%> (+0.04%) ⬆️
unittests 63.23% <80.82%> (+0.04%) ⬆️
unittests1 55.60% <ø> (+<0.01%) ⬆️
unittests2 34.07% <80.82%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@yashmayya yashmayya marked this pull request as ready for review February 3, 2026 20:34
@xiangfu0 xiangfu0 requested a review from Copilot February 5, 2026 05:01
Copy link
Contributor
Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes critical bugs in the ResponseStoreCleaner that caused unbounded response store growth and cleanup failures on brokers.

Changes:

  • Fixed URL accumulation bug by moving brokerUrls declaration inside the broker loop to prevent cross-broker request contamination
  • Increased GET timeout from 3s to 60s and added separate DELETE timeout of 10s to handle large response stores
  • Made cleanup idempotent by treating 404 responses as success during delete operations
  • Improved error isolation with separate try-catch blocks and partial failure tolerance to prevent single broker failures from blocking cleanup of other brokers

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
ResponseStoreCleaner.java Fixed URL scoping bug, split timeouts, added 404 handling, improved error isolation and logging
ResponseStoreCleanerTest.java Added comprehensive test coverage for cleanup scenarios including multi-broker isolation, 404 handling, and partial failures

@yashmayya yashmayya requested a review from xiangfu0 February 5, 2026 15:37
Copy link
Contributor
@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm, since the delete candidates could be large, we could also try to stream the response list and process/delete them in batch.

@yashmayya
Copy link
Contributor Author

That's a good suggestion but I think that will be a larger refactor than this targeted bugfix, and I've created a separate issue for it as a future enhancement - #17643.

@yashmayya yashmayya merged commit 3f54660 into apache:master Feb 5, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

0