-
-
Notifications
You must be signed in to change notification settings - Fork 4.8k
Don't fail task on timeout during cold shutdown #9678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9678 +/- ##
=======================================
Coverage 78.61% 78.62%
=======================================
Files 153 153
Lines 19172 19180 +8
Branches 2539 2542 +3
=======================================
+ Hits 15073 15080 +7
- Misses 3807 3811 +4
+ Partials 292 289 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we consider improving test coverage and if possible some integration tests? what do you think?
@daveisfera do you want to give this PR a test to confirm it works for you as well? |
I tested this fix and it appears to be handling it correctly now, but the Here's the output:
And here's the result saved in the database:
|
Thank you, I’ll check it out |
c034f3a
to
06dd9b3
Compare
For the sake of documentation, I tried this latest version ( |
6eecfaf
to
5a62a26
Compare
I’ve added a potential fix in 5a62a26 The idea is that canceling a successful task changes the status to RETRY, so avoiding it should fix this in theory. Thank you! |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
The status is still set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also consider some integration tests for this PR?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR ensures that during a cold shutdown the worker skips normal timeout and failure handling to terminate quickly without logging errors or marking tasks as failed.
- Introduces a
should_terminate
flag set on cold shutdown to bypass error logging and backend failure marking inon_timeout
andon_failure
. - Updates
cancel_all_unacked_requests
to preserve tasks marked as successful. - Adds unit and smoke tests covering behavior when
should_terminate
is true.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
t/unit/worker/test_request.py | Adds tests for skipping error and backend calls on failure/timeout when terminating. |
t/unit/worker/test_consumer.py | Adds a test to ensure successful tasks are not cancelled. |
t/smoke/tests/test_worker.py | Adds smoke tests verifying tasks complete during soft shutdown. |
celery/worker/request.py | Guards error logging and backend failure marking behind should_terminate . |
celery/worker/consumer/consumer.py | Imports successful_requests and updates cancellation logic to skip successful tasks. |
celery/apps/worker.py | Sets state.should_terminate = True during cold shutdown. |
Comments suppressed due to low confidence (2)
t/smoke/tests/test_worker.py:238
- [nitpick] Class names should follow
PascalCase
. Renametest_time_limit
toTestTimeLimit
for consistency with other test classes.
class test_time_limit(SuiteOperations):
celery/worker/consumer/consumer.py:751
successful_requests
is a globalset
accessed without synchronization. Consider using a thread-safe collection or adding a lock to avoid potential race conditions when tasks are marked successful concurrently.
if request.id in successful_requests:
job.send_event.assert_not_called() | ||
job.task.backend.mark_as_failure.assert_not_called() | ||
finally: | ||
state.should_terminate = original_should_terminate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
original_should_terminate
is never initialized before being restored. Capture the previous value of state.should_terminate
(e.g., original_should_terminate = state.should_terminate
) before setting it to True
.
Copilot uses AI. Check for mistakes.
ughs... i shouldnt trust the copilot suggestion.... now cant revert |
WalkthroughThe changes update Celery's worker shutdown logic to ensure that tasks completed during a cold shutdown are not incorrectly marked as failed or timed out. The worker state now tracks termination explicitly, and both the consumer and request handling logic are adjusted to respect this state. New and updated tests verify correct task completion and result reporting during shutdown, including scenarios with time limits. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Worker
participant Consumer
participant Request
participant State
User->>Worker: Send SIGQUIT (soft shutdown)
Worker->>State: Set should_terminate = True
Worker->>Consumer: Cancel all unacked requests
Consumer->>State: Check successful_requests
Consumer->>Request: Only cancel unacked, non-successful requests
Worker->>Request: Stop pool, tasks finish
Request->>State: Check should_terminate
alt should_terminate is True
Request-->>Request: Skip failure/timeout reporting
else
Request-->>Request: Report failure/timeout as usual
end
Worker->>User: Exit after shutdown
Assessment against linked issues
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
t/unit/worker/test_request.py (1)
873-890
: Fix undefined variable in test cleanup.The variable
original_should_terminate
is used in thefinally
block but never defined. This will cause aNameError
when the test runs.Apply this fix to capture the original value before modification:
def test_on_failure_should_terminate(self): from celery.worker import state + original_should_terminate = state.should_terminate state.should_terminate = True job = self.xRequest() job.send_event = Mock(name='send_event') job.task.backend = Mock(name='backend') try: try: raise KeyError('foo') except KeyError: exc_info = ExceptionInfo() job.on_failure(exc_info) job.send_event.assert_not_called() job.task.backend.mark_as_failure.assert_not_called() finally: state.should_terminate = original_should_terminate
🧹 Nitpick comments (1)
celery/worker/consumer/consumer.py (1)
751-754
: LGTM: Logic correctly preserves successful tasks from cancellationThe conditional check properly prevents cancellation of requests that are marked as successful, even if they haven't been acknowledged yet. This is a key part of the fix to avoid incorrectly marking successful tasks as failed during shutdown.
The static analysis tool suggests simplifying the condition. While the current implementation is clear, you could consider:
- if request.id in successful_requests: - # Unless it was successful, in which case we don't want to cancel it. - return False + # Unless it was successful, in which case we don't want to cancel it. + return request.id not in successful_requestsHowever, the current approach is more explicit and readable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
celery/apps/worker.py
(1 hunks)celery/worker/consumer/consumer.py
(3 hunks)celery/worker/request.py
(2 hunks)t/smoke/tests/test_worker.py
(1 hunks)t/unit/worker/test_consumer.py
(2 hunks)t/unit/worker/test_request.py
(2 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
celery/worker/consumer/consumer.py
751-754: Return the negated condition directly
(SIM103)
t/unit/worker/test_request.py
890-890: Undefined name original_should_terminate
(F821)
🪛 GitHub Actions: Linter
t/unit/worker/test_request.py
[error] 890-890: flake8: undefined name 'original_should_terminate' (F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Analyze (python)
- GitHub Check: docker-build
- GitHub Check: smoke-tests_latest
- GitHub Check: Unit (3.12, windows-latest)
- GitHub Check: Unit (3.12, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (3.10, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (pypy-3.10, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (3.11, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (3.13, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (3.8, windows-latest)
- GitHub Check: Unit (3.9, blacksmith-4vcpu-ubuntu-2404)
- GitHub Check: Unit (3.8, blacksmith-4vcpu-ubuntu-2404)
🔇 Additional comments (9)
celery/apps/worker.py (1)
415-416
: LGTM: Clean implementation of termination flag during cold shutdownThe addition of
state.should_terminate = True
is correctly placed and serves the core purpose of this PR. Setting this flag during cold shutdown allows other components to detect the termination state and avoid inappropriate error handling flows.celery/worker/consumer/consumer.py (1)
34-35
: LGTM: Added necessary import for successful_requests trackingThe import of
successful_requests
is correctly added to support the new functionality that preserves successful tasks from cancellation during shutdown.t/unit/worker/test_consumer.py (2)
22-22
: LGTM: Import added for test coverageThe import of
successful_requests
is correctly added to support the new test functionality.
476-494
: LGTM: Comprehensive test coverage for the new functionalityThe test effectively verifies that successful tasks are preserved from cancellation during shutdown. The test setup is thorough:
- Creates a mock request with
acks_late=True
andacknowledged=False
- Adds the request to both
active_requests
andsuccessful_requests
- Verifies that
cancel
is not called on the successful request- Includes proper cleanup to avoid test state leakage
This test ensures the refined cancellation logic works correctly and prevents regression.
celery/worker/request.py (2)
528-537
: Good implementation of timeout handling during cold shutdown.The conditional logic correctly prevents timeout errors from being logged and tasks from being marked as failed when the worker is terminating. This aligns with the PR objective to avoid failing tasks during cold shutdown.
622-626
: Correct suppression of failure handling during termination.The logic properly suppresses failure event sending and marks the return as OK when the worker is terminating, preventing inappropriate failure handling during cold shutdown.
t/unit/worker/test_request.py (1)
959-972
: Test logic is correct and follows proper cleanup pattern.This test correctly captures the original value and restores it in the finally block, ensuring proper test isolation.
t/smoke/tests/test_worker.py (2)
218-236
: Excellent test coverage for soft shutdown task completion.This test validates that tasks can complete successfully during soft shutdown, which directly supports the PR's objective of preventing incorrect task failures during worker termination.
247-278
: Comprehensive validation of shutdown behavior with time limits.This test thoroughly verifies that tasks complete successfully during soft shutdown even when time limits are configured, and confirms that timeout errors are not inappropriately triggered. The assertions for both success state and absence of timeout errors are particularly valuable.
What happened? And yes my friend, copilot is horribly unreliable, always double check yourself 🙏🙏 What can't/should be reverted? |
Also, who turned on coderabbit? The AI-based solutions aren't viable without proper adjustments, we should avoid using them without prior consensus. They may cause a lot of confusion and time wasting. We'll add back/configure our ecosystem accordingly, but it may cause more harm if not used wisely, and worse, cause confusion and possible bugs which is critical. Please take note 🙏 |
Fixes #9505
After the soft shutdown is over, the cold shutdown cannot honor the error-handling flow, as the worker must terminate as fast as possible so we may skip the timeout failure in that specific edge case.
Summary by CodeRabbit
Bug Fixes
Tests