8000 Don't fail task on timeout during cold shutdown by Nusnus · Pull Request #9678 · celery/celery · GitHub
[go: up one dir, main page]

Skip to content

Don't fail task on timeout during cold shutdown #9678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

Nusnus
Copy link
Member
@Nusnus Nusnus commented Apr 29, 2025

Fixes #9505

After the soft shutdown is over, the cold shutdown cannot honor the error-handling flow, as the worker must terminate as fast as possible so we may skip the timeout failure in that specific edge case.

Summary by CodeRabbit

  • Bug Fixes

    • Improved worker shutdown handling to prevent redundant task failure reporting and backend updates during cold shutdown.
    • Enhanced cancellation logic to ensure successful but unacknowledged tasks are not canceled during shutdown or connection loss.
  • Tests

    • Added tests to verify tasks complete successfully during soft shutdown, including scenarios with task time limits.
    • Introduced unit tests to confirm correct suppression of failure and timeout handling when the worker is terminating.
    • Added tests to ensure successful tasks are preserved during cancellation of unacknowledged requests.

Copy link
codecov bot commented Apr 29, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 78.62%. Comparing base (9cb389d) to head (44205e6).

Files with missing lines Patch % Lines
celery/apps/worker.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9678   +/-   ##
=======================================
  Coverage   78.61%   78.62%           
=======================================
  Files         153      153           
  Lines       19172    19180    +8     
  Branches     2539     2542    +3     
=======================================
+ Hits        15073    15080    +7     
- Misses       3807     3811    +4     
+ Partials      292      289    -3     
Flag Coverage Δ
unittests 78.60% <83.33%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Nusnus Nusnus marked this pull request as ready for review April 29, 2025 00:51
@Nusnus Nusnus added this to the 5.5.3 milestone Apr 29, 2025
@auvipy auvipy requested a review from Copilot April 29, 2025 05:11
Copilot

This comment was marked as resolved.

Copy link
Member
@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider improving test coverage and if possible some integration tests? what do you think?

@Nusnus
Copy link
Member Author
Nusnus commented May 4, 2025

@daveisfera do you want to give this PR a test to confirm it works for you as well?

daveisfera added a commit to daveisfera/celery_cold_shutdown that referenced this pull request May 5, 2025
@daveisfera
Copy link
Contributor

@daveisfera do you want to give this PR a test to confirm it works for you as well?

I tested this fix and it appears to be handling it correctly now, but the status is still incorrectly set to RETRY even though the task finished.

Here's the output:

2025-05-05T16:32:25.134218717Z [2025-05-05 16:32:25,133: INFO/MainProcess] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] received
2025-05-05T16:32:25.135509561Z [2025-05-05 16:32:25,135: WARNING/ForkPoolWorker-1] Value: 8
2025-05-05T16:32:25.322559812Z 
2025-05-05T16:32:25.322610489Z worker: Hitting Ctrl+C again will terminate all running tasks!
2025-05-05T16:32:25.322829567Z [2025-05-05 16:32:25,322: WARNING/MainProcess] Initiating Soft Shutdown, terminating in 16 seconds
2025-05-05T16:32:33.145373138Z [2025-05-05 16:32:33,144: WARNING/ForkPoolWorker-1] Done: 8
2025-05-05T16:32:33.162053417Z [2025-05-05 16:32:33,161: INFO/ForkPoolWorker-1] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] succeeded in 8.02601714399998s: None
2025-05-05T16:32:41.381467890Z 
2025-05-05T16:32:41.381502893Z worker: Cold shutdown (MainProcess)

And here's the result saved in the database:

sqlite> SELECT * FROM django_celery_results_taskresult ORDER BY id DESC LIMIT 10;
1|562e669b-41f4-4e7b-a188-e36f71560b9f|RETRY|application/json|utf-8|{"exc_type": "Retry", "exc_message": ["Retry(Retry(...), None, None)", null, null], "exc_module": "celery.exceptions"}|2025-05-05 16:32:41.348411||{"children": []}|"(8,)"|"{}"|mysite.celery.long_task|2025-05-05 16:32:33.155238|celery@celery|

@Nusnus
Copy link
Member Author
Nusnus commented May 8, 2025

@daveisfera do you want to give this PR a test to confirm it works for you as well?

I tested this fix and it appears to be handling it correctly now, but the status is still incorrectly set to RETRY even though the task finished.

Here's the output:

2025-05-05T16:32:25.134218717Z [2025-05-05 16:32:25,133: INFO/MainProcess] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] received
2025-05-05T16:32:25.135509561Z [2025-05-05 16:32:25,135: WARNING/ForkPoolWorker-1] Value: 8
2025-05-05T16:32:25.322559812Z 
2025-05-05T16:32:25.322610489Z worker: Hitting Ctrl+C again will terminate all running tasks!
2025-05-05T16:32:25.322829567Z [2025-05-05 16:32:25,322: WARNING/MainProcess] Initiating Soft Shutdown, terminating in 16 seconds
2025-05-05T16:32:33.145373138Z [2025-05-05 16:32:33,144: WARNING/ForkPoolWorker-1] Done: 8
2025-05-05T16:32:33.162053417Z [2025-05-05 16:32:33,161: INFO/ForkPoolWorker-1] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] succeeded in 8.02601714399998s: None
2025-05-05T16:32:41.381467890Z 
2025-05-05T16:32:41.381502893Z worker: Cold shutdown (MainProcess)

And here's the result saved in the database:

sqlite> SELECT * FROM django_celery_results_taskresult ORDER BY id DESC LIMIT 10;
1|562e669b-41f4-4e7b-a188-e36f71560b9f|RETRY|application/json|utf-8|{"exc_type": "Retry", "exc_message": ["Retry(Retry(...), None, None)", null, null], "exc_module": "celery.exceptions"}|2025-05-05 16:32:41.348411||{"children": []}|"(8,)"|"{}"|mysite.celery.long_task|2025-05-05 16:32:33.155238|celery@celery|

Thank you, I’ll check it out

@Nusnus Nusnus force-pushed the hotfix branch 2 times, most recently from c034f3a to 06dd9b3 Compare May 18, 2025 19:40
@daveisfera
Copy link
Contributor

For the sake of documentation, I tried this latest version (06dd9b3) and still had the issue of recording the status as RETRY

@Nusnus Nusnus force-pushed the hotfix branch 2 times, most recently from 6eecfaf to 5a62a26 Compare May 31, 2025 19:45
@Nusnus Nusnus marked this pull request as draft May 31, 2025 19:54
@Nusnus
Copy link
Member Author
Nusnus commented May 31, 2025

@daveisfera

For the sake of documentation, I tried this latest version (06dd9b3) and still had the issue of recording the status as RETRY

I’ve added a potential fix in 5a62a26

The idea is that canceling a successful task changes the status to RETRY, so avoiding it should fix this in theory.
Can you please verify the latest version of this PR works as expected?

Thank you!

@Nusnus Nusnus marked this pull request as ready for review June 1, 2025 20:02
Copy link
Member
@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@daveisfera
Copy link
Contributor

I’ve added a potential fix in 5a62a26

The idea is that canceling a successful task changes the status to RETRY, so avoiding it should fix this in theory. Can you please verify the latest version of this PR works as expected?

The status is still set to RETRY even though the task completes successfully. That doesn't reflect what actually happened and seems "wrong", so why not set it to SUCCESS to match the actual result?

daveisfera added a commit to daveisfera/celery_cold_shutdown that referenced this pull request Jun 3, 2025
@auvipy auvipy modified the milestones: 5.5.3, 5.5.4 Jun 10, 2025
@auvipy auvipy requested a review from Copilot July 3, 2025 03:17
Copilot

This comment was marked as resolved.

Copy link
Member
@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also consider some integration tests for this PR?

@auvipy auvipy modified the milestones: 5.5.4, 5.6.0 Jul 3, 2025
auvipy and others added 3 commits July 7, 2025 11:32
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@auvipy auvipy requested a review from Copilot July 7, 2025 05:33
Copy link
Contributor
@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures that during a cold shutdown the worker skips normal timeout and failure handling to terminate quickly without logging errors or marking tasks as failed.

  • Introduces a should_terminate flag set on cold shutdown to bypass error logging and backend failure marking in on_timeout and on_failure.
  • Updates cancel_all_unacked_requests to preserve tasks marked as successful.
  • Adds unit and smoke tests covering behavior when should_terminate is true.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
t/unit/worker/test_request.py Adds tests for skipping error and backend calls on failure/timeout when terminating.
t/unit/worker/test_consumer.py Adds a test to ensure successful tasks are not cancelled.
t/smoke/tests/test_worker.py Adds smoke tests verifying tasks complete during soft shutdown.
celery/worker/request.py Guards error logging and backend failure marking behind should_terminate.
celery/worker/consumer/consumer.py Imports successful_requests and updates cancellation logic to skip successful tasks.
celery/apps/worker.py Sets state.should_terminate = True during cold shutdown.
Comments suppressed due to low confidence (2)

t/smoke/tests/test_worker.py:238

  • [nitpick] Class names should follow PascalCase. Rename test_time_limit to TestTimeLimit for consistency with other test classes.
        class test_time_limit(SuiteOperations):

celery/worker/consumer/consumer.py:751

  • successful_requests is a global set accessed without synchronization. Consider using a thread-safe collection or adding a lock to avoid potential race conditions when tasks are marked successful concurrently.
                if request.id in successful_requests:

job.send_event.assert_not_called()
job.task.backend.mark_as_failure.assert_not_called()
finally:
state.should_terminate = original_should_terminate
Copy link
Preview
Copilot AI Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original_should_terminate is never initialized before being restored. Capture the previous value of state.should_terminate (e.g., original_should_terminate = state.should_terminate) before setting it to True.

Copilot uses AI. Check for mistakes.

@auvipy
Copy link
Member
auvipy commented Jul 7, 2025

ughs... i shouldnt trust the copilot suggestion.... now cant revert

Copy link
coderabbitai bot commented Jul 14, 2025

Walkthrough

The changes update Celery's worker shutdown logic to ensure that tasks completed during a cold shutdown are not incorrectly marked as failed or timed out. The worker state now tracks termination explicitly, and both the consumer and request handling logic are adjusted to respect this state. New and updated tests verify correct task completion and result reporting during shutdown, including scenarios with time limits.

Changes

File(s) Change Summary
celery/apps/worker.py Sets should_terminate flag in worker state during cold shutdown to signal termination.
celery/worker/consumer/consumer.py Updates cancellation logic to preserve successful but unacknowledged requests; imports successful_requests.
celery/worker/request.py Modifies on_timeout and on_failure to bypass error reporting and backend updates if worker is terminating.
t/smoke/tests/test_worker.py Adds integration tests for task completion during soft shutdown, including with time limits.
t/unit/worker/test_consumer.py Adds unit test to verify successful tasks are not canceled during unacked request cancellation.
t/unit/worker/test_request.py Adds unit tests to verify failure and timeout handling are suppressed during worker termination.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Worker
    participant Consumer
    participant Request
    participant State

    User->>Worker: Send SIGQUIT (soft shutdown)
    Worker->>State: Set should_terminate = True
    Worker->>Consumer: Cancel all unacked requests
    Consumer->>State: Check successful_requests
    Consumer->>Request: Only cancel unacked, non-successful requests
    Worker->>Request: Stop pool, tasks finish
    Request->>State: Check should_terminate
    alt should_terminate is True
        Request-->>Request: Skip failure/timeout reporting
    else
        Request-->>Request: Report failure/timeout as usual
    end
    Worker->>User: Exit after shutdown
Loading

Assessment against linked issues

Objective Addressed Explanation
Tasks completed during a cold shutdown are not incorrectly marked as failed or timed out (#9505)
Tasks that complete on cold shutdown are reported as SUCCESS, not FAILURE (#9505)
Prevent redundant/conflicting failure handling during shutdown (#9505)
Tests added to verify correct behavior for shutdown and time limit scenarios (#9505)

Poem

In the warren where tasks abound,
Cold shutdowns once made errors resound.
Now with a flag and careful checks,
Success is honored, no more wrecks!
🐇 Celery’s worker hops with pride—
Tasks complete, and bugs subside!
Hooray for a graceful bunny ride!

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
t/unit/worker/test_request.py (1)

873-890: Fix undefined variable in test cleanup.

The variable original_should_terminate is used in the finally block but never defined. This will cause a NameError when the test runs.

Apply this fix to capture the original value before modification:

 def test_on_failure_should_terminate(self):
     from celery.worker import state
+    original_should_terminate = state.should_terminate
     state.should_terminate = True
     job = self.xRequest()
     job.send_event = Mock(name='send_event')
     job.task.backend = Mock(name='backend')

     try:
         try:
             raise KeyError('foo')
         except KeyError:
             exc_info = ExceptionInfo()
             job.on_failure(exc_info)

         job.send_event.assert_not_called()
         job.task.backend.mark_as_failure.assert_not_called()
     finally:
         state.should_terminate = original_should_terminate
🧹 Nitpick comments (1)
celery/worker/consumer/consumer.py (1)

751-754: LGTM: Logic correctly preserves successful tasks from cancellation

The conditional check properly prevents cancellation of requests that are marked as successful, even if they haven't been acknowledged yet. This is a key part of the fix to avoid incorrectly marking successful tasks as failed during shutdown.

The static analysis tool suggests simplifying the condition. While the current implementation is clear, you could consider:

-                if request.id in successful_requests:
-                    # Unless it was successful, in which case we don't want to cancel it.
-                    return False
+                # Unless it was successful, in which case we don't want to cancel it.
+                return request.id not in successful_requests

However, the current approach is more explicit and readable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf1c98b and 25402ef.

📒 Files selected for processing (6)
  • celery/apps/worker.py (1 hunks)
  • celery/worker/consumer/consumer.py (3 hunks)
  • celery/worker/request.py (2 hunks)
  • t/smoke/tests/test_worker.py (1 hunks)
  • t/unit/worker/test_consumer.py (2 hunks)
  • t/unit/worker/test_request.py (2 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
celery/worker/consumer/consumer.py

751-754: Return the negated condition directly

(SIM103)

t/unit/worker/test_request.py

890-890: Undefined name original_should_terminate

(F821)

🪛 GitHub Actions: Linter
t/unit/worker/test_request.py

[error] 890-890: flake8: undefined name 'original_should_terminate' (F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Analyze (python)
  • GitHub Check: docker-build
  • GitHub Check: smoke-tests_latest
  • GitHub Check: Unit (3.12, windows-latest)
  • GitHub Check: Unit (3.12, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (3.10, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (pypy-3.10, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (3.11, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (3.13, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (3.8, windows-latest)
  • GitHub Check: Unit (3.9, blacksmith-4vcpu-ubuntu-2404)
  • GitHub Check: Unit (3.8, blacksmith-4vcpu-ubuntu-2404)
🔇 Additional comments (9)
celery/apps/worker.py (1)

415-416: LGTM: Clean implementation of termination flag during cold shutdown

The addition of state.should_terminate = True is correctly placed and serves the core purpose of this PR. Setting this flag during cold shutdown allows other components to detect the termination state and avoid inappropriate error handling flows.

celery/worker/consumer/consumer.py (1)

34-35: LGTM: Added necessary import for successful_requests tracking

The import of successful_requests is correctly added to support the new functionality that preserves successful tasks from cancellation during shutdown.

t/unit/worker/test_consumer.py (2)

22-22: LGTM: Import added for test coverage

The import of successful_requests is correctly added to support the new test functionality.


476-494: LGTM: Comprehensive test coverage for the new functionality

The test effectively verifies that successful tasks are preserved from cancellation during shutdown. The test setup is thorough:

  • Creates a mock request with acks_late=True and acknowledged=False
  • Adds the request to both active_requests and successful_requests
  • Verifies that cancel is not called on the successful request
  • Includes proper cleanup to avoid test state leakage

This test ensures the refined cancellation logic works correctly and prevents regression.

celery/worker/request.py (2)

528-537: Good implementation of timeout handling during cold shutdown.

The conditional logic correctly prevents timeout errors from being logged and tasks from being marked as failed when the worker is terminating. This aligns with the PR objective to avoid failing tasks during cold shutdown.


622-626: Correct suppression of failure handling during termination.

The logic properly suppresses failure event sending and marks the return as OK when the worker is terminating, preventing inappropriate failure handling during cold shutdown.

t/unit/worker/test_request.py (1)

959-972: Test logic is correct and follows proper cleanup pattern.

This test correctly captures the original value and restores it in the finally block, ensuring proper test isolation.

t/smoke/tests/test_worker.py (2)

218-236: Excellent test coverage for soft shutdown task completion.

This test validates that tasks can complete successfully during soft shutdown, which directly supports the PR's objective of preventing incorrect task failures during worker termination.


247-278: Comprehensive validation of shutdown behavior with time limits.

This test thoroughly verifies that tasks complete successfully during soft shutdown even when time limits are configured, and confirms that timeout errors are not inappropriately triggered. The assertions for both success state and absence of timeout errors are particularly valuable.

@Nusnus
Copy link
Member Author
Nusnus commented Jul 14, 2025

@auvipy

ughs... i shouldnt trust the copilot suggestion.... now cant revert

What happened?

And yes my friend, copilot is horribly unreliable, always double check yourself 🙏🙏

What can't/should be reverted?

@Nusnus
Copy link
Member Author
Nusnus commented Jul 14, 2025

Also, who turned on coderabbit?
@auvipy let's turn it off please.

The AI-based solutions aren't viable without proper adjustments, we should avoid using them without prior consensus. They may cause a lot of confusion and time wasting.

We'll add back/configure our ecosystem accordingly, but it may cause more harm if not used wisely, and worse, cause confusion and possible bugs which is critical.

Please take note 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tasks completed during a cold shutdown incorrectly timeout with 5.5
3 participants
0