8000 fix(shared-sub): stop monitoring remote subscribers to avoid races [r58] by keynslug · Pull Request #15884 · emqx/emqx · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@keynslug
Copy link
Contributor
@keynslug keynslug commented Sep 10, 2025

Fixes EMQX-14718.

Release version: 5.8.9

Summary

Backports of #14936 + #15396 + #15518.

PR Checklist

  • For internal contributor: there is a jira ticket to track this change
  • The changes are covered with new or existing tests
  • Change log for changes visible by users has been added to changes/ee/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • Schema changes are backward compatible

@keynslug keynslug force-pushed the fix/EMQX-14477/r58/nomon-remote-sharedsub branch 2 times, most recently from 2f7a050 to 46420f3 Compare September 15, 2025 16:19
@codecov-commenter
Copy link
codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 97.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.30%. Comparing base (4bc95de) to head (a62e6a4).
⚠️ Report is 65 commits behind head on release-58.

Files with missing lines Patch % Lines
apps/emqx/src/emqx_router_helper.erl 93.75% 2 Missing ⚠️
apps/emqx/src/emqx_shared_sub.erl 97.67% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           release-58   #15884    +/-   ##
============================================
  Coverage       82.29%   82.30%            
============================================
  Files            1020     1021     +1     
  Lines           71664    71792   +128     
============================================
+ Hits            58978    59086   +108     
- Misses          12686    12706    +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Server `emqx_shared_sub` manages its own monitors and runs cleanups.
Moreover, during high disconnect rate it can be overloaded with
DOWN signals, and `unsubscribe/3` call can easily timeout.
_Orphaned_ node here means a node that is marked as "routing node" in
the respective table but cluster membership does not know about it.
Which uncovers an oversight in `emqx_router_helper` implementation: it
is too eager to purge "leaving" nodes even though by the time it receives
respective membership event, a node has essentially just started leaving,
but not finished.
And don't react to membership events right away.
Which brings in force-leave announcements to make `emqx_router_helper` react
quicker to cluster membership changes.
Prior to this changes, multiple nodes could in general react to shared
subscriber going down. If, due to load skew, remote node reacted to this
event earlier than local node, leftover state started to accumulate:
this would have eventually manifested in global routing table
inconsistencies.
This is temporary in a sense, `emqx_router_helper` needs to be refactored
into a generic component responsible for cleaning up dead nodes' stuff.
As they are taken care of by `emqx_router_helper` already anyway.
This change:
* Helps to avoid full-scans that were happening in `cleanup_down/1`.
* Simplifies things.
@keynslug keynslug force-pushed the fix/EMQX-14477/r58/nomon-remote-sharedsub branch from 46420f3 to 00d6219 Compare September 16, 2025 16:32
@keynslug keynslug marked this pull request as ready for review September 17, 2025 08:38
@keynslug keynslug requested review from a team and lafirest as code owners September 17, 2025 08:38
@keynslug keynslug merged commit 5239d99 into emqx:release-58 Sep 17, 2025
732 of 736 checks passed
@keynslug key 9C97 nslug deleted the fix/EMQX-14477/r58/nomon-remote-sharedsub branch September 17, 2025 13:56
@emqxqa
Copy link
emqxqa commented Sep 19, 2025

TestExecution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

0