Open
Description
Parent: #3980
Scenario
A consumer process is killed in a way that bypasses the terminate/2 callback:
- :kill signal: Even with Process.flag(:trap_exit, true) (consumer.ex:108), a :kill signal terminates the process immediately without calling terminate. From Erlang docs: "If the process receives a kill signal, it terminates, regardless of the trap_exit flag."
- OOM killer: The OS kills the BEAM process or the process is killed by the VM's memory limits.
- :brutal_kill supervisor shutdown: If a supervisor is configured with shutdown: :brutal_kill, child processes receive :kill.
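The difference between an orderly stop and a :kill can be reproduced in isolation. The following is a minimal sketch, not the real consumer: KillDemo and KillDemo.Consumer are hypothetical names, and the only behaviour relied on is standard GenServer and Process semantics.

```elixir
defmodule KillDemo.Consumer do
  # Hypothetical stand-in for a consumer; NOT the real consumer.ex.
  use GenServer

  # Started unlinked so that killing it does not affect the caller.
  def start(report_to), do: GenServer.start(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    # Same flag the scenario mentions: trap exits so terminate/2 can run
    # on orderly shutdowns.
    Process.flag(:trap_exit, true)
    {:ok, report_to}
  end

  @impl true
  def terminate(reason, report_to) do
    # Only reached when the process is stopped in an orderly way.
    send(report_to, {:terminate_ran, reason})
    :ok
  end
end

defmodule KillDemo do
  # Starts a fresh consumer, stops it with `stop_fun`, waits for it to be
  # gone, and reports whether terminate/2 ran.
  def terminate_ran?(stop_fun) do
    {:ok, pid} = KillDemo.Consumer.start(self())
    ref = Process.monitor(pid)
    stop_fun.(pid)

    receive do
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      1_000 -> raise "consumer did not exit"
    end

    # Anything terminate/2 sent was sent before the exit, so it is
    # already in the mailbox by the time :DOWN has been received.
    receive do
      {:terminate_ran, _reason} -> true
    after
      0 -> false
    end
  end
end
```

Calling KillDemo.terminate_ran?(&GenServer.stop(&1, :shutdown)) exercises the orderly path, while KillDemo.terminate_ran?(&Process.exit(&1, :kill)) exercises the path described in this issue.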
What happens
- A transaction arrives. ShapeLogCollector.publish → ConsumerRegistry.publish → broadcast delivers the event to the consumer.
- The consumer processes the event, replies :ok. FlushTracker.handle_txn_fragment records the shape in last_flushed and min_incomplete_flush_tree.
- The consumer is killed (:kill signal, OOM, etc.) before the storage flush callback fires and notify_flushed is called.
- terminate/2 does NOT run. No cleanup happens. No handle_writer_termination, no remove_shape_async, no FlushTracker.handle_shape_removed.
- The shape's entry in FlushTracker becomes the permanent minimum, blocking last_global_flushed_offset from advancing.
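Why a single orphaned entry pins the global offset can be illustrated with a toy model. This is a deliberately simplified sketch and not Electric's FlushTracker: ToyFlushTracker, its fields, integer offsets, and the "minimum pending offset minus one" rule are all assumptions made for illustration.

```elixir
defmodule ToyFlushTracker do
  # Hypothetical, simplified model; NOT the real FlushTracker.
  # `pending` maps shape_id => oldest offset delivered but not yet flushed.
  defstruct last_seen: 0, pending: %{}

  def new, do: %__MODULE__{}

  # A transaction at `offset` was delivered to the consumers of `shape_ids`.
  # put_new keeps the oldest unflushed offset for shapes already pending.
  def delivered(%__MODULE__{} = t, offset, shape_ids) do
    pending = Enum.reduce(shape_ids, t.pending, &Map.put_new(&2, &1, offset))
    %{t | last_seen: offset, pending: pending}
  end

  # The shape's storage reports that everything delivered so far is flushed.
  def notify_flushed(%__MODULE__{} = t, shape_id),
    do: %{t | pending: Map.delete(t.pending, shape_id)}

  # Cleanup path that normally runs when a consumer terminates in an
  # orderly way; in this toy model it simply drops the shape's entry.
  def shape_removed(%__MODULE__{} = t, shape_id),
    do: %{t | pending: Map.delete(t.pending, shape_id)}

  # Everything strictly below the oldest pending offset counts as flushed.
  def global_flushed(%__MODULE__{pending: pending, last_seen: last_seen}) do
    if map_size(pending) == 0 do
      last_seen
    else
      (pending |> Map.values() |> Enum.min()) - 1
    end
  end
end
```

In this model, if shape "b" receives offset 10 and its consumer is killed before notify_flushed, global_flushed stays at 9 no matter how many later transactions other shapes flush. It only moves again once something calls shape_removed for "b", which is the step that never happens when terminate/2 is bypassed.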
Why existing fixes don't help
- #3864, "Stale ConsumerRegistry and ShapeStatus entries after consumer exits with unhandled reasons" (fixing handle_writer_termination clause 3): Irrelevant, because terminate never runs, so handle_writer_termination is never called.
- #3975, "fix(sync-service): recover from dead consumers blocking WAL flush advancement" (broadcast-time recovery): Only detects dead consumers at the next publish call. The FlushTracker entry from the previous successful delivery remains stuck. Even if a replacement consumer is started for the next transaction, notify_flushed for the old transaction's offset will never arrive.
Fix
This scenario can only be addressed by an active detection mechanism in ShapeLogCollector (see parent issue #3980). The terminate callback path is insufficient by definition — no amount of improvement to terminate or handle_writer_termination can help when the callback doesn't execute.
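One possible shape for such active detection is a process monitor, since a :DOWN message is delivered to the monitoring process even when the target dies from :kill. The sketch below is an assumption, not the design from #3980: ConsumerWatcher, watch/3, and the on_down callback are hypothetical names, and the callback stands in for whatever cleanup ShapeLogCollector would actually perform.

```elixir
defmodule ConsumerWatcher do
  # Hypothetical sketch; module, function and callback names are illustrative.
  use GenServer

  # `on_down` is a 2-arity function called as on_down.(shape_id, reason).
  def start_link(on_down), do: GenServer.start_link(__MODULE__, on_down)

  # Begin watching `consumer_pid` on behalf of `shape_id`.
  def watch(watcher, shape_id, consumer_pid),
    do: GenServer.call(watcher, {:watch, shape_id, consumer_pid})

  @impl true
  def init(on_down), do: {:ok, %{on_down: on_down, shapes: %{}}}

  @impl true
  def handle_call({:watch, shape_id, pid}, _from, state) do
    # A monitor, unlike terminate/2, does not depend on the watched
    # process getting a chance to run any code before it dies.
    ref = Process.monitor(pid)
    {:reply, :ok, %{state | shapes: Map.put(state.shapes, ref, shape_id)}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    {shape_id, shapes} = Map.pop(state.shapes, ref)
    # Ignore :DOWN messages for monitors this process does not know about.
    if shape_id, do: state.on_down.(shape_id, reason)
    {:noreply, %{state | shapes: shapes}}
  end
end
```

With this in place, killing a watched process via Process.exit(pid, :kill) results in on_down being called with the shape id and the exit reason :killed, which gives the owner of the flush-tracking state a point at which to drop the dead shape's entry.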