Fix stale PID race in ConsumerRegistry.unregister_name/1 by alco · Pull Request #3979 · electric-sql/electric

alco · 2026-03-09T12:01:03Z

Summary

Replace the no-op unregister_name/1 with an atomic :ets.match_delete/2 that removes the ETS entry only if the pid matches the calling (dying) process
A naive lookup-then-delete approach is still racey because the entry can be overwritten between the two ETS calls; match_delete is atomic
Add unit tests covering: basic cleanup, skip-if-different-pid, and the full race scenario (old process unregisters after replacement has registered)

Closes #3864

Test plan

mix test test/electric/shapes/consumer_registry_test.exs — 16 tests pass (3 new)
Full test suite passes in CI

🤖 Generated with Claude Code

pkg-pr-new · 2026-03-09T12:02:20Z

Open in StackBlitz

npm i https://pkg.pr.new/@electric-sql/react@3979

npm i https://pkg.pr.new/@electric-sql/client@3979

npm i https://pkg.pr.new/@electric-sql/y-electric@3979

commit: ebdb61c

codecov · 2026-03-09T12:03:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.74%. Comparing base (c613cc5) to head (c622202).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@                       Coverage Diff                       @@
##           refactor/static-logger-strings    #3979   +/-   ##
===============================================================
  Coverage                           88.74%   88.74%           
===============================================================
  Files                                  25       25           
  Lines                                2425     2425           
  Branches                              609      613    +4     
===============================================================
  Hits                                 2152     2152           
  Misses                                271      271           
  Partials                                2        2

Flag	Coverage Δ
packages/experimental	`87.73% <ø> (ø)`
packages/react-hooks	`86.48% <ø> (ø)`
packages/start	`82.83% <ø> (ø)`
packages/typescript-client	`93.93% <ø> (ø)`
packages/y-electric	`56.05% <ø> (ø)`
typescript	`88.74% <ø> (ø)`
unit-tests	`88.74% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

blacksmith-sh · 2026-03-09T12:03:59Z

Found 3 test failures on Blacksmith runners:

Failures

Test	View Logs
`at line 42 in shell electric/at line 42 in shell electric`	View Logs
`at line 45 in shell electric_1/at line 45 in shell electric_1`	View Logs
`at line 97 in shell electric/at line 97 in shell electric`	View Logs

…3964) (#3968) This reverts commit cea4c13. Possibly temporarily. Since this change is likely to have adverse effects on the memory footprint, we've decided to test things out in Cloud without it first.

## Summary Add [TanStack Intent](https://github.com/TanStack/intent) skills to `@electric-sql/client` and `@electric-sql/y-electric` so AI coding agents can work with Electric correctly out of the box. Ships 9 skills covering the full developer journey from shape setup to production deployment. ## Approach Skills were generated using the `@tanstack/intent scaffold` workflow: 1. **Domain Discovery** — analyzed source code, docs, and maintainer interview to map 5 domains and 35 failure modes 2. **Tree Generator** — chose flat structure across 2 packages (8 skills in typescript-client, 1 in y-electric) 3. **Skill Generation** — wrote SKILL.md files with setup patterns, code examples verified against source, and wrong/correct mistake pairs Each skill targets a specific task (not a concept), with code examples validated against the current source at `v1.5.10`. ### Skills | Skill | Package | Type | Key coverage | |-------|---------|------|-------------| | `electric-shapes` | client | core | ShapeStream/Shape config, onError contract, WHERE clauses, type parsers | | `electric-proxy-auth` | client | core | Server proxy setup, ELECTRIC_PROTOCOL_QUERY_PARAMS, secret injection | | `electric-schema-shapes` | client | core | Schema + shape co-design, REPLICA IDENTITY FULL, replica modes | | `electric-debugging` | client | lifecycle | Error retry behavior, MissingHeadersError, fast-loop detection | | `electric-deployment` | client | lifecycle | Docker Compose, env vars, health checks, replication slot cleanup | | `electric-new-feature` | client | lifecycle | Full 5-step setup: schema → proxy → collection → queries → mutations | | `electric-orm` | client | composition | Drizzle/Prisma txid patterns, migration with REPLICA IDENTITY | | `electric-postgres-security` | client | security | Pre-deploy checklist: roles, grants, WAL, connection, publication | | `electric-yjs` | y-electric | composition | ElectricProvider, BYTEA parsing, resume state, awareness | ### Non-goals - No framework-specific skills (TanStack DB handles React/Vue/etc.) - No multi-shape-sync skill (experimental feature, may be removed) - No runtime code changes ### CI Workflows - `validate-skills.yml` — validates SKILL.md files on PRs touching `skills/` - `check-skills.yml` — checks for stale skills on release, opens review PR - `notify-intent.yml` — notifies TanStack/intent when source or docs change ## Verification ```bash cd packages/typescript-client && npx @tanstack/intent validate skills/ cd packages/y-electric && npx @tanstack/intent validate skills/ ``` Both pass with no errors or warnings. ## Files changed - `packages/typescript-client/skills/` — 8 SKILL.md files + 2 reference files (WHERE clause syntax, type parsers) - `packages/y-electric/skills/` — 1 SKILL.md file - `packages/*/package.json` — added `skills` + `bin` to `files`, `@tanstack/intent` devDependency, `bin` entry - `packages/*/bin/intent.mjs` — intent CLI shim - `.github/workflows/` — 3 new workflow files - `.gitignore` — added `_artifacts` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated. # Releases ## @electric-sql/experimental@6.0.12 ### Patch Changes - Updated dependencies [d1e08b8] - @electric-sql/client@1.5.12 ## @electric-sql/react@1.0.41 ### Patch Changes - Updated dependencies [d1e08b8] - @electric-sql/client@1.5.12 ## @electric-sql/client@1.5.12 ### Patch Changes - d1e08b8: Add TanStack Intent skills for AI agent guidance. Ships 9 skills covering shapes, proxy auth, schema design, debugging, deployment, new feature setup, ORM integration, Postgres security, and Yjs collaboration. ## @electric-sql/y-electric@0.1.38 ### Patch Changes - d1e08b8: Add TanStack Intent skills for AI agent guidance. Ships 9 skills covering shapes, proxy auth, schema design, debugging, deployment, new feature setup, ORM integration, Postgres security, and Yjs collaboration. - Updated dependencies [d1e08b8] - @electric-sql/client@1.5.12 ## expo-db-electric-starter@1.0.13 ### Patch Changes - Updated dependencies [d1e08b8] - @electric-sql/client@1.5.12 ## @core/sync-service@1.4.13 ### Patch Changes - 8daa822: Index `= ANY(array_field)` and `IN (const_list)` WHERE clause expressions for O(1) shape filtering. ANY clauses reuse the InclusionIndex (via single-element array containment), and IN clauses reuse the EqualityIndex (registering each value separately). At 1000 concurrent shapes, fan-out latency improves by 6x (ANY) and 15x (IN) compared to the previous linear scan. Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

https://deploy-preview-3973--electric-next.netlify.app/blog/2026/03/06/agent-skills-now-shipping ## Summary - Announces that Electric, TanStack DB, and Durable Streams npm packages now ship with agent skills - Covers why versioned skills matter (training data staleness, new libraries absent from models) - Points readers to the kpb repo to try it out and the feedback skill to report issues ## Test plan - [ ] Preview the blog post on the website - [ ] Verify hero image renders correctly - [ ] Check all links resolve 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## Summary - Announces that Electric, TanStack DB, and Durable Streams npm packages now ship with agent skills via the TanStack Intent system - Covers why versioned skills matter (training data staleness, new libraries absent from models) - Points readers to the Playbook repo to try it out and the feedback skill to report issues ## Test plan - [ ] Preview the blog post on the website - [ ] Verify hero image renders correctly - [ ] Check all links resolve (no dead links) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## Summary - Adds a `CLAUDE.md` symlink pointing to `AGENTS.md` in `packages/sync-service` so that Claude Code automatically loads the sync-service guidelines into context. ## Test plan - [x] Verify symlink resolves correctly: `readlink packages/sync-service/CLAUDE.md` → `AGENTS.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code)

The previous no-op implementation left stale PIDs in the ETS table. A naive lookup-then-delete fix would still be racey because the entry could be overwritten between the lookup and delete. Using :ets.match_delete/2 atomically deletes the entry only if the pid matches the calling (dying) process. Closes #3864 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-09T12:36:42Z

Claude Code Review

Summary

This PR fixes the long-standing race in ConsumerRegistry.unregister_name/1 by replacing the deliberate no-op with an atomic :ets.match_delete/2 call that only removes the ETS entry if it still belongs to the dying process. It also simplifies consumer initialization by removing the redundant :shape pass-through from ShapeCache and directly fetching from ShapeStatus. Three new tests validate the race scenario end-to-end.

What is Working Well

Choosing match_delete over lookup+delete: The issue proposed a two-step lookup/delete which still has a TOCTOU window. The PR improves on this by using :ets.match_delete/2 which is atomic -- no window for a replacement consumer entry to be accidentally removed.
self() correctness: unregister_name/1 is called in the dying process context by OTP cleanup machinery, so self() correctly identifies the dying PID. The comment makes this non-obvious but accurate decision explicit.
Double-removal safety: For the {:shutdown, :cleanup} and {:shutdown, :suspend} paths, remove_consumer already deletes the ETS entry before unregister_name fires. The match_delete on a missing key is a no-op, so no double-removal crash is possible.
Consumer init simplification: Removing the :shape pass-through from ShapeCache.start_opts and always fetching fresh from ShapeStatus is cleaner. The old Map.get_lazy fallback was essentially always used (the cached value was redundant), so the simplification is safe.
Test quality: The three new tests cover the key scenarios well, especially the race scenario test that simulates the exact OTP lifecycle sequence (old process registers, replacement overwrites, old process calls unregister).

Issues Found

Important (Should Fix)

1. Error message regression in consumer.ex:124

File: packages/sync-service/lib/electric/shapes/consumer.ex:124

Issue: The old fetch_shape_by_handle! raised ArgumentError with the shape handle in the message. The new bare {:ok, shape} = produces a MatchError: no match of right hand side value: :error -- the shape handle is absent from the error message.

Impact: If a consumer races with shape deletion (shape removed between consumer spawn and handle_continue execution), the crash log is harder to diagnose. Crash reports in Sentry/telemetry aggregate on the error term, not the GenServer state.

Suggested fix: Explicitly handle the :error case, e.g.:

case ShapeCache.ShapeStatus.fetch_shape_by_handle(stack_id, shape_handle) do
  {:ok, shape} ->
    state = State.initialize_shape(state, shape, config)
    # ... continue init ...
  :error ->
    Logger.warning("Shape handle not found during consumer init, stopping",
      shape_handle: shape_handle)
    {:stop, :normal, state}
end

Or at minimum preserve a descriptive raise so the handle appears in crash logs.

2. Missing changeset entry for the ConsumerRegistry race fix

Issue: The 1.4.13 changelog entry (from the base branch) mentions the ANY/IN WHERE clause index fix, but the ConsumerRegistry race fix from this PR has no corresponding entry. When this branch reaches main, the fix will be invisible in the release history.

Suggested fix: Add a .changeset/*.md file documenting the fix so it appears in the next release CHANGELOG.

Suggestions (Nice to Have)

3. Issue #3864 code path #3 not addressed

The linked issue identified three affected code paths. This PR addresses paths #1 (stale PID blocking await_snapshot_start) and #2 (silent message loss via stale PIDs in publish/2). Code path #3 -- restore_shape_and_dependencies/4 trusting stale dead PIDs without a Process.alive? check -- is untouched. PR #3975 handles runtime dead-PID recovery, but it is worth noting in the PR description whether #3 is intentionally deferred.

4. Unrelated files inflate the diff

The diff includes GitHub Actions workflows (check-skills.yml, notify-intent.yml, validate-skills.yml), SKILL.md files, bin/intent.mjs, and package version bumps for react-hooks, experimental, y-electric, and typescript-client -- all from the base branch (refactor/static-logger-strings). A note in the PR description pointing reviewers to the sync-service changes would reduce review noise.

Issue Conformance

Issue #3864 is well-specified with root cause analysis, three affected code paths, and a proposed fix. This PR implements a strictly better version of the proposed fix (atomic match_delete vs. the TOCTOU-prone lookup+delete in the issue comment). Scope is appropriately narrow.

Previous Review Status

No previous review.

Review iteration: 1 | 2026-03-09

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alco added the claude label Mar 9, 2026

alco and others added 7 commits March 9, 2026 13:28

Revert "fix(sync-service): Pass shape directly to consumer process (#…

b8777d5

…3964) (#3968) This reverts commit cea4c13. Possibly temporarily. Since this change is likely to have adverse effects on the memory footprint, we've decided to test things out in Cloud without it first.

alco force-pushed the alco/stale-consumer-registry branch from dbb2f9f to 44ce5d6 Compare March 9, 2026 12:28

Add changeset entry for ConsumerRegistry race fix

c622202

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This was referenced Mar 9, 2026

FlushTracker has no active mechanism to detect shapes that stop reporting flush progress #3980

Open

Stale ConsumerRegistry and ShapeStatus entries after consumer exits with unhandled reasons #3864

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stale PID race in ConsumerRegistry.unregister_name/1#3979

Fix stale PID race in ConsumerRegistry.unregister_name/1#3979
alco wants to merge 8 commits intorefactor/static-logger-stringsfrom
alco/stale-consumer-registry

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Summary

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Failures

Uh oh!

Uh oh!

Claude Code Review

Summary

What is Working Well

Issues Found

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Previous Review Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants