Bound memory usage during manual slot sync retries

First seen: 2026-05-15 05:32:43+00:00 · Messages: 16 · Participants: 5

Latest Update

2026-06-01 · claude-opus-4-6

Bound Memory Usage During Manual Slot Sync Retries — May 2026 Summary

Problem & Resolution

The pg_sync_replication_slots() SQL function had an unbounded memory growth problem during its internal retry loop. Unlike the background slotsync worker (which gets natural transaction boundaries between retry cycles), the manual SQL function executes its entire retry loop within a single transaction, causing per-cycle allocations to accumulate indefinitely.

The existing cleanup (list_free_deep(remote_slots)) only freed List cells and top-level RemoteSlot structs, leaving behind strings (slot name, plugin name, database name), quote_literal_cstr() results, TextDatumGetCString() results, standalone TupleTableSlot objects, and list containers from get_local_synced_slots().

Measurements

With 100 failover logical slots:

Committed Solution

A per-retry memory context (cycle_ctx) that wraps each retry iteration and is reset before each new cycle, deterministically reclaiming all per-cycle allocations. The v2 patch was committed by Amit Kapila.

Key design points of the final patch:

  1. Per-retry memory context over individual pfree() calls — future-proof and self-documenting
  2. Slot names needed across retries are explicitly copied into the parent context before cycle context reset
  3. Explicit ExecDropSingleTupleTableSlot() for semantic clarity at ownership boundaries
  4. No list_free() in drop_local_obsolete_slots() — removed in v2 per community consensus to rely on memory context management rather than scattered retail freeing in helpers

Design Discussions Resolved

History (1 prior analysis)
2026-06-01 · claude-opus-4-6

Post-Commit Review and Minor Style Discussion

The thread continued after the patch was committed, with a new participant providing independent validation and a brief style debate that concluded with no further changes needed.

Independent Verification by Khoa Nguyen

Khoa Nguyen (kdnguyen9.oss@gmail.com) provided post-commit validation using the measure_slotsync_memory.sh script from the patch. The data confirms:

  • Pre-patch: Memory grows in jumps (65 KiB and 131 KiB deltas observed over ~12 seconds)
  • Post-patch: Memory remains completely flat (0 delta across all sample points)

This independently corroborates the author's original measurements.

Style Suggestion: oldctx Placement and Naming

Khoa suggested capturing oldctx once outside the retry loop (rather than re-capturing it at the top of each iteration), arguing this better reflects the parent/child memory context relationship rather than a "previous vs. current" context swap. Xuneng Zhou agreed and additionally suggested renaming oldctx to outerctx to make the semantic relationship clearer.

Hou Zhijie's Pushback

Hou Zhijie was -1 on the proposed change, arguing:

  • The refactored version assumes all code after creating sync_retry_ctx runs in a fixed outer context, which is fragile — it could break if someone later inserts a temporary context wrapping the loop body.
  • The current pattern of capturing oldctx at the point of MemoryContextSwitchTo() is more defensive.

Resolution: Status Quo Maintained

Xuneng Zhou analyzed both patterns in the existing tree and concluded the fixed-context-outside-loop pattern is preferred when the saved context has a named semantic role (like "result context" or "caller context") and the code switches to it only for allocations that must specifically live there. The slotsync case doesn't meet these conditions — the requirement is just that slot_names survive sync_retry_ctx resets, not that they specifically need "the context active before sync_retry_ctx was created." All parties appear satisfied with keeping the committed code as-is.