2026-05-18 · claude-opus-4-6

pg_stash_advice: Crash Loop from Overlong Stash Name in Persistence Dump File

Executive Summary

This thread reports a serious availability bug in the pg_stash_advice extension (apparently a new contrib module or proposed patch for PostgreSQL): a malformed persistence file can put a PostgreSQL cluster into an unrecoverable crash-restart loop. The bug is a classic input validation gap where a persistence/deserialization path bypasses safety checks that exist on the normal SQL-level API path. While the bug itself is straightforward, it raises important architectural concerns about background worker resilience, persistence file parsing robustness, and the interaction between dshash key constraints and assertion builds.

The Core Problem

What is pg_stash_advice?

pg_stash_advice appears to be a background worker-based module that manages named "stashes" of advisory data (likely optimizer hints or plan-related advice). It uses PostgreSQL's dynamic shared hash (dshash) infrastructure to store stash entries keyed by name, and persists its state to a TSV file ($PGDATA/pg_stash_advice.tsv) so stashes survive restarts.

The Input Validation Gap

The module has two paths that accept stash names:

SQL-level API path: Calls pgsa_check_stash_name(), which validates that the name length is less than NAMEDATALEN (64 bytes, with 63 usable characters). This is the well-guarded front door.
Persistence restore path: The background worker's startup sequence calls pgsa_read_from_disk() → pgsa_restore_stashes() → pgsa_create_stash(), which parses stash names from the TSV file and feeds them directly into dshash_find_or_insert_extended(). This path lacks the length validation.

The dshash Assertion

The dshash table is configured with key_size = NAMEDATALEN. When dshash_find_or_insert_extended() is called, it invokes dshash_strhash() to hash the key. This function contains the assertion:

Assert(strlen((const char *) v) < size);

A stash name of 64+ bytes violates this assertion, causing SIGABRT.

Why This Becomes a Crash Loop

This is where the bug escalates from "crash" to "denial of service against the entire cluster":

The background worker crashes with SIGABRT (signal 6).
The postmaster interprets any background worker crash as a potential shared memory corruption event (this is standard PostgreSQL behavior — the postmaster cannot distinguish between a benign assertion failure and actual memory corruption).
The postmaster initiates full crash recovery: it terminates all backends, reinitializes shared memory, and restarts.
The background worker restarts, reads the same corrupt file, and crashes again.
This cycle repeats indefinitely (~1.5 crashes/second based on the reporter's data).

The cluster is effectively bricked until an administrator manually removes or edits $PGDATA/pg_stash_advice.tsv. This is a significant operational concern because:

No SQL-level intervention is possible (all backends are killed each cycle).
The fix requires filesystem-level access to $PGDATA.
The log messages may scroll rapidly, making diagnosis difficult.
In containerized/managed environments, filesystem access may not be straightforward.

Assertion vs. Non-Assertion Build Behavior

The reporter astutely notes the divergent behavior:

Assert builds (development/CI): Hard crash via SIGABRT, crash loop.
Production builds (no assertions): The dshash key-copy path silently truncates the name to NAMEDATALEN-1 bytes. This avoids the crash but introduces a silent correctness bug — distinct long names sharing a 63-byte prefix would hash-collide and be treated as the same stash. This is arguably worse in some ways, as it creates subtle data corruption without any visible error.

The Proposed Fix

The patch validates the parsed stash-name length inside the persistence file parser, at the same location where other syntax errors already raise ERRCODE_DATA_CORRUPTED. This mirrors the existing pgsa_check_stash_name() check used by the SQL API path.

This is the correct architectural approach because:

Defense in depth: Validation belongs at the deserialization boundary, not just the API boundary.
Consistent error handling: The parser already has an error-reporting framework for corrupt files; this is a natural extension.
Fail-safe behavior: A corrupted/tampered file should produce a clear error and allow the cluster to continue operating (possibly without the stash data), not crash-loop.

The patch also includes a TAP test that presumably creates a file with an overlong name and verifies the server starts cleanly (or produces an appropriate error without crashing).

Architectural Implications and Broader Lessons

Background Worker Crash Semantics

This bug highlights a known architectural tension in PostgreSQL: the postmaster's crash recovery response to background worker failures is a blunt instrument. Any background worker that can be deterministically crashed by persistent state (on-disk files, catalog data, etc.) can brick a cluster. This is a design consideration that all background worker authors must internalize — startup code must be extraordinarily defensive because failures there are catastrophic.

The PostgreSQL community has discussed (in other threads) whether background workers should have a "non-critical" designation that would allow them to fail without triggering full crash recovery, but this remains unresolved.

Persistence File Security Model

The TSV persistence file sits in $PGDATA, so it has the same trust boundary as other PostgreSQL data files — it's trusted to the extent that the filesystem is trusted. However, the file could be corrupted by:

Disk errors
Incomplete writes during crash
Manual editing (administrative mistake)
Symlink attacks (in misconfigured environments)

Any persistence mechanism for background workers must treat its own files as potentially hostile input, which is the lesson this bug reinforces.

The NAMEDATALEN Boundary

The NAMEDATALEN constant (default 64) is one of PostgreSQL's most pervasive compile-time limits. The fact that the threshold is exact (63 bytes OK, 64 bytes crash) reflects the off-by-one nature of C string handling with fixed buffers — NAMEDATALEN includes the null terminator, so the maximum usable length is NAMEDATALEN - 1 = 63.

Assessment

This is a well-reported, clearly analyzed bug with a straightforward fix. The reporter demonstrates solid understanding of the crash propagation mechanics and the difference between assertion and non-assertion builds. The fix is correctly scoped — it addresses the root cause (missing validation in the parser) rather than papering over symptoms. The inclusion of a TAP test is good practice.

The main open question is whether the module should have additional hardening:

Should the worker catch ERRCODE_DATA_CORRUPTED and continue without the persisted data rather than failing to start?
Should there be a startup_cost/retry delay to mitigate crash-loop velocity if other bugs exist?
Should the file be atomically written (write-then-rename) to prevent partial-write corruption?

These are design questions that reviewers of the pg_stash_advice module itself may want to address.

pg_stash_advice: dump file with overlong stash name crashes worker in a restart loop

Latest Update

Incremental Update: Commit and Follow-up Discussion

Resolution: Patch Committed (Simplified)

New Topic: pg_set_stashed_advice() NULL-Handling Design Questions

Haas's Responses

Assessment

pg_stash_advice: Crash Loop from Overlong Stash Name in Persistence Dump File

Executive Summary

The Core Problem

What is pg_stash_advice?

The Input Validation Gap

The dshash Assertion

Why This Becomes a Crash Loop

Assertion vs. Non-Assertion Build Behavior

The Proposed Fix

Architectural Implications and Broader Lessons

Background Worker Crash Semantics

Persistence File Security Model

The NAMEDATALEN Boundary

Assessment