2026-05-18 · claude-opus-4-6

libpq Async API Deadlock with SSL Records >8kB

The Core Problem

The libpq asynchronous API has a fundamental architectural flaw: it assumes that once pqReadData() has been called and no complete protocol message is available, the application should poll the underlying socket for readability before trying again. This assumption holds for plain TCP, where kernel buffers are the only source of pending data. It fails with TLS and GSS encryption, because these layers buffer data in userspace, where socket polling cannot see it.

Why This Causes Deadlocks

The failure mode is a classic producer-consumer deadlock:

The server sends a large protocol message (e.g., a 32KB DataRow) split across multiple TLS records
libpq's pqReadData() calls SSL_read(), which decrypts one TLS record and returns as much of it as fits in the space the caller offered
The data returned so far doesn't form a complete protocol message, so the caller (e.g., PQconsumeInput()) tells the application "I'm busy, poll the socket"
Critical bug: OpenSSL may still hold bytes that this SSL_read() call did not return, such as the tail of a record larger than the space offered or, with readahead enabled, later records already pulled off the socket. These bytes have left the kernel socket buffer, so they are invisible to poll()/select() on the raw socket.
The server has sent all its data and is waiting for the client's next request
The client is waiting for socket readability that will never come → deadlock
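
The sequence above can be reduced to a small model. The sketch below is not libpq or OpenSSL code: ModelTransport, model_read() and the other names are hypothetical, and the model assumes that one read call pulls a whole record out of the kernel buffer and hands back only what fits in the caller's space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an encrypted transport; not libpq or OpenSSL code. */
typedef struct {
    size_t kernel_bytes;   /* bytes still in the kernel socket buffer (what poll() sees) */
    size_t pending_bytes;  /* decrypted bytes held in userspace by the TLS layer */
} ModelTransport;

/* What poll()/select() on the raw socket would report. */
int model_socket_readable(const ModelTransport *t)
{
    return t->kernel_bytes > 0;
}

/*
 * Model of one SSL_read()-like call. If nothing is pending, take one whole
 * record of record_len bytes out of the kernel buffer, then return at most
 * `space` bytes of it. Whatever does not fit stays pending in userspace.
 */
size_t model_read(ModelTransport *t, size_t record_len, size_t space)
{
    assert(t != NULL);
    if (t->pending_bytes == 0 && t->kernel_bytes >= record_len) {
        t->kernel_bytes -= record_len;
        t->pending_bytes = record_len;
    }
    size_t n = t->pending_bytes < space ? t->pending_bytes : space;
    t->pending_bytes -= n;
    return n;
}

/*
 * Returns 1 when a client that only polls the socket would wait forever:
 * data is still owed to the application, but the socket has nothing left.
 */
int model_would_deadlock(const ModelTransport *t)
{
    return t->pending_bytes > 0 && !model_socket_readable(t);
}
```

With a 12288-byte record in the kernel buffer and 8192 bytes of space, model_read() returns 8192 bytes and leaves 4096 pending while the socket reports nothing readable. When the space is at least as large as the record, nothing stays pending and the model never reports a deadlock.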

Why Standard PostgreSQL Servers Don't Trigger This

The PostgreSQL server uses OpenSSL's default maximum TLS record size of ~16KB, and libpq uses a 16KB initial input buffer. This means that in practice, a single SSL_read() call retrieves an entire TLS record, and the protocol message parser in fe-protocol3.c pre-expands the buffer (via pqCheckInBufferSpace()) to hold complete messages. This accidentally prevents the bug from manifesting.

However, alternative PostgreSQL-wire-protocol servers (AWS RDS Aurora Serverless, YugabyteDB, CockroachDB) and TLS-terminating proxies use different TLS record sizes (often 12KB or larger than 16KB), which breaks these accidental assumptions.

The Existing "Cheat" That Partially Masks the Bug

libpq's pqSocketCheck() (used by pqWait()) already checks SSL_pending() before polling the socket, which short-circuits the poll if OpenSSL has buffered data. This protects the synchronous API path. But asynchronous API users call PQsocket() and do their own polling — they never benefit from this internal protection. The cheat also doesn't exist for GSS encryption, which is why GSS connections can deadlock even with standard PostgreSQL servers under the right conditions.

Affected Code Paths

Multiple API entry points are vulnerable:

PQconsumeInput() — The most commonly reported case. Called by async clients between poll() calls.
PQconnectPoll() — Demonstrated with a protocol 2.0 error message exceeding 16KB split across TLS records during connection establishment.
PQgetResult() in non-blocking mode — At least one code path returns without fully consuming the buffer.
pqGetCopyData3() — Contains pqWait() calls without preceding pqReadData(), vulnerable even with standard OpenSSL (demonstrated by Andres with readahead enabled).
pqSendSome() in non-blocking mode — Reads data as a side effect, then returns before a pqWait().

Proposed Solutions

Patch 0001: GSS pqWait Fix (Straightforward Backport)

Adds pggss_read_pending() checking to pqSocketCheck(), mirroring the existing SSL pending check. This is characterized as an "obvious oversight" — GSS encryption was never given the same treatment as SSL in this code path.

Patch 0002: pqReadData() Drain Semantics (Architectural Fix)

The core fix ensures that pqReadData() provides the same guarantee for SSL/GSS that raw TCP inherently provides: when it returns, either (1) there are no bytes left pending in transport buffers, or (2) the socket itself is marked readable.

This is implemented via a pqDrainPending() subroutine called at the end of pqReadData() that:

Queries the transport layer for pending byte count (pqsecure_bytes_pending())
Ensures conn->inBuffer has sufficient space (expanding if needed)
Calls pqsecure_read() for exactly the pending bytes (guaranteed not to hit the socket)
Advances conn->inEnd appropriately
Repeats until no bytes remain pending
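
The five steps above can be sketched as a loop over a simplified connection. This is a minimal model, not the patch itself: ModelConn and every model_* function are hypothetical, the field names inBuffer and inEnd are borrowed from the text, and the chunk field is an assumption used to mimic a transport that reports one record's worth of pending bytes at a time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the parts of a connection that matter here. */
typedef struct {
    char       *inBuffer;    /* input buffer (named after conn->inBuffer) */
    size_t      inBufSize;   /* allocated size of inBuffer */
    size_t      inEnd;       /* end of valid data (named after conn->inEnd) */
    const char *pending;     /* model: bytes buffered inside the TLS/GSS layer */
    size_t      pendingLen;  /* model: how many such bytes remain */
    size_t      chunk;       /* model: most bytes reported pending per query */
} ModelConn;

/* Models pqsecure_bytes_pending(): bytes readable without touching the socket. */
size_t model_bytes_pending(const ModelConn *c)
{
    return c->pendingLen < c->chunk ? c->pendingLen : c->chunk;
}

/* Models pqsecure_read() when only already-buffered bytes are requested. */
size_t model_secure_read(ModelConn *c, char *dst, size_t len)
{
    size_t n = len < c->pendingLen ? len : c->pendingLen;
    memcpy(dst, c->pending, n);
    c->pending += n;
    c->pendingLen -= n;
    return n;
}

/* Grows inBuffer by doubling until `need` more bytes fit. Returns 0 on success. */
int model_ensure_space(ModelConn *c, size_t need)
{
    size_t newSize = c->inBufSize ? c->inBufSize : 1;
    while (newSize - c->inEnd < need)
        newSize *= 2;
    if (newSize != c->inBufSize) {
        char *p = realloc(c->inBuffer, newSize);
        if (p == NULL)
            return -1;
        c->inBuffer = p;
        c->inBufSize = newSize;
    }
    return 0;
}

/* Query, make room, read, advance, repeat. Returns 0 on success, -1 on OOM. */
int model_drain_pending(ModelConn *c)
{
    size_t want;
    while ((want = model_bytes_pending(c)) > 0) {
        if (model_ensure_space(c, want) != 0)
            return -1;
        size_t got = model_secure_read(c, c->inBuffer + c->inEnd, want);
        assert(got == want);  /* pending bytes must be readable in full */
        c->inEnd += got;
    }
    return 0;
}
```

Starting from an empty buffer with 10 pending bytes and chunk set to 4, the loop runs three times (4, 4, then 2 bytes), grows the buffer as needed, and ends with inEnd at 10 and nothing pending. Reading exactly the reported count is what lets the real design reuse an ordinary read call without risking a blocking socket read.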

Design Evolution

v1 (Lars's original): Loop around pqReadData() inside PQconsumeInput() — fixes the symptom but not the root cause
v2 (Jacob's approach): Drain at the pqReadData() level with transport-specific *_drain_pending() functions that write directly to conn->inBuffer
v3 (Heikki's refinement): Cleaner layering — introduces pqsecure_bytes_pending() returning a count, then uses existing pqsecure_read() to drain. Avoids having the TLS/GSS layer reach into conn->inBuffer directly.

Key Architectural Debate: OpenSSL Readahead

Andres Freund raised an important performance concern. OpenSSL's SSL_CTX_set_read_ahead(1) reduces syscalls by ~18% (5561→4556 in pgbench) by reading full TLS frames in one syscall instead of splitting header and payload reads. However:

Readahead is fundamentally incompatible with the current async architecture because it hides buffered bytes from SSL_pending() (only SSL_has_pending() reveals them)
It changes the upper bound of hidden data from "one TLS record" (~16KB) to SO_RCVBUF (potentially hundreds of KB)
Jacob argues this door should remain closed; Andres argues entrenching the "no sub-libpq buffering" assumption is a dead end for performance

This disagreement is explicitly deferred — the backpatchable fix does not need to resolve it, but Jacob adds assertions that readahead is off to pin current safety requirements.
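
The difference between the two OpenSSL queries mentioned above can be illustrated with a toy model. The struct and function names below are hypothetical. The model only encodes the documented distinction: SSL_pending() counts processed bytes, while SSL_has_pending() also reports unprocessed buffered data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two places a TLS library can hold received data. */
typedef struct {
    size_t processed;    /* decrypted bytes ready to hand to the application */
    size_t unprocessed;  /* raw bytes read ahead from the socket, not yet decrypted */
} ModelTlsBuffers;

/* Models SSL_pending(): counts processed bytes only. */
size_t model_pending(const ModelTlsBuffers *b)
{
    assert(b != NULL);
    return b->processed;
}

/* Models SSL_has_pending(): true if either kind of data is buffered. */
int model_has_pending(const ModelTlsBuffers *b)
{
    assert(b != NULL);
    return b->processed > 0 || b->unprocessed > 0;
}
```

With readahead on, a state such as { processed = 0, unprocessed = 20000 } is possible: model_pending() reports 0 while model_has_pending() reports 1. A check built only on the byte count would then conclude that polling the socket is safe even though data is already sitting in userspace.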

The Broader Async API Design Problem

Andres identified that libpq's async APIs are fundamentally difficult to use correctly:

PQconsumeInput() doesn't report whether it actually consumed anything
Documentation implies it consumes "all available input" but it only fills the internal buffer (which may be small)
PQisBusy() returns true even if unconsumed data exists in the socket
There's no way to write an event loop that avoids unnecessary poll() calls

These are deeper design issues that won't be fixed in a backport but inform the long-term direction.

Memory Impact

The drain operation can buffer at most one additional TLS record (~16KB) or GSS token (~16KB) beyond what was already being buffered. Since libpq already routinely doubles its 16KB initial buffer for large messages, this additional memory is considered negligible.

Verification Challenges

The bug is difficult to reproduce with standard PostgreSQL because the server's TLS record sizes happen to align with libpq's buffer sizes. Reproduction requires either:

A non-PostgreSQL wire-protocol server (YugabyteDB, CockroachDB, Aurora)
A TLS-terminating proxy that re-encrypts with different record sizes
A custom test harness that crafts specific TLS record/message size combinations
Modifying libpq to enable OpenSSL readahead

Heikki developed a Python-based reproducer script (psycotest.py) based on Jacob's packet-size specifications that reliably triggers the bug.

libpq: Process buffered SSL read bytes to support records >8kB on async API

Latest Update