libpq: Process buffered SSL read bytes to support records >8kB on async API

First seen: 2024-09-08 20:07:53+00:00 · Messages: 28 · Participants: 8

Latest Update

2026-05-18 · claude-opus-4-6

libpq Async API Deadlock with SSL Records >8kB

The Core Problem

The libpq asynchronous API rests on a flawed assumption: if a single pqReadData() call does not yield a complete protocol message, the application should wait for the underlying socket to become readable before trying again. That holds for plain TCP, where the kernel buffer is the only place pending data can sit, but it fails with TLS and GSS encryption, because those layers buffer data in userspace where socket polling cannot see it.

Why This Causes Deadlocks

The failure mode is a deadlock in which client and server each wait for the other:

  1. The server sends a large protocol message (e.g., a 32KB DataRow) split across multiple TLS records
  2. libpq's pqReadData() calls SSL_read(), which decrypts one TLS record and returns its contents
  3. The decrypted data doesn't form a complete protocol message, so the caller (e.g., PQconsumeInput()) tells the application "I'm busy, poll the socket"
  4. Critical bug: OpenSSL has already read subsequent TLS records from the socket into its internal buffer during the same SSL_read() call (because TCP segments don't align with TLS record boundaries). These bytes are now invisible to poll()/select() on the raw socket.
  5. The server has sent all its data and is waiting for the client's next request
  6. The client is waiting for socket readability that will never come → deadlock
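
The sequence above can be reproduced in miniature with a toy model. This is only a sketch of the mechanism, not libpq or OpenSSL code: BufferedTransport, naive_client_step, and the record sizes are hypothetical names and values chosen for illustration.

```python
from collections import deque


class BufferedTransport:
    """Toy stand-in for an encryption layer that buffers in userspace."""

    def __init__(self, records):
        self.socket_queue = deque(records)  # records still "in the kernel"
        self.decrypted = deque()            # records held in userspace

    def socket_readable(self):
        # What poll()/select() on the raw socket would report.
        return bool(self.socket_queue)

    def pending(self):
        # Bytes buffered in userspace, invisible to the socket poll.
        return sum(len(r) for r in self.decrypted)

    def read_one_record(self):
        # One read pulls everything off the socket (TCP segments do not
        # align with record boundaries) but returns a single record.
        while self.socket_queue:
            self.decrypted.append(self.socket_queue.popleft())
        return self.decrypted.popleft() if self.decrypted else b""


def naive_client_step(transport, inbuf, message_len):
    """Read once, then decide what to do by looking at the raw socket only."""
    inbuf.extend(transport.read_one_record())
    if len(inbuf) >= message_len:
        return "message complete"
    if transport.socket_readable():
        return "socket readable, read again"
    return "waiting on socket"
```

With a 32768-byte message split across two 16384-byte records, one call to naive_client_step() leaves the second record in the userspace buffer and reports "waiting on socket", even though pending() is still 16384. Nothing further will arrive on the socket, so the wait never ends.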

Why Standard PostgreSQL Servers Don't Trigger This

The PostgreSQL server uses OpenSSL's default maximum TLS record size of ~16KB, and libpq uses a 16KB initial input buffer. This means that in practice, a single SSL_read() call retrieves an entire TLS record, and the protocol message parser in fe-protocol3.c pre-expands the buffer (via pqCheckInBufferSpace()) to hold complete messages. This accidentally prevents the bug from manifesting.

However, alternative PostgreSQL-wire-protocol servers (AWS RDS Aurora Serverless, YugabyteDB, CockroachDB) and TLS-terminating proxies use different TLS record sizes (often 12KB or larger than 16KB), which breaks these accidental assumptions.

The Existing "Cheat" That Partially Masks the Bug

libpq's pqSocketCheck() (used by pqWait()) already checks SSL_pending() before polling the socket, which short-circuits the poll if OpenSSL has buffered data. This protects the synchronous API path. But asynchronous API users call PQsocket() and do their own polling — they never benefit from this internal protection. The cheat also doesn't exist for GSS encryption, which is why GSS connections can deadlock even with standard PostgreSQL servers under the right conditions.
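
The pending-before-poll idea can be sketched outside libpq as follows. This is an illustration only, not the pqSocketCheck() implementation. It relies on Python's standard ssl.SSLSocket.pending() method, which reports bytes that are already decrypted and available for reading; wait_readable is a hypothetical helper name.

```python
import select


def wait_readable(sock, timeout=None):
    """Return True if a read on `sock` can make progress without blocking.

    `sock` may be a plain socket or any object with fileno() and,
    optionally, a pending() method such as ssl.SSLSocket provides.
    """
    pending = getattr(sock, "pending", None)
    if callable(pending) and pending() > 0:
        # The userspace buffer already holds data: skip the poll entirely.
        return True
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)
```

An application that polls the raw descriptor itself never runs a check like the first branch, which is the gap asynchronous API users fall into.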

Affected Code Paths

Multiple API entry points are vulnerable:

  1. PQconsumeInput() — The most commonly reported case. Called by async clients between poll() calls.
  2. PQconnectPoll() — Demonstrated with a protocol 2.0 error message exceeding 16KB split across TLS records during connection establishment.
  3. PQgetResult() in non-blocking mode — At least one code path returns without fully consuming the buffer.
  4. pqGetCopyData3() — Contains pqWait() calls without preceding pqReadData(), vulnerable even with standard OpenSSL (demonstrated by Andres with readahead enabled).
  5. pqSendSome() in non-blocking mode — Reads data as a side effect, then returns before a pqWait().

Proposed Solutions

Patch 0001: GSS pqWait Fix (Straightforward Backport)

Adds pggss_read_pending() checking to pqSocketCheck(), mirroring the existing SSL pending check. This is characterized as an "obvious oversight" — GSS encryption was never given the same treatment as SSL in this code path.

Patch 0002: pqReadData() Drain Semantics (Architectural Fix)

The core fix ensures that pqReadData() provides the same guarantee for SSL/GSS that raw TCP inherently provides: when it returns, either (1) there are no bytes left pending in transport buffers, or (2) the socket itself is marked readable.

This is implemented via a pqDrainPending() subroutine called at the end of pqReadData() that:

  1. Queries the transport layer for pending byte count (pqsecure_bytes_pending())
  2. Ensures conn->inBuffer has sufficient space (expanding if needed)
  3. Calls pqsecure_read() for exactly the pending bytes (guaranteed not to hit the socket)
  4. Advances conn->inEnd appropriately
  5. Repeats until no bytes remain pending
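
The five steps above can be sketched as a simple loop. This is a simulation under stated assumptions, not the patch itself: FakeSecureLayer stands in for the transport layer, its bytes_pending() and read() methods only loosely mirror pqsecure_bytes_pending() and pqsecure_read(), and a bytearray plays the role of conn->inBuffer and conn->inEnd.

```python
class FakeSecureLayer:
    """Hypothetical transport holding already-decrypted bytes in userspace."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    def bytes_pending(self):
        # Bytes readable right now without touching the socket.
        return len(self.chunks[0]) if self.chunks else 0

    def read(self, n):
        # Return up to n bytes from the front of the userspace buffer.
        head = self.chunks[0]
        data, rest = head[:n], head[n:]
        if rest:
            self.chunks[0] = rest
        else:
            self.chunks.pop(0)
        return data


def drain_pending(layer, inbuf):
    """Move every pending byte into inbuf; return the number of bytes moved."""
    moved = 0
    while True:
        pending = layer.bytes_pending()   # step 1: ask the transport
        if pending == 0:                  # step 5: stop when nothing is left
            return moved
        data = layer.read(pending)        # step 3: read exactly that many bytes
        inbuf.extend(data)                # steps 2 and 4: grow buffer, advance end
        moved += len(data)
```

After drain_pending() returns, bytes_pending() is zero, so a caller that goes on to poll the raw socket is no longer ignoring data that was already received.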

Design Evolution

Key Architectural Debate: OpenSSL Readahead

Andres Freund raised an important performance concern. Enabling OpenSSL readahead with SSL_CTX_set_read_ahead(ctx, 1) reduces syscalls by ~18% (5561→4556 in pgbench) by reading a full TLS record in one syscall instead of issuing separate reads for the record header and payload. However:

This disagreement is explicitly deferred — the backpatchable fix does not need to resolve it, but Jacob adds assertions that readahead is off to pin current safety requirements.

The Broader Async API Design Problem

Andres identified that libpq's async APIs are fundamentally difficult to use correctly:

These are deeper design issues that won't be fixed in a backport but inform the long-term direction.

Memory Impact

The drain operation can buffer at most one additional TLS record (~16KB) or GSS token (~16KB) beyond what was already being buffered. Since libpq already routinely doubles its 16KB initial buffer for large messages, this additional memory is considered negligible.
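
A rough worked example of that bound, assuming a plain doubling policy for buffer growth. The function grown_buffer_size is hypothetical and deliberately simpler than libpq's actual buffer management, which may apply further rules.

```python
def grown_buffer_size(current_size, bytes_needed):
    """Smallest size reached by repeated doubling that holds bytes_needed.

    Assumes current_size > 0 and a pure doubling policy.
    """
    size = current_size
    while size < bytes_needed:
        size *= 2
    return size


INITIAL = 16 * 1024       # 16KB initial input buffer, per the text above
EXTRA_RECORD = 16 * 1024  # at most one additional record or token drained

# A full initial buffer plus one drained record fits after a single doubling.
worst_case = grown_buffer_size(INITIAL, INITIAL + EXTRA_RECORD)
```

Under these assumptions worst_case is 32768 bytes, the same size the buffer already reaches whenever a message larger than 16KB arrives.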

Verification Challenges

The bug is difficult to reproduce with standard PostgreSQL because the server's TLS record sizes happen to align with libpq's buffer sizes. Reproduction requires either:

Heikki developed a Python-based reproducer script (psycotest.py) based on Jacob's packet-size specifications that reliably triggers the bug.