libpq Async API Deadlock with SSL Records >8kB
The Core Problem
The libpq asynchronous API has a fundamental architectural flaw: it assumes that after calling pqReadData() once, if no complete protocol message is available, the application should poll the underlying socket for readability before trying again. This assumption is correct for plain TCP (where kernel buffers are the only source of pending data), but breaks catastrophically with TLS and GSS encryption because these layers introduce userspace buffering that is invisible to socket polling.
Why This Causes Deadlocks
The failure mode is a classic producer-consumer deadlock:
- The server sends a large protocol message (e.g., a 32KB DataRow) split across multiple TLS records
- libpq's pqReadData() calls SSL_read(), which decrypts one TLS record and returns its contents
- The decrypted data doesn't form a complete protocol message, so the caller (e.g., PQconsumeInput()) tells the application "I'm busy, poll the socket"
- Critical bug: OpenSSL has already read subsequent TLS records from the socket into its internal buffer during the same SSL_read() call (because TCP segments don't align with TLS record boundaries). These bytes are now invisible to poll()/select() on the raw socket.
- The server has sent all its data and is waiting for the client's next request
- The client is waiting for socket readability that will never come → deadlock
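The sequence above can be reproduced with a toy model. This is a minimal sketch, not libpq or OpenSSL code: ToyTlsTransport, naive_async_client, and the 12KB/12KB/8KB record split are hypothetical names and values chosen for illustration, and the model assumes that a single read pulls every queued record off the socket while returning only one record's contents.

```python
from collections import deque


class ToyTlsTransport:
    """Toy stand-in for a TLS library layered over a socket (not OpenSSL)."""

    def __init__(self, records):
        self.kernel = deque(records)   # records still in the kernel socket buffer
        self.userspace = deque()       # records already pulled into the TLS library

    def socket_readable(self):
        # What poll()/select() on the raw socket would report.
        return bool(self.kernel)

    def pending(self):
        # Bytes buffered in userspace, invisible to poll().
        return sum(len(r) for r in self.userspace)

    def read(self):
        # Assumption: one read pulls every queued record off the socket,
        # but hands back the contents of only one record.
        while self.kernel:
            self.userspace.append(self.kernel.popleft())
        return self.userspace.popleft() if self.userspace else b""


def naive_async_client(transport, message_len, max_rounds=10):
    """One read per readiness event, polling only the raw socket."""
    received = 0
    for _ in range(max_rounds):
        if not transport.socket_readable():
            # A real poll() would block here forever: the server is done sending.
            return ("deadlock", received, transport.pending())
        received += len(transport.read())
        if received >= message_len:
            return ("complete", received, transport.pending())
    return ("gave up", received, transport.pending())


# A 32KB message split into 12KB + 12KB + 8KB records.
records = [b"x" * 12288, b"x" * 12288, b"x" * 8192]
result = naive_async_client(ToyTlsTransport(records), message_len=32768)
print(result)  # ('deadlock', 12288, 20480)
```

After the first read the client holds 12288 bytes, the raw socket is empty, and 20480 bytes sit in the userspace buffer where no socket poll can see them.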
Why Standard PostgreSQL Servers Don't Trigger This
The PostgreSQL server uses OpenSSL's default maximum TLS record size of ~16KB, and libpq uses a 16KB initial input buffer. This means that in practice, a single SSL_read() call retrieves an entire TLS record, and the protocol message parser in fe-protocol3.c pre-expands the buffer (via pqCheckInBufferSpace()) to hold complete messages. This accidentally prevents the bug from manifesting.
However, alternative PostgreSQL-wire-protocol servers (AWS RDS Aurora Serverless, YugabyteDB, CockroachDB) and TLS-terminating proxies use different TLS record sizes (often 12KB or larger than 16KB), which breaks these accidental assumptions.
The Existing "Cheat" That Partially Masks the Bug
libpq's pqSocketCheck() (used by pqWait()) already checks SSL_pending() before polling the socket, which short-circuits the poll if OpenSSL has buffered data. This protects the synchronous API path. But asynchronous API users call PQsocket() and do their own polling — they never benefit from this internal protection. The cheat also doesn't exist for GSS encryption, which is why GSS connections can deadlock even with standard PostgreSQL servers under the right conditions.
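The check-pending-before-polling pattern can be sketched outside libpq as follows. This is an illustration of the idea, not the pqSocketCheck() implementation: wait_readable is a hypothetical helper, and it assumes an object that exposes pending() and fileno() the way Python's ssl.SSLSocket does.

```python
import select


def wait_readable(tls_sock, timeout):
    """Return True if a read can make progress without blocking.

    tls_sock is assumed to expose pending() and fileno(),
    as ssl.SSLSocket does.
    """
    # Bytes already decrypted and buffered in userspace: skip the poll,
    # because the raw socket may never become readable again.
    if tls_sock.pending() > 0:
        return True
    # Nothing buffered in userspace, so the kernel socket is the only source.
    readable, _, _ = select.select([tls_sock], [], [], timeout)
    return bool(readable)
```

An application that polls the descriptor from PQsocket() directly skips the first branch, which is exactly the gap described above.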
Affected Code Paths
Multiple API entry points are vulnerable:
- PQconsumeInput(): The most commonly reported case. Called by async clients between poll() calls.
- PQconnectPoll(): Demonstrated with a protocol 2.0 error message exceeding 16KB split across TLS records during connection establishment.
- PQgetResult() in non-blocking mode: At least one code path returns without fully consuming the buffer.
- pqGetCopyData3(): Contains pqWait() calls without preceding pqReadData(), vulnerable even with standard OpenSSL (demonstrated by Andres with readahead enabled).
- pqSendSome() in non-blocking mode: Reads data as a side effect, then returns before a pqWait().
Proposed Solutions
Patch 0001: GSS pqWait Fix (Straightforward Backport)
Adds pggss_read_pending() checking to pqSocketCheck(), mirroring the existing SSL pending check. This is characterized as an "obvious oversight" — GSS encryption was never given the same treatment as SSL in this code path.
Patch 0002: pqReadData() Drain Semantics (Architectural Fix)
The core fix ensures that pqReadData() provides the same guarantee for SSL/GSS that raw TCP inherently provides: when it returns, either (1) there are no bytes left pending in transport buffers, or (2) the socket itself is marked readable.
This is implemented via a pqDrainPending() subroutine called at the end of pqReadData() that:
- Queries the transport layer for pending byte count (pqsecure_bytes_pending())
- Ensures conn->inBuffer has sufficient space (expanding if needed)
- Calls pqsecure_read() for exactly the pending bytes (guaranteed not to hit the socket)
- Advances conn->inEnd appropriately
- Repeats until no bytes remain pending
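The drain loop described by these steps can be modeled in a few lines. This is a sketch of the control flow only, not the patch itself: ToyTransport, ToyConn, and drain_pending are hypothetical stand-ins for the transport layer, PGconn, and pqDrainPending(), and the buffer-doubling policy is an assumption made for the example.

```python
class ToyTransport:
    """Toy transport holding bytes that are already buffered in userspace."""

    def __init__(self, buffered):
        self._buffered = bytes(buffered)

    def bytes_pending(self):
        # Stand-in for pqsecure_bytes_pending().
        return len(self._buffered)

    def read(self, n):
        # Stand-in for pqsecure_read(): served from the buffer, never the socket.
        data, self._buffered = self._buffered[:n], self._buffered[n:]
        return data


class ToyConn:
    """Toy model of the PGconn fields the drain loop touches."""

    def __init__(self, transport, in_buf_size=16 * 1024):
        self.transport = transport
        self.in_buffer = bytearray(in_buf_size)
        self.in_end = 0


def drain_pending(conn):
    """Move every pending userspace byte into conn.in_buffer."""
    while True:
        pending = conn.transport.bytes_pending()
        if pending == 0:
            return
        # Ensure the buffer can hold the pending bytes (assumed: grow by doubling).
        needed = conn.in_end + pending
        new_size = len(conn.in_buffer)
        while new_size < needed:
            new_size *= 2
        conn.in_buffer.extend(bytes(new_size - len(conn.in_buffer)))
        # Read exactly the pending bytes and advance the end-of-data marker.
        data = conn.transport.read(pending)
        conn.in_buffer[conn.in_end:conn.in_end + len(data)] = data
        conn.in_end += len(data)
```

When drain_pending() returns, the transport reports zero pending bytes, so a later poll on the raw socket cannot miss data that is already in userspace.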
Design Evolution
- v1 (Lars's original): Loop around pqReadData() inside PQconsumeInput(). This fixes the symptom but not the root cause.
- v2 (Jacob's approach): Drain at the pqReadData() level with transport-specific *_drain_pending() functions that write directly to conn->inBuffer
- v3 (Heikki's refinement): Cleaner layering. Introduces pqsecure_bytes_pending() returning a count, then uses existing pqsecure_read() to drain. Avoids having the TLS/GSS layer reach into conn->inBuffer directly.
Key Architectural Debate: OpenSSL Readahead
Andres Freund raised an important performance concern. OpenSSL's SSL_CTX_set_read_ahead(1) reduces syscalls by ~18% (5561→4556 in pgbench) by reading full TLS frames in one syscall instead of splitting header and payload reads. However:
- Readahead is fundamentally incompatible with the current async architecture because it hides buffered bytes from SSL_pending() (only SSL_has_pending() reveals them)
- It changes the upper bound of hidden data from "one TLS record" (~16KB) to SO_RCVBUF (potentially hundreds of KB)
- Jacob argues this door should remain closed; Andres argues entrenching the "no sub-libpq buffering" assumption is a dead end for performance
This disagreement is explicitly deferred — the backpatchable fix does not need to resolve it, but Jacob adds assertions that readahead is off to pin current safety requirements.
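The difference between the two pending queries can be illustrated with a toy model. This is not OpenSSL code: ToyReadAheadTls and safe_to_poll are hypothetical, and the model assumes only what is stated above, namely that the SSL_pending()-style query counts decrypted bytes while the SSL_has_pending()-style query also sees raw bytes that were read ahead.

```python
class ToyReadAheadTls:
    """Toy model of the two buffers the pending queries look at (not OpenSSL)."""

    def __init__(self, processed=b"", unprocessed=b""):
        self.processed = processed      # decrypted bytes ready for the caller
        self.unprocessed = unprocessed  # raw bytes read ahead, not yet decrypted

    def pending(self):
        # Models SSL_pending(): counts processed bytes only.
        return len(self.processed)

    def has_pending(self):
        # Models SSL_has_pending(): true for processed or unprocessed data.
        return bool(self.processed or self.unprocessed)


def safe_to_poll(tls, use_has_pending):
    """Return True if the guard believes nothing is buffered in userspace."""
    if use_has_pending:
        return not tls.has_pending()
    return tls.pending() == 0


# 4KB was read ahead from the socket but has not been decrypted yet.
tls = ToyReadAheadTls(unprocessed=b"r" * 4096)
print(safe_to_poll(tls, use_has_pending=False))  # True: the guard is fooled
print(safe_to_poll(tls, use_has_pending=True))   # False: buffered data detected
```

A guard built on the byte-count query alone would send the application back to poll() while read-ahead data is still waiting in userspace.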
The Broader Async API Design Problem
Andres identified that libpq's async APIs are fundamentally difficult to use correctly:
- PQconsumeInput() doesn't report whether it actually consumed anything
- Documentation implies it consumes "all available input" but it only fills the internal buffer (which may be small)
- PQisBusy() returns true even if unconsumed data exists in the socket
- There's no way to write an event loop that avoids unnecessary poll() calls
These are deeper design issues that won't be fixed in a backport but inform the long-term direction.
Memory Impact
The drain operation can buffer at most one additional TLS record (~16KB) or GSS token (~16KB) beyond what was already being buffered. Since libpq already routinely doubles its 16KB initial buffer for large messages, this additional memory is considered negligible.
Verification Challenges
The bug is difficult to reproduce with standard PostgreSQL because the server's TLS record sizes happen to align with libpq's buffer sizes. Reproduction requires either:
- A non-PostgreSQL wire-protocol server (YugabyteDB, CockroachDB, Aurora)
- A TLS-terminating proxy that re-encrypts with different record sizes
- A custom test harness that crafts specific TLS record/message size combinations
- Modifying libpq to enable OpenSSL readahead
Heikki developed a Python-based reproducer script (psycotest.py) based on Jacob's packet-size specifications that reliably triggers the bug.
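Independently of that script, the size arithmetic behind a custom harness can be sketched as follows. This is not psycotest.py: split_into_records, left_after_one_read, and the read_size parameter are hypothetical, and the model assumes the harness fully controls how a payload is cut into records.

```python
def split_into_records(payload, record_size):
    """Split payload into consecutive chunks of at most record_size bytes."""
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    return [payload[i:i + record_size] for i in range(0, len(payload), record_size)]


def left_after_one_read(records, read_size):
    """Bytes of the first record left in userspace after one read of read_size bytes."""
    if not records:
        return 0
    return max(0, len(records[0]) - read_size)


payload = b"x" * 32768
records = split_into_records(payload, 12288)
print([len(r) for r in records])           # [12288, 12288, 8192]
print(left_after_one_read(records, 8192))  # 4096
```

Any record size for which left_after_one_read() is nonzero leaves bytes behind in userspace after a single read, which is the condition a reproducer needs to create.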