[PATCH] libpq: try all addresses for a host before moving to next on target_session_attrs mismatch

First seen: 2026-03-05 14:49:18+00:00 · Messages: 24 · Participants: 10

Latest Update

2026-05-14 · claude-opus-4-6

Technical Analysis: libpq Address Iteration on target_session_attrs Mismatch

Core Problem

When libpq resolves a hostname to multiple IP addresses (via DNS A/AAAA records), and a TCP connection succeeds but the server is rejected due to a target_session_attrs mismatch (e.g., connecting to a standby when target_session_attrs=read-write), libpq currently skips all remaining addresses for that hostname and moves to the next host in the connection string. This behavior was deliberately introduced by Robert Haas in commit 721f7bd3cbc (2016).

This creates a fundamental limitation for DNS-based service discovery in HA PostgreSQL clusters: you cannot point a single DNS name (with multiple A records for all cluster members) at a cluster and rely on target_session_attrs=read-write to find the primary by iterating through resolved addresses. Only the first responding IP is tried before libpq gives up on that hostname entirely.
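The skip rule is easier to see in a toy model. The following is a minimal Python sketch of the behavior described above, not libpq code: the function name, the host names, and the "down" / "standby" / "primary" labels are all illustrative assumptions standing in for real connection outcomes.

```python
# Toy model of the address/host iteration described above; not libpq code.
# Each host carries the addresses DNS returned for it, and each address has
# an assumed outcome: "down" (connect fails), "standby", or "primary".

def connect_current(hosts, want="primary"):
    """Return the first (host, addr) that matches, following today's rules."""
    for host, addrs in hosts:
        for addr, state in addrs:
            if state == "down":
                continue  # connection failure: try the next address
            if state == want:
                return host, addr
            break  # session-attrs mismatch: skip the rest of this host
    return None

# One DNS name covering the whole cluster, standby listed first.
cluster = [("db.example.com", [("10.0.0.1", "standby"),
                               ("10.0.0.2", "primary"),
                               ("10.0.0.3", "standby")])]

# The same servers listed as separate hosts in the connection string.
explicit = [("pg1", [("10.0.0.1", "standby")]),
            ("pg2", [("10.0.0.2", "primary")]),
            ("pg3", [("10.0.0.3", "standby")])]

print(connect_current(cluster))   # None
print(connect_current(explicit))  # ('pg2', '10.0.0.2')
```

In this model the primary is reachable in both layouts, but only the explicit host list finds it, because a mismatch ends the inner loop while a connection failure does not.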

Why This Matters Architecturally

The issue sits at the intersection of three concerns:

  1. Network-level host resolution — what does "a host" mean when DNS returns multiple IPs?
  2. Application-level routing — finding a server with specific characteristics (read-write, primary, standby)
  3. Connection lifecycle management — how libpq's state machine in PQconnectPoll() transitions between addresses and hosts

The original 2016 design assumed that all IPs behind a single hostname are "the same host" — if one IP responds as read-only, they all will be. This assumption breaks in modern HA deployments where a single DNS entry intentionally encompasses heterogeneous cluster members.

Proposed Solutions

1. Evgeny's Original Patch (This Thread)

Approach: Unconditionally iterate all addresses for a hostname before moving to the next host on target_session_attrs mismatch, matching the existing behavior for connection failures.

Concern raised: This changes default behavior for all users. In the common case where localhost resolves to both 127.0.0.1 and ::1, it would double connection failure time and authentication attempts.

2. Evgeny's Refined Proposal

Approach: Only iterate all addresses when target_session_attrs is explicitly set to something other than any. Rationale: users explicitly requesting role-based routing are already signaling they expect probing behavior.

Status: Proposed but not formally patched. Greg Stark supports this direction.

3. Andrew Jackson's check_all_addrs Parameter (Patch 5396)

Approach: Add an explicit connection parameter check_all_addrs (off by default) that enables address iteration on mismatch.

Advantage: Zero regression risk, since the behavior is purely opt-in.

Disadvantage: Adds yet another connection parameter to an already complex parameter space.
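The trade-off behind an opt-in switch can be sketched with a small Python model. This is not the patch itself: the keyword argument below only borrows the proposed parameter's name, and its exact semantics, the data layout, and the attempt log are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Addr:
    ip: str
    state: str  # assumed outcome: "down", "standby", or "primary"

def connect(hosts, want="primary", check_all_addrs=False):
    """Return (match, attempts); attempts lists every address contacted.

    check_all_addrs borrows the proposed parameter's name; the behavior
    modelled here is an assumption, not taken from the patch.
    """
    attempts = []
    for name, addrs in hosts:
        for a in addrs:
            attempts.append(a.ip)
            if a.state == "down":
                continue  # connection failure: always try the next address
            if a.state == want:
                return (name, a.ip), attempts
            if not check_all_addrs:
                break  # default: a mismatch abandons this host
    return None, attempts

dns_cluster = [("db.example.com", [Addr("10.0.0.1", "standby"),
                                   Addr("10.0.0.2", "primary")])]
localhost = [("localhost", [Addr("127.0.0.1", "standby"),
                            Addr("::1", "standby")])]

print(connect(dns_cluster))                        # (None, ['10.0.0.1'])
print(connect(dns_cluster, check_all_addrs=True))  # (('db.example.com', '10.0.0.2'), ['10.0.0.1', '10.0.0.2'])
print(len(connect(localhost)[1]))                        # 1
print(len(connect(localhost, check_all_addrs=True)[1]))  # 2
```

With the switch off, the model reproduces today's behavior exactly. With it on, the DNS-based cluster finds its primary, at the cost of a second attempt in the localhost case where every address gives the same answer.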

4. SRV/SVCB DNS Records (Jacob Champion's Preferred Direction)

Approach: Use DNS service records (RFC 9460 SVCB) which are explicitly designed for service discovery — providing alternative authoritative endpoints with per-endpoint parameters.

Advantage: Architecturally clean separation of concerns. Aligns with what browsers and other cluster-aware software are doing. Doesn't conflate "host" with "bag of arbitrary IPs."

Disadvantage: Requires DNS infrastructure changes, new libpq DNS resolution code, and doesn't help users today.
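To make the service-record idea concrete, here is a hedged Python sketch of how SRV-style data could map onto libpq's existing multi-host syntax. libpq performs no such lookup today, so this only illustrates the data flow. The record fields follow RFC 2782 (priority, weight, port, target); the function name and the example records are hypothetical, and weights are ignored for brevity.

```python
from typing import NamedTuple

class SrvRecord(NamedTuple):
    # Field names follow the SRV record layout in RFC 2782.
    priority: int
    weight: int
    port: int
    target: str

def to_conninfo(records, attrs="read-write"):
    """Build a multi-host conninfo string from SRV-style records.

    Lower priority values come first; weights are ignored in this sketch.
    """
    ordered = sorted(records, key=lambda r: r.priority)  # stable sort
    hosts = ",".join(r.target.rstrip(".") for r in ordered)
    ports = ",".join(str(r.port) for r in ordered)
    return f"host={hosts} port={ports} target_session_attrs={attrs}"

# Hypothetical records, as a resolver might return them.
records = [
    SrvRecord(20, 0, 5432, "pg2.example.com."),
    SrvRecord(10, 0, 5432, "pg1.example.com."),
    SrvRecord(10, 0, 5433, "pg3.example.com."),
]

print(to_conninfo(records))
# host=pg1.example.com,pg3.example.com,pg2.example.com port=5432,5433,5432 target_session_attrs=read-write
```

Because each target becomes its own entry in the host list, a target_session_attrs mismatch on one endpoint moves on to the next one, without redefining what a single hostname means.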

5. Protocol-Level Redirect (Evgeny's Suggestion)

Approach: A standby could redirect clients to the primary at the protocol level (analogous to HTTP 302).

Advantage: The cluster knows its own topology; libpq stays "fast and dumb."

Disadvantage: Requires protocol changes, and doesn't work through proxies (as Zsolt Parragi noted: "Standby saying 'primary is at 10.0.0.42:5432' isn't helpful to the client, proxies exist").

6. Andrew Jackson's Connection Parameter Lookup via HTTP (Patch 6614)

Approach: Use libcurl to fetch connection parameters from an HTTP endpoint, enabling managed service operators to dynamically update connection info without client changes.

Status: Very rough patch, acknowledged as controversial.

Key Technical Disagreement: What Is a "Host"?

The fundamental architectural disagreement centers on whether libpq should treat multiple IPs behind a single DNS name as:

Position A (Evgeny, Andrey Borodin, Greg Stark, Nick): A list of candidate endpoints to try — functionally equivalent to host=pg1,pg2,pg3. The source of the address list (DNS vs. connection string) shouldn't matter.
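The equivalence claimed by Position A can be illustrated by doing the expansion on the client side. The Python sketch below resolves a name with the standard socket.getaddrinfo call and emits a conninfo string that lists every address as its own host entry, using libpq's comma-separated host and hostaddr lists. The function name and the injectable resolver argument are assumptions added for illustration and testing; this is not something libpq does itself.

```python
import socket

def expand_host(name, port=5432, resolver=socket.getaddrinfo):
    """Turn one DNS name into an explicit multi-host conninfo string.

    The name is repeated in host= so it stays available for certificate
    checks, while hostaddr= pins each entry to one resolved address.
    """
    addrs = []
    for _family, _type, _proto, _canon, sockaddr in resolver(
            name, port, type=socket.SOCK_STREAM):
        ip = sockaddr[0]  # first element is the address for IPv4 and IPv6
        if ip not in addrs:
            addrs.append(ip)  # keep resolver order, drop duplicates
    hosts = ",".join([name] * len(addrs))
    return f"host={hosts} hostaddr={','.join(addrs)} port={port}"

# Stand-in resolver with made-up addresses, shaped like getaddrinfo output.
def fake_resolver(name, port, type=0):
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.1", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.2", port)),
        (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("2001:db8::3", port, 0, 0)),
    ]

print(expand_host("db.example.com", resolver=fake_resolver))
# host=db.example.com,db.example.com,db.example.com hostaddr=10.0.0.1,10.0.0.2,2001:db8::3 port=5432
```

A string built this way gives libpq three separate host entries, so a mismatch on one address moves on to the next. The cost is that the address list is frozen at expansion time rather than re-resolved on each connection attempt.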

Position B (Jacob Champion, with Tom Lane's initial sympathy): A single logical host. All IPs represent the same service. Redefining "host" as a bag of arbitrary unrelated IPs is architecturally unsound because:

Jacob's analogy is powerful: if a TLS certificate is invalid for one IP, should libpq try the next IP for the same hostname? No — because something is wrong with the host. Similarly, a target_session_attrs mismatch indicates this IP is "a different host" semantically, and treating it as "the same host that gave the wrong answer" conflates abstraction layers.

Documentation Inconsistency

Evgeny and Artem Navrotskiy (from the prior thread) identified that the current docs state: "When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds."

The current behavior where target_session_attrs mismatch skips remaining addresses contradicts this. This is either:

Performance and Regression Concerns

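Laurenz Albe raised the concrete regression scenario: localhost commonly resolves to both 127.0.0.1 and ::1. If the behavioral change is unconditional, a mismatch on the first address leads to a second attempt against the other, doubling connection failure time and authentication attempts.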

This concern effectively killed the unconditional behavioral change approach.

Industry Context

Current Status

The thread reached no consensus on implementation approach. Jacob Champion's position — that the correct long-term solution is DNS service records (SVCB/SRV) rather than overloading A-record semantics — appears to be the strongest architectural argument, but offers no short-term relief. The practical patches (this thread's and patch 5396) remain in limbo.

Jacob explicitly noted these patches are unlikely to make PG19 (feature freeze already passed during the discussion) and would need a committer willing to maintain them. No committer has volunteered.

The thread effectively paused after Jacob's May 12 responses, with the discussion having shifted from "should we merge this patch" to "what's the right architectural approach for cluster-aware connection routing in libpq."