Logical Replication of Large Objects: Breaching the Catalog Decoding Boundary
The Core Architectural Problem
Large Objects (LOs) in PostgreSQL occupy an awkward middle ground in the storage architecture. They are exposed to users as a first-class data type (via lo_create, lowrite, lo_get, etc.) with their own OID namespace, yet their underlying storage lives in system catalogs: pg_largeobject (the 2KB-chunked page data, keyed by (loid, pageno)) and pg_largeobject_metadata (ownership/ACL). This catalog residency is precisely what has historically excluded LOs from logical replication.
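A quick illustration of that catalog residency (the OID is arbitrary, and reading pg_largeobject directly requires superuser): a few kilobytes written through the client-facing API land as ordinary heap rows, split into LOBLKSIZE chunks (BLCKSZ/4 = 2048 bytes by default).
BEGIN;
SELECT lo_create(16400);                                      -- create an LO with an explicit OID
SELECT lo_put(16400, 0, decode(repeat('61', 5000), 'hex'));   -- write 5000 bytes at offset 0
SELECT pageno, octet_length(data)
FROM pg_largeobject WHERE loid = 16400;                       -- three chunks: 2048 + 2048 + 904 bytes
COMMIT;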
Logical decoding intentionally refuses to decode catalog changes. The reason is semantic: a decoded pg_class insert is meaningless in isolation — the receiving side has no way to reconstruct the dependent relfilenode, pg_attribute rows, constraints, statistics, etc. The decoder therefore treats catalog tuples as opaque internal bookkeeping. But as Dilip Kumar observes, pg_largeobject is not like other catalogs — its rows carry pure user payload (bytea chunks), and are modified at DML frequency, not DDL frequency. The catalog classification is essentially an implementation artifact, not a semantic truth.
This means a large installed base of applications that rely on LOs (legacy JDBC/ODBC applications, systems storing BLOBs, etc.) cannot use logical replication at all — a significant gap now that logical replication is the strategic path for major-version upgrades, selective replication, and cross-version migration.
Dilip's Proposed Decoding Strategy
The proposal works with the existing decoding infrastructure rather than against it:
- Selectively opt pg_largeobject into logical WAL logging via a configuration parameter (GUC or per-publication option). Today pg_largeobject inserts/updates go through the standard heap WAL path but are filtered out of decoding because of its catalog status. The patch would flip that filter conditionally.
- Translate raw heap tuples into logical LO operations in the decoder itself. A pg_largeobject row is (loid, pageno, data). The decoder synthesizes an LO_WRITE operation with offset = pageno * LOBLKSIZE, loid, and the data payload. This is an important design choice: rather than reconstructing user-level lowrite calls from multiple fragmented tuples, each 2KB chunk update becomes its own LO_WRITE. This matches how row UPDATEs work today (whole-row logging, not column-delta logging), and Dilip explicitly calls out this symmetry as justification.
- Apply worker issues lo_open + lo_lseek + lowrite on the subscriber (a sketch follows this list), going through the SQL-callable API so that permissions, triggers, and integrity checks apply normally.
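A minimal sketch of that apply-side sequence, assuming a decoded LO_WRITE for loid 16384 at pageno 3 and an LO that already exists on the subscriber; with LOBLKSIZE at its default of 2048, the byte offset is 3 * 2048 = 6144.
BEGIN;
SELECT lo_open(16384, 131072);        -- 131072 = INV_WRITE; returns a descriptor (0 for the first open in this transaction)
SELECT lo_lseek(0, 6144, 0);          -- seek to offset = pageno * LOBLKSIZE; whence 0 = SEEK_SET
SELECT lowrite(0, '\x0102'::bytea);   -- the decoded chunk payload (shortened here)
COMMIT;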
The key architectural win of this approach is that the subscriber never sees raw pg_largeobject tuples — the replication protocol stays at the logical LO operation level, which insulates subscribers from any future change to LO storage representation (e.g., if LOBLKSIZE changed, or if LO storage were ever restructured).
The lo_create Design Tension
The thread surfaces a genuine design disagreement about where lo_create belongs:
- DDL-like interpretation: lo_create allocates an OID and inserts into pg_largeobject_metadata. Treating it as schema setup means users create LOs manually on the subscriber, mirroring how tables are pre-created. Dilip's initial POC takes this stance.
- DML-like interpretation: In realistic workloads (pre-TOAST-era schemas that store an oid column pointing at a large object rather than a bytea column), lo_create fires on every row insert. Treating it as DDL is operationally untenable.
Bernd Helmle immediately identifies the consequence: if lo_create is not replicated, OID collisions between publisher and subscriber become inevitable whenever the subscriber does any local LO work. Dilip concedes this and proposes routing collision resolution through the existing conflict detection framework — reusing the infrastructure built for row-level conflicts in logical replication (PG 16+/17). This is the right instinct: LO OIDs are semantically equivalent to primary keys.
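The collision is easy to stage by hand (OIDs below are illustrative):
-- Subscriber does local LO work and the shared OID counter hands out 16390:
SELECT lo_create(0);          -- suppose this returns 16390
-- A replicated lo_create from the publisher later needs the same OID:
SELECT lo_create(16390);      -- fails: a metadata row for OID 16390 already exists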
The Tablesync / Permissions Problem (Nitin's Contribution)
Nitin Motiani's extensions reveal that the hard problem isn't streaming changes — it's the initial table synchronization. pg_largeobject and pg_largeobject_metadata are superuser-owned catalogs, so a non-superuser subscription owner cannot COPY from them. Nitin explores three approaches with escalating sophistication:
Approach 1: Superuser-only LO replication
Simplest: expose pg_largeobject in pg_get_publication_tables output and require superuser subscriptions. Trades accessibility for implementation simplicity. This is likely the path of least resistance for an initial commit but significantly limits usability given the general PostgreSQL trajectory away from requiring superuser for replication (cf. the pg_create_subscription role in PG 16).
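One plausible user-facing shape under this approach, purely illustrative since no such publication syntax for a system catalog is accepted today:
-- Publisher (hypothetical): the patch would have to let the catalog into a publication
CREATE PUBLICATION lo_pub FOR TABLE pg_largeobject;   -- rejected by current releases
-- Subscriber: standard syntax, but the subscription owner must be a superuser
CREATE SUBSCRIPTION lo_sub
  CONNECTION 'host=pub dbname=app user=postgres'
  PUBLICATION lo_pub;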
Approach 2: Per-owner filtered sync with two tablesync workers
Run separate tablesync workers for pg_largeobject_metadata and pg_largeobject, filtering to LOs owned by the subscription user's counterpart on the publisher. The synchronization hazard is severe: lowrite on the subscriber validates that the metadata row exists, so if the pg_largeobject worker outruns the pg_largeobject_metadata worker, writes fail with "large object does not exist." Standard tablesync disables integrity checks during COPY; going through the user-facing API loses that escape hatch.
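A sketch of the shape such owner-filtered copies could take (not the patch's actual queries; the data-chunk catalog has no owner column of its own and must join back to the metadata catalog, which is where the ordering coupling comes from):
-- Worker A: metadata for LOs owned by the matching role
SELECT m.oid, m.lomowner, m.lomacl
FROM pg_largeobject_metadata m
JOIN pg_roles r ON r.oid = m.lomowner
WHERE r.rolname = CURRENT_USER;
-- Worker B: the data chunks for the same set of LOs
SELECT l.loid, l.pageno, l.data
FROM pg_largeobject l
JOIN pg_largeobject_metadata m ON m.oid = l.loid
JOIN pg_roles r ON r.oid = m.lomowner
WHERE r.rolname = CURRENT_USER;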
Approach 3: Single tablesync worker using lo_from_bytea (preferred)
The clever synthesis: one worker runs
SELECT m.oid, lo_get(m.oid)
FROM pg_largeobject_metadata m
JOIN pg_user u ON m.lomowner = u.usesysid
WHERE u.usename = CURRENT_USER;
and pipes each row through a refactored lo_from_bytea that creates and writes atomically. This eliminates the metadata/data ordering dependency entirely and is ACL-clean because lo_get performs an ownership check. The price is performance: every object is materialized whole into a bytea and then written back chunk by chunk through the function invocation path.
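lo_from_bytea itself is an existing server-side function (since 9.4); a nonzero OID argument is preserved, which is exactly what keeps publisher and subscriber OIDs aligned during initial sync:
SELECT lo_from_bytea(16384, '\x0102030405'::bytea);   -- creates OID 16384 and writes 5 bytes
SELECT octet_length(lo_get(16384));                   -- 5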
Approach 4: Bulk write API (performance fix)
Nitin's May 2026 follow-up introduces lo_bulk_write(fds int4[], data bytea[]) and lo_bulk_put(oids oid[], data bytea[]), backed by table_multi_insert on pg_largeobject. Restrictions are principled:
- Only for new objects or pure appends; writing mid-object falls back to per-chunk inv_write.
- The constraint is enforced by checking offset % LOBLKSIZE == 0 and the current object size.
This yields ~3x throughput in microbenchmarks and is independently useful for pg_dump/pg_restore LO handling, which historically suffers from one-at-a-time overhead. Nitin correctly anticipates moving this to a separate thread — it's a self-contained performance improvement whose value doesn't depend on the replication work.
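For reference, calls matching the signatures proposed in the thread would look like the following; neither function exists in core PostgreSQL today, and the semantics sketched in the comments are inferred from the thread's description:
-- Bulk write through already-open descriptors
SELECT lo_bulk_write(ARRAY[0, 1]::int4[],
                     ARRAY['\x01'::bytea, '\x02'::bytea]);
-- Bulk create-or-append addressed by OID
SELECT lo_bulk_put(ARRAY[16401, 16402]::oid[],
                   ARRAY['\x01'::bytea, '\x02'::bytea]);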
Key Tradeoffs and Open Questions
- WAL volume: Logically logging every pg_largeobject update adds overhead for workloads that don't need LO replication. The GUC/per-publication gate is essential, but it raises the question of what happens when a publication is added mid-flight: existing LO contents would not have been WAL-logged with the required data (FPIs aside).
- Granularity: The decoder emits per-chunk (2KB) LO_WRITEs, so rewriting a 100MB LO generates on the order of 50,000 protocol messages (100 MB / 2 KB per chunk). For high-throughput LO workloads this may be prohibitive; the bulk-write API on the apply side partially compensates, but the network and decoding cost remains.
- Conflict resolution model: Relying on the existing row-conflict infrastructure for LO OID collisions is reasonable, but LOs lack the replica-identity machinery of user tables. The model for detecting and resolving a collision between a locally created LO with OID 1000 and an incoming replicated LO with OID 1000 needs specification.
- Initial sync semantics for non-superusers: Even Nitin's owner-filtered approach leaves open what happens to LOs owned by users without a subscription counterpart. Does replication silently skip them? That is a data-loss footgun.
- lo_unlink, lo_truncate, and ACL changes: The POC covers lowrite. The full operation set (truncate, unlink, GRANT/REVOKE on LOs) needs decoder support for each operation, and truncate in particular requires careful handling because it deletes multiple pg_largeobject rows.
Strategic Assessment
The proposal is architecturally sound: the key insight that pg_largeobject is a catalog in name only is correct, and the decode-to-logical-operations approach is the right abstraction boundary. Dilip has strong domain authority here (longtime committer active in logical decoding, parallel workers, and storage), and the acknowledgements list (Amit Kapila, Noah Misch, Joe Conway) indicates this has already had substantial senior review. The remaining work is primarily about permission model and initial-sync semantics, not core decoding design. The bulk-write side-quest from Nitin is likely to land independently and help both replication and dump/restore.