Refactor: allow pg_strncoll(), etc., to accept -1 length for NUL-terminated cstrings.

First seen: 2024-08-22 18:00:54+00:00 · Messages: 4 · Participants: 2

Latest Update

2026-05-06 · opus 4.7

Analysis: pg_strncoll() length argument semantics and collation provider refactoring

Core Problem

This thread sits at the intersection of two concerns in src/backend/utils/adt/pg_locale.c (and related collation plumbing):

  1. API ergonomics for collation wrappers. PostgreSQL's internal wrappers pg_strncoll(), pg_strnxfrm(), and pg_strnxfrm_prefix() dispatch between libc (strcoll_l/strxfrm_l) and ICU (ucol_strcollUTF8 etc.) depending on the collation provider. Because libc's string routines require NUL-terminated input while ICU accepts an explicit length (with -1 meaning "NUL-terminated"), the PG wrappers historically did substantial bookkeeping to reconcile the two ABIs — often copying into a temp buffer just to NUL-terminate for libc.

  2. Longer-term architectural direction. Jeff Davis has been incrementally refactoring the locale subsystem so that the collation provider is effectively a vtable (pg_locale_t carrying a collate_methods pointer). The end state is that datlocprovider could become an OID referencing a CREATE LOCALE PROVIDER entry, with a handler function producing the method table — i.e., a pluggable/extensible collation provider, analogous to table AMs or index AMs.

The initial commit adopted ICU's "-1 means NUL-terminated" convention for the PG-level wrappers, on the grounds that it simplified call sites that already had NUL-terminated C strings (avoiding a strlen() and/or a copy).

The Pushback from Andres

Andres Freund's objection is not about aesthetics but about tooling and invariants:

Andres also makes two code-quality arguments that reinforce the tooling point:

Andres's voice carries significant weight here: he is a committer with deep expertise in low-level correctness, memory safety tooling, and the performance/security implications of unsafe C idioms. His concern that "a length that is sometimes not a length" blocks a whole class of static and dynamic analysis is effectively a veto on the convention.

Jeff's Response and the Revert

Jeff Davis (committer, owner of most recent locale work) accepts the argument immediately. The tradeoff calculus changes once the -1 sentinel is framed as blocking a security/correctness tool rather than merely being slightly ugly: matching ICU's API is a weak justification, while enabling better UBSan coverage across every collation call site is a strong one.

The follow-up patch reverts the -1 convention, requiring all callers to pass a real length. Jeff notes the one real cost: backport friction. Locale code has had a steady stream of fixes (collation version mismatches, ICU upgrades, libc locale handling), and reverting a signature change creates churn that complicates cherry-picks across branches. This is accepted as worth the price.

Key Technical Insights

The vtable design is the right home for the "NUL-terminated fast path"

The original justification — avoiding a copy when libc gets a NUL-terminated string — is legitimate. The architectural answer is not to overload the length parameter but to expose two method slots. A provider that can exploit NUL-termination (libc) publishes both; a provider that cannot or does not care (ICU, where the underlying call already takes a length) publishes only the length-taking variant. This preserves the invariant that a size_t/ssize_t length parameter always bounds the access.

Sentinel lengths are hostile to memory-safety tooling

This is a generalizable lesson for PostgreSQL: any internal API that uses an in-band sentinel (-1, (size_t)-1, 0 meaning "unknown") in a length parameter forecloses the use of compiler access annotations and degrades UBSan/fortify-source coverage. As PG invests more in sanitizer-based CI (Andres has been a primary driver here), these sentinels become a liability.

Incremental refactoring toward pluggable providers

Jeff's broader plan — pg_locale_t as a method-table handle, eventually a CREATE LOCALE PROVIDER DDL with handler functions — parallels how CREATE ACCESS METHOD generalized heap/index AMs. The refactoring in this thread (and others) is preparatory: pushing provider-specific branching out of the call sites and into a dispatch table. Once the dispatch is clean, the step from "hard-coded libc/ICU/builtin dispatch" to "extension-provided dispatch" is mostly a matter of replacing char datlocprovider with an OID and adding a handler contract.

datlocprovider as OID

The suggestion to make datlocprovider an OID is substantive: it means collation providers become first-class catalog objects with dependency tracking, pg_depend entries, and the ability to be shipped by extensions. This would be a catalog-visible change (pg_database column type) requiring pg_upgrade consideration, which is why Jeff mentions it only as direction, not as an immediate proposal.

Participant Dynamics

Timeline Note on Dates

The thread spans an unusually long period: initial post August 2024, follow-up patch September 2024, Andres's objection May 2026, Jeff's revert May 2026. The ~20-month gap suggests the original patch was committed and lived in tree, and Andres's objection arose only when his annotation work actually encountered the -1 sentinel in practice — reinforcing that the problem was not predicted a priori but discovered through tool development.