Analysis: pg_strncoll() length argument semantics and collation provider refactoring
Core Problem
This thread sits at the intersection of two concerns in src/backend/utils/adt/pg_locale.c (and related collation plumbing):
-
API ergonomics for collation wrappers. PostgreSQL's internal wrappers
pg_strncoll(),pg_strnxfrm(), andpg_strnxfrm_prefix()dispatch between libc (strcoll_l/strxfrm_l) and ICU (ucol_strcollUTF8etc.) depending on the collation provider. Because libc's string routines require NUL-terminated input while ICU accepts an explicit length (with-1meaning "NUL-terminated"), the PG wrappers historically did substantial bookkeeping to reconcile the two ABIs — often copying into a temp buffer just to NUL-terminate for libc. -
Longer-term architectural direction. Jeff Davis has been incrementally refactoring the locale subsystem so that the collation provider is effectively a vtable (
pg_locale_tcarrying acollate_methodspointer). The end state is thatdatlocprovidercould become an OID referencing aCREATE LOCALE PROVIDERentry, with a handler function producing the method table — i.e., a pluggable/extensible collation provider, analogous to table AMs or index AMs.
The initial commit adopted ICU's "-1 means NUL-terminated" convention for the PG-level wrappers, on the grounds that it simplified call sites that already had NUL-terminated C strings (avoiding a strlen() and/or a copy).
The Pushback from Andres
Andres Freund's objection is not about aesthetics but about tooling and invariants:
- He is developing compiler annotations (in the family of
__attribute__((access(read_only, n, len)))/alloc_size-style contracts) that declare the memory range a function may access. These enable:- Compile-time
-Wstringop-overflow-class diagnostics. - Much more precise UBSan instrumentation — specifically the ability to catch intra-allocation overruns (e.g., reading past a struct field into adjacent valid memory). ASan/Valgrind cannot catch this class of bug because the underlying allocation is legitimately addressable; only size-aware instrumentation can.
- Compile-time
- A sentinel value like
-1(meaning "compute length yourself via NUL scan") defeats these annotations, because the length parameter is no longer an honest upper bound on the access.
Andres also makes two code-quality arguments that reinforce the tooling point:
- The
-1convention introduces asymmetric branches: the wrapper must handle the case where one argument is-1and the other is a real length, even though no caller actually mixes them. This is API surface that exists only to mirror ICU, not to serve PG callers. - If the real motivation is to avoid an unnecessary
strlen/copy when libc has a NUL-terminated input, the clean answer in the emerging vtable design is to add a separate method slot —collate_methods->strcollalongsidestrncoll— rather than overloading length semantics. This aligns with Jeff's own trajectory toward a proper method table.
Andres's voice carries significant weight here: he is a committer with deep expertise in low-level correctness, memory safety tooling, and the performance/security implications of unsafe C idioms. His concern that "a length that is sometimes not a length" blocks a whole class of static and dynamic analysis is effectively a veto on the convention.
Jeff's Response and the Revert
Jeff Davis (committer, owner of most recent locale work) accepts the argument immediately. The tradeoff calculus changes once the -1 sentinel is framed as blocking a security/correctness tool rather than merely being slightly ugly: matching ICU's API is a weak justification, while enabling better UBSan coverage across every collation call site is a strong one.
The follow-up patch reverts the -1 convention, requiring all callers to pass a real length. Jeff notes the one real cost: backport friction. Locale code has had a steady stream of fixes (collation version mismatches, ICU upgrades, libc locale handling), and reverting a signature change creates churn that complicates cherry-picks across branches. This is accepted as worth the price.
Key Technical Insights
The vtable design is the right home for the "NUL-terminated fast path"
The original justification — avoiding a copy when libc gets a NUL-terminated string — is legitimate. The architectural answer is not to overload the length parameter but to expose two method slots. A provider that can exploit NUL-termination (libc) publishes both; a provider that cannot or does not care (ICU, where the underlying call already takes a length) publishes only the length-taking variant. This preserves the invariant that a size_t/ssize_t length parameter always bounds the access.
Sentinel lengths are hostile to memory-safety tooling
This is a generalizable lesson for PostgreSQL: any internal API that uses an in-band sentinel (-1, (size_t)-1, 0 meaning "unknown") in a length parameter forecloses the use of compiler access annotations and degrades UBSan/fortify-source coverage. As PG invests more in sanitizer-based CI (Andres has been a primary driver here), these sentinels become a liability.
Incremental refactoring toward pluggable providers
Jeff's broader plan — pg_locale_t as a method-table handle, eventually a CREATE LOCALE PROVIDER DDL with handler functions — parallels how CREATE ACCESS METHOD generalized heap/index AMs. The refactoring in this thread (and others) is preparatory: pushing provider-specific branching out of the call sites and into a dispatch table. Once the dispatch is clean, the step from "hard-coded libc/ICU/builtin dispatch" to "extension-provided dispatch" is mostly a matter of replacing char datlocprovider with an OID and adding a handler contract.
datlocprovider as OID
The suggestion to make datlocprovider an OID is substantive: it means collation providers become first-class catalog objects with dependency tracking, pg_depend entries, and the ability to be shipped by extensions. This would be a catalog-visible change (pg_database column type) requiring pg_upgrade consideration, which is why Jeff mentions it only as direction, not as an immediate proposal.
Participant Dynamics
- Jeff Davis owns the locale refactoring effort and is driving toward pluggable providers. He proposed the
-1convention and accepted its reversal without pushback once presented with the tooling argument. - Andres Freund intervenes not on the refactoring's direction (which he implicitly endorses via the
strcollvsstrncollmethod-slot suggestion) but on a specific API choice that collides with his orthogonal work on access-range annotations. His review functions as a forward-looking constraint: "design the vtable so that lengths remain lengths."
Timeline Note on Dates
The thread spans an unusually long period: initial post August 2024, follow-up patch September 2024, Andres's objection May 2026, Jeff's revert May 2026. The ~20-month gap suggests the original patch was committed and lived in tree, and Andres's objection arose only when his annotation work actually encountered the -1 sentinel in practice — reinforcing that the problem was not predicted a priori but discovered through tool development.