2026-06-04 · claude-opus-4-6

Technical Analysis: Use ereport() instead of elog() for invalid weights in setweight()

Core Problem

The setweight() function in PostgreSQL's full-text search subsystem (tsvector_op.c) uses elog(ERROR, ...) to report invalid weight arguments, which produces an internal error with SQLSTATE XX000 (internal_error). This is semantically incorrect because the weight parameter comes directly from user input, not from an internal programming invariant violation. The distinction matters for:

Client error handling: Applications that catch errors by SQLSTATE class cannot distinguish a genuine internal bug (XX000) from a simple invalid-parameter error. ERRCODE_INVALID_PARAMETER_VALUE (22023) is the correct classification.
Error message quality: The existing elog() call prints the invalid weight character as a raw ASCII integer (e.g., "unrecognized weight: 112" for 'p'), which is unhelpful to end users.
Consistency: The adjacent function ts_filter() in the same source file already correctly uses ereport() with ERRCODE_INVALID_PARAMETER_VALUE for the equivalent validation, making the inconsistency an obvious oversight.

This falls into a broader ongoing effort in PostgreSQL to audit and fix places where user-reachable error paths incorrectly use internal error codes—a cleanup initiative referenced by the thread as having been previously discussed in the context of Michaël Paquier's work.

Affected Code Paths

Three specific error paths in tsvector_op.c are affected:

tsvector_setweight() (two-argument form): The standard setweight(tsvector, "char") function that assigns a weight label to all lexemes.
tsvector_setweight_by_filter(): The three-argument variant setweight(tsvector, "char", text[]) that assigns weight only to specified lexemes.
Weight validation in ts_filter(): Already correct but relevant as a model for the fix.

The tsvector weight system supports four labels: A, B, C, D (case-insensitive). Any other character value should produce a clear user-facing error.

Proposed Solutions

v1 (Ewan Young)

A straightforward patch that:

Replaces elog(ERROR, "unrecognized weight: %d", ...) with ereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("unrecognized weight: ...")) in both tsvector_setweight() and tsvector_setweight_by_filter()
Adds regression tests covering all three error paths (which had zero test coverage before)
Models the fix on the existing pattern in ts_filter()

v2 (Tom Lane)

A more architecturally considered revision that:

Factors out the weight-parsing and validation logic into a new shared helper function (tentatively named parse_weight)
Addresses a character encoding safety hazard: printing non-ASCII characters (high-bit-set values) directly into error messages risks generating encoding-invalid strings, which could cause cascading errors or security issues. The numeric printing in the original code was likely a deliberate (if poorly documented) defense against this.
Eliminates code triplication: Rather than having three nearly-identical validation blocks, consolidates into a single function that handles both the validation logic and safe error reporting for non-ASCII characters.
Omits regression tests: Tom Lane expressed skepticism about needing regression tests for simple error-path validation like this.

Key Design Tradeoff: Encoding Safety vs. Message Clarity

The most interesting technical insight in this thread is Tom Lane's observation about why the original code printed weights numerically. It wasn't mere laziness—it was a (poorly documented) defense against encoding corruption:

PostgreSQL supports multi-byte server encodings (UTF-8, EUC-*, etc.)
A "char" type column stores a single byte
If that byte has its high bit set (values 128-255), naively printing it as %c in an error message could produce an invalid byte sequence in the server encoding
This could cause the error reporting system itself to fail, or produce garbled/dangerous output

The existing ts_filter() code that was being used as a model for the fix was actually also vulnerable to this hazard—it prints the character directly without checking for non-ASCII values. Tom Lane's v2 addresses this latent bug as well.

Architectural Implications

The refactoring into parse_weight represents a minor but well-motivated application of DRY (Don't Repeat Yourself) principles in PostgreSQL's codebase. The function likely:

Takes a char weight argument
Returns the internal weight bitmask (0-3 mapping)
Raises ereport(ERROR, ERRCODE_INVALID_PARAMETER_VALUE, ...) with a safe error message on invalid input
Handles the ASCII vs. non-ASCII distinction in error formatting in one place

This is consistent with PostgreSQL's general evolution toward cleaner internal APIs and away from ad-hoc inline validation.

Use ereport() instead of elog() for invalid weights in setweight()

Latest Update