Technical Analysis: Use ereport() instead of elog() for invalid weights in setweight()
Core Problem
The setweight() function in PostgreSQL's full-text search subsystem (tsvector_op.c) uses elog(ERROR, ...) to report invalid weight arguments, which produces an internal error with SQLSTATE XX000 (internal_error). This is semantically incorrect because the weight parameter comes directly from user input, not from an internal programming invariant violation. The distinction matters for:
- Client error handling: Applications that catch errors by SQLSTATE class cannot distinguish a genuine internal bug (XX000) from a simple invalid-parameter error. ERRCODE_INVALID_PARAMETER_VALUE (22023) is the correct classification.
- Error message quality: The existing
elog()call prints the invalid weight character as a raw ASCII integer (e.g., "unrecognized weight: 112" for 'p'), which is unhelpful to end users. - Consistency: The adjacent function
ts_filter()in the same source file already correctly usesereport()withERRCODE_INVALID_PARAMETER_VALUEfor the equivalent validation, making the inconsistency an obvious oversight.
This falls into a broader ongoing effort in PostgreSQL to audit and fix places where user-reachable error paths incorrectly use internal error codes—a cleanup initiative referenced by the thread as having been previously discussed in the context of Michaël Paquier's work.
Affected Code Paths
Three specific error paths in tsvector_op.c are affected:
tsvector_setweight()(two-argument form): The standardsetweight(tsvector, "char")function that assigns a weight label to all lexemes.tsvector_setweight_by_filter(): The three-argument variantsetweight(tsvector, "char", text[])that assigns weight only to specified lexemes.- Weight validation in
ts_filter(): Already correct but relevant as a model for the fix.
The tsvector weight system supports four labels: A, B, C, D (case-insensitive). Any other character value should produce a clear user-facing error.
Proposed Solutions
v1 (Ewan Young)
A straightforward patch that:
- Replaces
elog(ERROR, "unrecognized weight: %d", ...)withereport(ERROR, errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("unrecognized weight: ..."))in bothtsvector_setweight()andtsvector_setweight_by_filter() - Adds regression tests covering all three error paths (which had zero test coverage before)
- Models the fix on the existing pattern in
ts_filter()
v2 (Tom Lane)
A more architecturally considered revision that:
- Factors out the weight-parsing and validation logic into a new shared helper function (tentatively named
parse_weight) - Addresses a character encoding safety hazard: printing non-ASCII characters (high-bit-set values) directly into error messages risks generating encoding-invalid strings, which could cause cascading errors or security issues. The numeric printing in the original code was likely a deliberate (if poorly documented) defense against this.
- Eliminates code triplication: Rather than having three nearly-identical validation blocks, consolidates into a single function that handles both the validation logic and safe error reporting for non-ASCII characters.
- Omits regression tests: Tom Lane expressed skepticism about needing regression tests for simple error-path validation like this.
Key Design Tradeoff: Encoding Safety vs. Message Clarity
The most interesting technical insight in this thread is Tom Lane's observation about why the original code printed weights numerically. It wasn't mere laziness—it was a (poorly documented) defense against encoding corruption:
- PostgreSQL supports multi-byte server encodings (UTF-8, EUC-*, etc.)
- A
"char"type column stores a single byte - If that byte has its high bit set (values 128-255), naively printing it as
%cin an error message could produce an invalid byte sequence in the server encoding - This could cause the error reporting system itself to fail, or produce garbled/dangerous output
The existing ts_filter() code that was being used as a model for the fix was actually also vulnerable to this hazard—it prints the character directly without checking for non-ASCII values. Tom Lane's v2 addresses this latent bug as well.
Architectural Implications
The refactoring into parse_weight represents a minor but well-motivated application of DRY (Don't Repeat Yourself) principles in PostgreSQL's codebase. The function likely:
- Takes a
charweight argument - Returns the internal weight bitmask (0-3 mapping)
- Raises
ereport(ERROR, ERRCODE_INVALID_PARAMETER_VALUE, ...)with a safe error message on invalid input - Handles the ASCII vs. non-ASCII distinction in error formatting in one place
This is consistent with PostgreSQL's general evolution toward cleaner internal APIs and away from ad-hoc inline validation.