Technical Analysis: UTF-8 Substring Validation Bug Report
Core Problem
The reporter encountered a situation where substring(text, int, int) (the SQL standard positional substring function) raised an "invalid byte sequence for encoding UTF8" error on seemingly valid UTF-8 data, while alternative methods of extracting the same characters worked correctly:
substring(body from '^.{4}')(regex-based extraction) — workedsubstring(normalize(body), 1, 4)— workedsubstring(body, 1, 2)— workedsubstring(body, 1, 3)— ERROR: 0xc3substring(body, 1, 4)— ERROR: 0xc2substring(body, 1, 5)— worked
The bizarre pattern — where extracting 3 or 4 characters fails but 2 or 5 succeeds — and where the reported invalid byte (0xc2, 0xc3) doesn't appear in the first few characters, strongly suggests a bug in the internal character-counting/byte-offset calculation within the positional substring() code path.
Technical Architecture: Why This Matters
The Two Substring Code Paths
PostgreSQL implements two distinct substring mechanisms:
-
Positional substring (
substring(text, start, length)): Implemented viatext_substr()invarlena.c, which must calculate byte offsets by scanning through the multi-byte string character by character. For UTF-8, this usespg_mblen()to advance through the string. -
Regex-based substring (
substring(text FROM pattern)): Uses the regex engine which performs its own character iteration independently.
The fact that these two paths produce different results (one errors, one succeeds) on the same data points to a bug in the byte-offset calculation of the positional path.
The 0xc2/0xc3 Byte Significance
In UTF-8 encoding, bytes 0xC2-0xC3 are lead bytes for 2-byte sequences (representing U+0080 to U+00FF, i.e., Latin-1 supplement characters like accented characters, ñ, etc.). The error "invalid byte sequence for encoding UTF8: 0xc2" means the validation code encountered 0xC2 in a position where it expected either a continuation byte or where the following byte wasn't a valid continuation (0x80-0xBF).
The critical insight is that these bytes appear deeper in the string (not in positions 1-4), suggesting the character-counting logic was miscounting and landing on byte boundaries that split a multi-byte character. The string contains multi-byte UTF-8 sequences (2323 bytes for 2314 characters = 9 two-byte characters), and a miscounted offset could land mid-character.
The Likely Root Cause: Commit 9f4fd119b2c
Fujii Masao's reference to commit 9f4fd119b2c is highly relevant. This commit fixed an issue where text_substr() could perform incorrect validation during substring extraction. The bug manifested specifically when:
- The string contained multi-byte UTF-8 characters
- The requested substring length caused the byte-offset calculation to land within a multi-byte sequence boundary
- A post-extraction validation pass then flagged the apparently-truncated multi-byte sequence
This aligns perfectly with the reporter's symptoms: the error occurs at specific substring lengths (3 and 4) but not others, the reported invalid bytes are multi-byte lead bytes, and the data is demonstrably valid UTF-8 (other functions work correctly on it).
Key Design Insight: Validation vs. Trust
PostgreSQL has a longstanding tension between:
- Trusting stored data: Once data passes input validation, internal operations shouldn't need to re-validate
- Defensive validation: Re-checking encoding validity during operations to catch corruption
The bug likely existed in an over-eager validation check during substring extraction that was applied to a byte range that didn't align with character boundaries — essentially validating a window of bytes that started or ended mid-character due to an off-by-one or similar calculation error.
Resolution
The reporter ultimately could not reproduce the issue on other machines and acknowledged it may be version-specific. Fujii Masao identified the likely prior fix (commit 9f4fd119b2c), suggesting the reporter was running an older PostgreSQL version that contained this known bug. The thread was effectively closed without a new bug being filed, as the issue appears to have been previously fixed.
Reproducibility Challenges
The reporter noted that:
- Pure SQL approaches to load the same data didn't trigger the bug
- Shortening the string by removing "normal" characters between the multi-byte sequences also prevented the failure
- The issue only manifested when data was loaded via a Python script (likely bypassing some normalization)
This suggests the bug required specific string lengths or byte alignments to trigger, consistent with buffer-boundary or chunk-size related off-by-one errors in the substring implementation.