2026-06-01 · claude-opus-4-6

Technical Analysis: UTF-8 Substring Validation Bug Report

Core Problem

The reporter encountered a situation where substring(text, int, int) (the SQL standard positional substring function) raised an "invalid byte sequence for encoding UTF8" error on seemingly valid UTF-8 data, while alternative methods of extracting the same characters worked correctly:

substring(body from '^.{4}') (regex-based extraction) — worked
substring(normalize(body), 1, 4) — worked
substring(body, 1, 2) — worked
substring(body, 1, 3) — ERROR: 0xc3
substring(body, 1, 4) — ERROR: 0xc2
substring(body, 1, 5) — worked

The bizarre pattern — where extracting 3 or 4 characters fails but 2 or 5 succeeds — and where the reported invalid byte (0xc2, 0xc3) doesn't appear in the first few characters, strongly suggests a bug in the internal character-counting/byte-offset calculation within the positional substring() code path.

Technical Architecture: Why This Matters

The Two Substring Code Paths

PostgreSQL implements two distinct substring mechanisms:

Positional substring (substring(text, start, length)): Implemented via text_substr() in varlena.c, which must calculate byte offsets by scanning through the multi-byte string character by character. For UTF-8, this uses pg_mblen() to advance through the string.
Regex-based substring (substring(text FROM pattern)): Uses the regex engine which performs its own character iteration independently.

The fact that these two paths produce different results (one errors, one succeeds) on the same data points to a bug in the byte-offset calculation of the positional path.

The 0xc2/0xc3 Byte Significance

In UTF-8 encoding, bytes 0xC2-0xC3 are lead bytes for 2-byte sequences (representing U+0080 to U+00FF, i.e., Latin-1 supplement characters like accented characters, ñ, etc.). The error "invalid byte sequence for encoding UTF8: 0xc2" means the validation code encountered 0xC2 in a position where it expected either a continuation byte or where the following byte wasn't a valid continuation (0x80-0xBF).

The critical insight is that these bytes appear deeper in the string (not in positions 1-4), suggesting the character-counting logic was miscounting and landing on byte boundaries that split a multi-byte character. The string contains multi-byte UTF-8 sequences (2323 bytes for 2314 characters = 9 two-byte characters), and a miscounted offset could land mid-character.

The Likely Root Cause: Commit 9f4fd119b2c

Fujii Masao's reference to commit 9f4fd119b2c is highly relevant. This commit fixed an issue where text_substr() could perform incorrect validation during substring extraction. The bug manifested specifically when:

The string contained multi-byte UTF-8 characters
The requested substring length caused the byte-offset calculation to land within a multi-byte sequence boundary
A post-extraction validation pass then flagged the apparently-truncated multi-byte sequence

This aligns perfectly with the reporter's symptoms: the error occurs at specific substring lengths (3 and 4) but not others, the reported invalid bytes are multi-byte lead bytes, and the data is demonstrably valid UTF-8 (other functions work correctly on it).

Key Design Insight: Validation vs. Trust

PostgreSQL has a longstanding tension between:

Trusting stored data: Once data passes input validation, internal operations shouldn't need to re-validate
Defensive validation: Re-checking encoding validity during operations to catch corruption

The bug likely existed in an over-eager validation check during substring extraction that was applied to a byte range that didn't align with character boundaries — essentially validating a window of bytes that started or ended mid-character due to an off-by-one or similar calculation error.

Resolution

The reporter ultimately could not reproduce the issue on other machines and acknowledged it may be version-specific. Fujii Masao identified the likely prior fix (commit 9f4fd119b2c), suggesting the reporter was running an older PostgreSQL version that contained this known bug. The thread was effectively closed without a new bug being filed, as the issue appears to have been previously fixed.

Reproducibility Challenges

The reporter noted that:

Pure SQL approaches to load the same data didn't trigger the bug
Shortening the string by removing "normal" characters between the multi-byte sequences also prevented the failure
The issue only manifested when data was loaded via a Python script (likely bypassing some normalization)

This suggests the bug required specific string lengths or byte alignments to trigger, consistent with buffer-boundary or chunk-size related off-by-one errors in the substring implementation.

some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

Latest Update