Analysis: Add Regression Test for Mismatched ENCODING and LOCALE in CREATE DATABASE
Core Problem
The PostgreSQL documentation for CREATE DATABASE explicitly states that encoding and locale settings must be compatible, and that an error will be reported if they are not. However, the existing regression test suite lacks coverage for this specific failure mode. This means:
- No regression guard exists to ensure the mismatch detection continues working correctly if the underlying validation logic is refactored.
- Documentation-code parity is not verified — the documented behavior could theoretically diverge from actual behavior without any test catching it.
Technical Context
How Encoding/Locale Validation Works in PostgreSQL
When CREATE DATABASE is executed, PostgreSQL validates that the specified encoding is compatible with the requested locale. This validation occurs in createdb() (in src/backend/commands/dbcommands.c). The key function involved is check_encoding_locale_matches(), which verifies that the encoding implied by each of the LC_CTYPE and LC_COLLATE settings is compatible with the explicitly requested encoding.
For example, the locale en_US.UTF-8 implies UTF-8 encoding. If a user specifies ENCODING LATIN1 alongside LOCALE 'en_US.UTF-8', the two are incompatible: the operating system's locale routines would interpret the database's LATIN1 (ISO-8859-1) bytes as UTF-8, so character classification and collation would misbehave on non-ASCII data.
The specific error path produces a message like:
ERROR: encoding "LATIN1" does not match locale "en_US.UTF-8"
DETAIL: The chosen LC_CTYPE setting requires encoding "UTF8".
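The rule described above can be modelled in a few lines. The following is a simplified sketch in Python, used purely for illustration: every name in it (CODESET_TO_ENCODING, implied_encoding, check_mismatch) is hypothetical and does not exist in PostgreSQL. It assumes the implied encoding can be read from the codeset suffix of the locale name, whereas the real backend asks the operating system. It also assumes that a C/POSIX locale or an undeterminable encoding is accepted, and it omits the platform- and privilege-specific exceptions of the real check.

```python
# Simplified, hypothetical model of the encoding/locale compatibility rule.
# None of these names exist in PostgreSQL; the mapping table is illustrative only.

# Assumed mapping from a locale's codeset suffix to a PostgreSQL encoding name.
CODESET_TO_ENCODING = {
    "UTF-8": "UTF8",
    "UTF8": "UTF8",
    "ISO-8859-1": "LATIN1",
    "ISO8859-1": "LATIN1",
}


def implied_encoding(locale_name):
    """Return the encoding a locale name implies, or None if it is unknown."""
    if locale_name in ("C", "POSIX"):
        # Assumption: C/POSIX is treated as compatible with any encoding.
        return "SQL_ASCII"
    _, dot, codeset = locale_name.partition(".")
    if not dot:
        # No codeset suffix: this sketch cannot tell (the real check asks the OS).
        return None
    codeset = codeset.split("@")[0].upper()  # drop modifiers such as "@euro"
    return CODESET_TO_ENCODING.get(codeset)


def check_mismatch(requested_encoding, locale_name):
    """Return an error string for an incompatible pair, or None if acceptable."""
    implied = implied_encoding(locale_name)
    if implied is None or implied == "SQL_ASCII" or implied == requested_encoding:
        return None
    return (
        f'ERROR: encoding "{requested_encoding}" does not match locale "{locale_name}"\n'
        f'DETAIL: The chosen LC_CTYPE setting requires encoding "{implied}".'
    )
```

In this sketch, check_mismatch("LATIN1", "en_US.UTF-8") returns a two-line string shaped like the error above, while check_mismatch("UTF8", "en_US.UTF-8") and check_mismatch("LATIN1", "C") both return None.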
Why This Test Matters Architecturally
While the patch is simple (test-only, no backend changes), it addresses a real gap:
- PostgreSQL's locale handling has been significantly refactored in recent versions (particularly with the introduction of ICU as a database-level locale provider and the LOCALE_PROVIDER option in PG15+, and further builtin provider work in PG17).
- The LOCALE parameter itself is relatively new syntactic sugar that sets both LC_COLLATE and LC_CTYPE simultaneously.
- As locale infrastructure continues to evolve, having explicit regression coverage for the encoding/locale mismatch error path ensures that refactoring doesn't accidentally remove or break this validation.
Proposed Solution
The patch adds a test case to the regression suite (likely a script under src/test/regress/sql/, with the matching expected output under src/test/regress/expected/) that:
- Attempts to create a database with intentionally incompatible settings:
  CREATE DATABASE dbtest LOCALE 'en_US.UTF-8' ENCODING LATIN1 TEMPLATE template0;
- Expects this statement to fail with an appropriate error message.
- Uses TEMPLATE template0 because a new database's encoding and locale must match those of its template unless the template is template0, which is known to contain no encoding- or locale-dependent data.
Design Considerations
- Platform dependency: The locale en_US.UTF-8 must be available on the test system. This is a common concern for locale-dependent tests. Most CI environments and standard installations have this locale, but it's not universal.
- Test placement: This would logically belong alongside other CREATE DATABASE error-path tests, possibly in the CREATE DATABASE-specific test file or in a collation/encoding test file.
- Minimal scope: The patch deliberately avoids changing any backend behavior, making it low-risk for inclusion.
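The platform-dependency concern comes down to asking whether the C library can actually load the locale before relying on it. The sketch below shows one way to probe for that. It is written in Python purely for illustration (PostgreSQL's own test harnesses use SQL and Perl), and the helper name locale_available is hypothetical. Some C libraries accept locale names they do not really provide, so a True result here is a hint rather than a guarantee.

```python
import locale


def locale_available(name, category=locale.LC_CTYPE):
    """Return True if the C library accepts `name` for the given category."""
    saved = locale.setlocale(category)  # query the current setting
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Restore the previous setting so the probe has no lasting side effect.
        locale.setlocale(category, saved)


if __name__ == "__main__":
    if locale_available("en_US.UTF-8"):
        print("en_US.UTF-8 is available; the mismatch test can run")
    else:
        print("en_US.UTF-8 is not available; the mismatch test should be skipped")
```

A harness could use a probe of this kind to skip the mismatch test on systems without the locale, rather than letting it fail with an unrelated "invalid locale name" style error.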
Assessment
This is a straightforward, low-risk patch that closes a gap between documented behavior and test coverage. The main potential concern a reviewer might raise is locale availability on all test platforms. A reviewer might suggest using \! locale -a checks or conditional test logic, or might suggest moving the test into the TAP test infrastructure (for example, a t/ directory under src/test/modules/ or src/bin/), where platform-conditional logic is easier to implement.
The patch is appropriate for the current development cycle and aligns with the project's ongoing effort to improve test coverage, particularly around the increasingly complex locale/encoding infrastructure.