Technical Analysis: Adding XML Canonicalization to PostgreSQL
Core Problem
XML documents can have multiple valid physical representations that are semantically equivalent. For example, <foo a="1" b="2"/> and <foo b="2" a="1"></foo> represent the same logical document but differ textually. PostgreSQL's xml type has no equality operator (=), making direct comparison impossible. Users need a way to reduce XML documents to a deterministic canonical form so they can be compared as text strings.
The W3C Canonical XML 1.1 specification (C14N 1.1) defines a standard physical representation that normalizes:
- UTF-8 encoding enforcement
- Lexicographic ordering of attributes and namespace declarations
- Entity reference and CDATA section replacement
- Empty element conversion to start-end tag pairs
- Whitespace normalization outside document element
- Superfluous namespace declaration removal
- Default attribute addition
- XML declaration and DTD removal
Evolution of the Proposed Solution
Phase 1: XMLSERIALIZE Extension (Feb 2023 – Sep 2024)
The initial approach extended XMLSERIALIZE syntax with a CANONICAL option:
SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
This required grammar changes to gram.y, a new XmlSerializeFormat enum (XMLSERIALIZE_CANONICAL, XMLSERIALIZE_CANONICAL_WITH_COMMENTS), and implementation in xml.c using libxml2's xmlC14NDocDumpMemory().
Key technical challenges in this phase:
- Encoding handling: The 32-bit Debian CI failure revealed that when
LANG=Cis set (implying a non-UTF-8 locale), character transcoding produced incorrect results. The fix involved properly detecting the input document's declared encoding viaparse_xml_decl()and passingGetDatabaseEncoding()toxml_parse(). - Memory management: Early versions leaked
xmlDocPtrobjects. - CONTENT vs DOCUMENT parsing: C14N requires well-formed XML, so the implementation forces
XMLOPTION_DOCUMENTparsing even for CONTENT inputs that happen to be singly-rooted.
Phase 2: Standalone Function (Sep 2024 – present)
Tom Lane raised a critical architectural concern: extending XMLSERIALIZE syntax risks future collision with SQL committee standardization efforts. The SQL committee might standardize either different semantics for CANONICAL or different syntax for the same functionality. A plain function can never conflict with future SQL grammar changes.
The solution became:
SELECT xmlcanonicalize(xmldoc, keep_comments boolean DEFAULT true);
This is implemented as a built-in SQL function backed by C code in xml.c.
Key Technical Implementation Details
libxml2 Integration
The core implementation calls xmlC14NDocDumpMemory() with the XML_C14N_1_1 enum constant, which implements the full C14N 1.1 specification. The function:
- Parses input XML into a
xmlDocPtrusingxml_parse() - Calls
xmlC14NDocDumpMemory(doc, NULL, XML_C14N_1_1, NULL, keep_comments, &xmlbuf) - Converts the UTF-8 output to server encoding via
pg_any_to_server() - Wraps the call in
PG_TRY/PG_CATCHfor proper cleanup on error
Return Type Decision
An interesting late-stage design discussion: the function initially returned xml, matching other xml*() functions. However, since:
- The canonical form is a serialization artifact, not intended for further XML processing
- The
xmltype lacks an=operator, so comparison (the primary use case) requires casting anyway - C14N output is definitionally text (UTF-8 serialized)
The final version returns text, which better serves the comparison use case directly.
Encoding Correctness
A subtle correctness issue: C14N 1.1 mandates UTF-8 output (libxml2's xmlC14NDocDumpMemory always produces UTF-8). For databases using non-UTF-8 server encodings (e.g., LATIN1, LATIN2), the raw UTF-8 bytes cannot be stored directly in text/varchar columns. The final implementation adds pg_any_to_server() conversion from UTF-8 to the database encoding.
Comments Handling Design
The default for keep_comments evolved through discussion:
- Initially:
false(comments removed by default) — rationale: comparing document content regardless of comments - After review:
true(comments preserved by default) — rationale: removing information should require explicit opt-in; makes code intent clearer when readingxmlcanonicalize(doc, false)
Relationship to Existing XMLSERIALIZE Features
The patch interacts with the earlier INDENT/NO INDENT feature (also authored by Jim Jones). Pavel Stehule raised concerns about conceptual overlap:
NO INDENT: Leaves XML as-is (no pretty-printing). In Oracle, this actively removes formatting.CANONICAL: Produces a normalized form with sorted attributes, resolved entities, etc.
These are distinct operations. NO INDENT controls whitespace presentation; CANONICAL produces a semantically-normalized form for comparison. A potential bug was identified where xmlSaveToBuffer (used by INDENT) ignores elements when whitespace exists between them, but this is orthogonal to canonicalization.
Architectural Significance
This feature fills a real gap in PostgreSQL's XML support. Without canonicalization, users cannot reliably:
- Deduplicate XML documents
- Create unique constraints on XML content
- Compare XML documents in WHERE clauses
- Build indexes on XML document content
The function approach (xmlcanonicalize) is the safest path forward architecturally — it provides the functionality without constraining future SQL/XML standardization of XMLSERIALIZE syntax.