2026-05-27 · claude-opus-4-6

Technical Analysis: Adding XML Canonicalization to PostgreSQL

Core Problem

XML documents can have multiple valid physical representations that are semantically equivalent. For example, <foo a="1" b="2"/> and <foo b="2" a="1"></foo> represent the same logical document but differ textually. PostgreSQL's xml type has no equality operator (=), making direct comparison impossible. Users need a way to reduce XML documents to a deterministic canonical form so they can be compared as text strings.

The W3C Canonical XML 1.1 specification (C14N 1.1) defines a standard physical representation that normalizes:

UTF-8 encoding enforcement
Lexicographic ordering of attributes and namespace declarations
Entity reference and CDATA section replacement
Empty element conversion to start-end tag pairs
Whitespace normalization outside document element
Superfluous namespace declaration removal
Default attribute addition
XML declaration and DTD removal

Evolution of the Proposed Solution

Phase 1: XMLSERIALIZE Extension (Feb 2023 – Sep 2024)

The initial approach extended XMLSERIALIZE syntax with a CANONICAL option:

SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;

This required grammar changes to gram.y, a new XmlSerializeFormat enum (XMLSERIALIZE_CANONICAL, XMLSERIALIZE_CANONICAL_WITH_COMMENTS), and implementation in xml.c using libxml2's xmlC14NDocDumpMemory().

Key technical challenges in this phase:

Encoding handling: The 32-bit Debian CI failure revealed that when LANG=C is set (implying a non-UTF-8 locale), character transcoding produced incorrect results. The fix involved properly detecting the input document's declared encoding via parse_xml_decl() and passing GetDatabaseEncoding() to xml_parse().
Memory management: Early versions leaked xmlDocPtr objects.
CONTENT vs DOCUMENT parsing: C14N requires well-formed XML, so the implementation forces XMLOPTION_DOCUMENT parsing even for CONTENT inputs that happen to be singly-rooted.

Phase 2: Standalone Function (Sep 2024 – present)

Tom Lane raised a critical architectural concern: extending XMLSERIALIZE syntax risks future collision with SQL committee standardization efforts. The SQL committee might standardize either different semantics for CANONICAL or different syntax for the same functionality. A plain function can never conflict with future SQL grammar changes.

The solution became:

SELECT xmlcanonicalize(xmldoc, keep_comments boolean DEFAULT true);

This is implemented as a built-in SQL function backed by C code in xml.c.

Key Technical Implementation Details

libxml2 Integration

The core implementation calls xmlC14NDocDumpMemory() with the XML_C14N_1_1 enum constant, which implements the full C14N 1.1 specification. The function:

Parses input XML into a xmlDocPtr using xml_parse()
Calls xmlC14NDocDumpMemory(doc, NULL, XML_C14N_1_1, NULL, keep_comments, &xmlbuf)
Converts the UTF-8 output to server encoding via pg_any_to_server()
Wraps the call in PG_TRY/PG_CATCH for proper cleanup on error

Return Type Decision

An interesting late-stage design discussion: the function initially returned xml, matching other xml*() functions. However, since:

The canonical form is a serialization artifact, not intended for further XML processing
The xml type lacks an = operator, so comparison (the primary use case) requires casting anyway
C14N output is definitionally text (UTF-8 serialized)

The final version returns text, which better serves the comparison use case directly.

Encoding Correctness

A subtle correctness issue: C14N 1.1 mandates UTF-8 output (libxml2's xmlC14NDocDumpMemory always produces UTF-8). For databases using non-UTF-8 server encodings (e.g., LATIN1, LATIN2), the raw UTF-8 bytes cannot be stored directly in text/varchar columns. The final implementation adds pg_any_to_server() conversion from UTF-8 to the database encoding.

Comments Handling Design

The default for keep_comments evolved through discussion:

Initially: false (comments removed by default) — rationale: comparing document content regardless of comments
After review: true (comments preserved by default) — rationale: removing information should require explicit opt-in; makes code intent clearer when reading xmlcanonicalize(doc, false)

Relationship to Existing XMLSERIALIZE Features

The patch interacts with the earlier INDENT/NO INDENT feature (also authored by Jim Jones). Pavel Stehule raised concerns about conceptual overlap:

NO INDENT: Leaves XML as-is (no pretty-printing). In Oracle, this actively removes formatting.
CANONICAL: Produces a normalized form with sorted attributes, resolved entities, etc.

These are distinct operations. NO INDENT controls whitespace presentation; CANONICAL produces a semantically-normalized form for comparison. A potential bug was identified where xmlSaveToBuffer (used by INDENT) ignores elements when whitespace exists between them, but this is orthogonal to canonicalization.

Architectural Significance

This feature fills a real gap in PostgreSQL's XML support. Without canonicalization, users cannot reliably:

Deduplicate XML documents
Create unique constraints on XML content
Compare XML documents in WHERE clauses
Build indexes on XML document content

The function approach (xmlcanonicalize) is the safest path forward architecturally — it provides the functionality without constraining future SQL/XML standardization of XMLSERIALIZE syntax.

[PoC] Add CANONICAL option to xmlserialize

Latest Update