[PoC] Add CANONICAL option to xmlserialize

First seen: 2023-02-27 13:16:30+00:00 · Messages: 54 · Participants: 10

Latest Update

2026-05-27 · claude-opus-4-6

Technical Analysis: Adding XML Canonicalization to PostgreSQL

Core Problem

XML documents can have multiple valid physical representations that are semantically equivalent. For example, <foo a="1" b="2"/> and <foo b="2" a="1"></foo> represent the same logical document but differ textually. PostgreSQL's xml type has no equality operator (=), making direct comparison impossible. Users need a way to reduce XML documents to a deterministic canonical form so they can be compared as text strings.

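The W3C Canonical XML 1.1 specification (C14N 1.1) defines a standard physical representation that normalizes, among other things, attribute order, empty-element syntax (always a start/end tag pair), whitespace inside tags, attribute quoting, and redundant namespace declarations. The result is always encoded as UTF-8, with no XML declaration or DTD.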
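The comparison problem can be illustrated outside PostgreSQL. The following is a minimal sketch using Python's standard library. It rests on one assumption: xml.etree.ElementTree.canonicalize() implements C14N 2.0, not the C14N 1.1 used by the patch, and is used here only because the two agree on a simple input like this one.

```python
from xml.etree.ElementTree import canonicalize

# Two textually different serializations of the same logical document.
doc_a = '<foo a="1" b="2"/>'
doc_b = '<foo b="2" a="1"></foo>'

# A plain string comparison treats them as different.
raw_equal = doc_a == doc_b

# Canonicalization sorts attributes and expands the empty-element tag,
# so both inputs reduce to the same text.
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)
canon_equal = canon_a == canon_b

print(raw_equal, canon_equal, canon_a)
```

Both inputs reduce to the same string, which is the property the proposed feature would make available in SQL.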

Evolution of the Proposed Solution

Phase 1: XMLSERIALIZE Extension (Feb 2023 – Sep 2024)

The initial approach extended XMLSERIALIZE syntax with a CANONICAL option:

SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;

This required grammar changes to gram.y, a new XmlSerializeFormat enum (XMLSERIALIZE_CANONICAL, XMLSERIALIZE_CANONICAL_WITH_COMMENTS), and implementation in xml.c using libxml2's xmlC14NDocDumpMemory().

Key technical challenges in this phase:

  1. Encoding handling: The 32-bit Debian CI failure revealed that when LANG=C is set (implying a non-UTF-8 locale), character transcoding produced incorrect results. The fix involved properly detecting the input document's declared encoding via parse_xml_decl() and passing GetDatabaseEncoding() to xml_parse().
  2. Memory management: Early versions leaked xmlDocPtr objects.
  3. CONTENT vs DOCUMENT parsing: C14N requires well-formed XML, so the implementation forces XMLOPTION_DOCUMENT parsing even for CONTENT inputs that happen to be singly-rooted.
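
The document-versus-content distinction in item 3 can be sketched with Python's standard library, whose parser stands in here for xml_parse() with XMLOPTION_DOCUMENT. This is an analogy, not the patch's code: the helper try_canonicalize is hypothetical, and canonicalize() implements C14N 2.0 rather than 1.1.

```python
from xml.etree.ElementTree import canonicalize, ParseError

def try_canonicalize(fragment):
    """Return the canonical form, or None if the input is not a
    well-formed, singly-rooted document."""
    try:
        return canonicalize(fragment)
    except ParseError:
        return None

# A singly-rooted input parses as a document and can be canonicalized.
single_root = try_canonicalize('<a><b/></a>')

# Inputs that are only valid as CONTENT (several roots, or bare text)
# are rejected by a document parser.
multi_root = try_canonicalize('<a/><b/>')
text_only = try_canonicalize('just text')

print(single_root, multi_root, text_only)
```

In this sketch only the singly-rooted input yields a canonical form; the other two are rejected at parse time.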

Phase 2: Standalone Function (Sep 2024 – present)

Tom Lane raised a critical architectural concern: extending XMLSERIALIZE syntax risks future collision with SQL committee standardization efforts. The SQL committee might standardize either different semantics for CANONICAL or different syntax for the same functionality. A plain function can never conflict with future SQL grammar changes.

The solution became:

xmlcanonicalize(xmldoc xml, keep_comments boolean DEFAULT true)

This is implemented as a built-in SQL function backed by C code in xml.c.

Key Technical Implementation Details

libxml2 Integration

The core implementation calls xmlC14NDocDumpMemory() with the XML_C14N_1_1 enum constant, which implements the full C14N 1.1 specification. The function:

  1. Parses input XML into an xmlDocPtr using xml_parse()
  2. Calls xmlC14NDocDumpMemory(doc, NULL, XML_C14N_1_1, NULL, keep_comments, &xmlbuf)
  3. Converts the UTF-8 output to server encoding via pg_any_to_server()
  4. Wraps the call in PG_TRY/PG_CATCH for proper cleanup on error

Return Type Decision

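An interesting late-stage design discussion concerned the return type. The function initially returned xml, matching other xml*() functions. However, the stated purpose of canonicalization is comparison, and the xml type has no equality operator.

The final version returns text, which better serves the comparison use case directly.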

Encoding Correctness

A subtle correctness issue: C14N 1.1 mandates UTF-8 output (libxml2's xmlC14NDocDumpMemory always produces UTF-8). For databases using non-UTF-8 server encodings (e.g., LATIN1, LATIN2), the raw UTF-8 bytes cannot be stored directly in text/varchar columns. The final implementation adds pg_any_to_server() conversion from UTF-8 to the database encoding.
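
The effect of the conversion step can be sketched in Python. The helper utf8_to_server below is a hypothetical stand-in for pg_any_to_server(), not its actual implementation, and 'latin-1' plays the role of a LATIN1 server encoding.

```python
from xml.etree.ElementTree import canonicalize

def utf8_to_server(data: bytes, server_encoding: str) -> bytes:
    """Hypothetical stand-in for pg_any_to_server(): transcode UTF-8
    bytes into the server encoding."""
    return data.decode('utf-8').encode(server_encoding)

# Canonical output is UTF-8: the character 'é' occupies two bytes.
canon = canonicalize('<p>café</p>')
utf8_bytes = canon.encode('utf-8')

# In a LATIN1 database the same character must be stored as one byte.
latin1_bytes = utf8_to_server(utf8_bytes, 'latin-1')

# A character with no equivalent in the server encoding cannot be converted.
try:
    utf8_to_server('<p>€</p>'.encode('utf-8'), 'latin-1')
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False

print(len(utf8_bytes), len(latin1_bytes), euro_ok)
```

The byte sequences differ even though the text is the same, which is why the raw libxml2 output cannot simply be copied into a text value. The last case mirrors the situation where a document contains characters the server encoding cannot represent, which an encoding conversion would be expected to reject rather than store.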

Comments Handling Design

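The default for keep_comments evolved through discussion; the function signature shown earlier uses DEFAULT true, so comments are preserved unless the caller opts out.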
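
To illustrate what the comments flag controls, here is a small Python sketch. The with_comments option of xml.etree.ElementTree.canonicalize() is only an analog of keep_comments: it belongs to Python's C14N 2.0 implementation, and unlike the signature in this proposal it defaults to False.

```python
from xml.etree.ElementTree import canonicalize

# Two documents that differ only by a comment.
doc_x = '<root><!-- reviewed --><item/></root>'
doc_y = '<root><item/></root>'

# Keeping comments preserves the difference between the documents.
kept_x = canonicalize(doc_x, with_comments=True)
kept_y = canonicalize(doc_y, with_comments=True)

# Dropping comments makes the two canonical forms identical.
stripped_x = canonicalize(doc_x, with_comments=False)
stripped_y = canonicalize(doc_y, with_comments=False)

print(kept_x == kept_y, stripped_x == stripped_y)
```

Whether comments count as part of a document's identity depends on the use case, which is why the choice of default matters for comparison results.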

Relationship to Existing XMLSERIALIZE Features

The patch interacts with the earlier INDENT/NO INDENT feature (also authored by Jim Jones). Pavel Stehule raised concerns about conceptual overlap between NO INDENT and CANONICAL.

These are distinct operations. NO INDENT controls whitespace presentation; CANONICAL produces a semantically-normalized form for comparison. A potential bug was identified where xmlSaveToBuffer (used by INDENT) ignores elements when whitespace exists between them, but this is orthogonal to canonicalization.

Architectural Significance

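This feature fills a real gap in PostgreSQL's XML support. Without canonicalization, users cannot reliably compare XML documents for equivalence, because textually different serializations of the same logical document do not match as strings.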

The function approach (xmlcanonicalize) is the safest path forward architecturally: it provides the functionality without constraining future SQL/XML standardization of XMLSERIALIZE syntax.