2026-06-01 · claude-opus-4-6

Analysis: Code Contributions Metrics for PostgreSQL — Methodology, Automation, and Identity Resolution

Core Problem

This thread spans two years (2024-2025) of Robert Haas's annual effort to quantify PostgreSQL code contributions. The fundamental challenge is attribution accuracy in a large open-source project with diverse international contributors, inconsistent metadata in commit messages, and inherently ambiguous authorship boundaries. The problem has several dimensions:

Identity resolution: Contributors may have their names spelled differently across commits (e.g., "Hou Zhijie" vs "Zhijie Hou", "Hayato Kuroda" vs "Kuroda Hayato"), use different email accounts, or have email display names that don't match their actual identity.
Multi-author attribution: Many patches are collaborative, but the methodology only credits a single "primary author" per commit. There is no standardized way to apportion credit among co-authors.
Mechanical vs. substantive code changes: Large commits (pgindent runs, translation updates, Unicode table regeneration) must be excluded or adjusted to avoid distorting the statistics. Reverts must also be handled carefully.
Measuring non-code contributions: Review work is enormously important but poorly captured by commit-based metrics. The Reviewed-by header exists but has significant limitations as a metric.

Evolution from 2024 to 2025

A notable improvement between the two years was automation of author extraction. In 2024, Robert manually inspected all commit messages to determine primary authorship. By 2025, he was able to programmatically extract the first Author: or Co-Authored-By: tag from commit messages. This reflects the community's gradual adoption of structured commit message metadata (a trend driven by discussions around commit message standards).

The 2025 analysis also introduced a more nuanced approach to mechanical commits: rather than simply zeroing out xlines for commits touching Unicode tables, Robert decremented xlines by the size of the Unicode table changes specifically, preserving credit for the non-mechanical portions of those commits.

Key Technical Decisions and Tradeoffs

Single-Author Attribution

Robert's methodology credits only the first listed author. Álvaro Herrera suggested crediting all lines to all authors (a "full credit" model), but Robert rejected this because it would disproportionately inflate committer statistics — a committer who adds 50 lines to a 1000-line patch would receive credit for 1000 lines. This is a deliberate tradeoff favoring precision over inclusivity.

Lines Metric

The lines count is derived from git log --shortstat -w -M, meaning:

-w ignores whitespace-only changes (explaining why a commit adding 3 lines might show lines=2)
-M enables rename detection, avoiding inflated counts from file moves

Revert Handling

The approach is to set xlines=0 for both the revert and the reverted commit, but only when both occur in the same calendar year. In 2025, reverts were small enough that Robert didn't pursue this rigorously.

Reviewer Statistics

Dmitry Dolgov raised the possibility of gathering Reviewed-by statistics. Robert acknowledged the value but identified fundamental limitations:

Reviews vary enormously in depth and effort
Not all helpful contributions are formal "reviews"
Reviewed-by only captures reviews of committed work, missing reviews of rejected patches
The rise of AI-generated emails makes email-count metrics increasingly unreliable

Identity Resolution Challenges

Bertrand Drouvot's use of pg_trgm similarity matching was an elegant approach to finding duplicate entries:

SELECT a1.author, a2.author, similarity(a1.author, a2.author)
FROM top_authors2025 a1
JOIN top_authors2025 a2 ON a1.author < a2.author
WHERE similarity(a1.author, a2.author) > 0.6

This identified several categories of problems:

Name order reversal (Eastern vs Western ordering): "Hou Zhijie" / "Zhijie Hou"
Transliteration variants: "Andrei Lepikhov" / "Andrey Lepikhov", "Mihail Nikalayeu" / "Mikhail Nikalayeu"
Typos: "Lukas Fitti" / "Lukas Fittl"
False positives: "Dmitry Koval" / "Dmitry Kovalenko" (different people)

Robert noted a "somewhat alarming" trend of contributors (particularly from Russia and China) having email display names completely different from their actual names, making automated identity resolution even harder.

Broader Metadata Standardization Discussion

Daniel Gustafsson proposed settling on a metadata standard for commit messages that would identify authors, reviewers, and backpatches in a machine-parseable way. This would lower the barrier for contribution analysis and enable multiple independent analyses to work from consistent data. Joe Conway noted that multiple groups (Robert's blog, the Gustafsson/Giordano PGConf presentation, the contributor committee) are doing overlapping work that could benefit from standardization.

Incidental Finding: Commit Message Style

Álvaro Herrera's analysis of period usage in commit subjects revealed a clear bimodal distribution among committers — most are firmly in one camp or the other, with only Robert Haas and Joe Conway showing inconsistency. While tangential, this illustrates how the contribution dataset enables meta-analysis of project conventions.

Limitations Acknowledged

Robert has consistently been transparent about methodological limitations:

Line counts are a crude proxy for contribution significance
Single-author attribution understates collaborative work
Email counts for reviewer activity are increasingly unreliable
Mechanical changes require manual identification and exclusion
The approach rewards volume over impact

code contributions for 2024, WIP version

Latest Update