Analysis: Code Contributions Metrics for PostgreSQL — Methodology, Automation, and Identity Resolution
Core Problem
This thread spans two years (2024-2025) of Robert Haas's annual effort to quantify PostgreSQL code contributions. The fundamental challenge is attribution accuracy in a large open-source project with diverse international contributors, inconsistent metadata in commit messages, and inherently ambiguous authorship boundaries. The problem has several dimensions:
-
Identity resolution: Contributors may have their names spelled differently across commits (e.g., "Hou Zhijie" vs "Zhijie Hou", "Hayato Kuroda" vs "Kuroda Hayato"), use different email accounts, or have email display names that don't match their actual identity.
-
Multi-author attribution: Many patches are collaborative, but the methodology only credits a single "primary author" per commit. There is no standardized way to apportion credit among co-authors.
-
Mechanical vs. substantive code changes: Large commits (pgindent runs, translation updates, Unicode table regeneration) must be excluded or adjusted to avoid distorting the statistics. Reverts must also be handled carefully.
-
Measuring non-code contributions: Review work is enormously important but poorly captured by commit-based metrics. The
Reviewed-byheader exists but has significant limitations as a metric.
Evolution from 2024 to 2025
A notable improvement between the two years was automation of author extraction. In 2024, Robert manually inspected all commit messages to determine primary authorship. By 2025, he was able to programmatically extract the first Author: or Co-Authored-By: tag from commit messages. This reflects the community's gradual adoption of structured commit message metadata (a trend driven by discussions around commit message standards).
The 2025 analysis also introduced a more nuanced approach to mechanical commits: rather than simply zeroing out xlines for commits touching Unicode tables, Robert decremented xlines by the size of the Unicode table changes specifically, preserving credit for the non-mechanical portions of those commits.
Key Technical Decisions and Tradeoffs
Single-Author Attribution
Robert's methodology credits only the first listed author. Álvaro Herrera suggested crediting all lines to all authors (a "full credit" model), but Robert rejected this because it would disproportionately inflate committer statistics — a committer who adds 50 lines to a 1000-line patch would receive credit for 1000 lines. This is a deliberate tradeoff favoring precision over inclusivity.
Lines Metric
The lines count is derived from git log --shortstat -w -M, meaning:
-wignores whitespace-only changes (explaining why a commit adding 3 lines might showlines=2)-Menables rename detection, avoiding inflated counts from file moves
Revert Handling
The approach is to set xlines=0 for both the revert and the reverted commit, but only when both occur in the same calendar year. In 2025, reverts were small enough that Robert didn't pursue this rigorously.
Reviewer Statistics
Dmitry Dolgov raised the possibility of gathering Reviewed-by statistics. Robert acknowledged the value but identified fundamental limitations:
- Reviews vary enormously in depth and effort
- Not all helpful contributions are formal "reviews"
Reviewed-byonly captures reviews of committed work, missing reviews of rejected patches- The rise of AI-generated emails makes email-count metrics increasingly unreliable
Identity Resolution Challenges
Bertrand Drouvot's use of pg_trgm similarity matching was an elegant approach to finding duplicate entries:
SELECT a1.author, a2.author, similarity(a1.author, a2.author)
FROM top_authors2025 a1
JOIN top_authors2025 a2 ON a1.author < a2.author
WHERE similarity(a1.author, a2.author) > 0.6
This identified several categories of problems:
- Name order reversal (Eastern vs Western ordering): "Hou Zhijie" / "Zhijie Hou"
- Transliteration variants: "Andrei Lepikhov" / "Andrey Lepikhov", "Mihail Nikalayeu" / "Mikhail Nikalayeu"
- Typos: "Lukas Fitti" / "Lukas Fittl"
- False positives: "Dmitry Koval" / "Dmitry Kovalenko" (different people)
Robert noted a "somewhat alarming" trend of contributors (particularly from Russia and China) having email display names completely different from their actual names, making automated identity resolution even harder.
Broader Metadata Standardization Discussion
Daniel Gustafsson proposed settling on a metadata standard for commit messages that would identify authors, reviewers, and backpatches in a machine-parseable way. This would lower the barrier for contribution analysis and enable multiple independent analyses to work from consistent data. Joe Conway noted that multiple groups (Robert's blog, the Gustafsson/Giordano PGConf presentation, the contributor committee) are doing overlapping work that could benefit from standardization.
Incidental Finding: Commit Message Style
Álvaro Herrera's analysis of period usage in commit subjects revealed a clear bimodal distribution among committers — most are firmly in one camp or the other, with only Robert Haas and Joe Conway showing inconsistency. While tangential, this illustrates how the contribution dataset enables meta-analysis of project conventions.
Limitations Acknowledged
Robert has consistently been transparent about methodological limitations:
- Line counts are a crude proxy for contribution significance
- Single-author attribution understates collaborative work
- Email counts for reviewer activity are increasingly unreliable
- Mechanical changes require manual identification and exclusion
- The approach rewards volume over impact