The question is simple: did Claude introduce more bugs into rsync? A developer published a distributional analysis of every rsync release to find out, using severity-weighted bugs per 10 commits and an exact permutation test against the full historical distribution.
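To make the metric concrete, here is a minimal sketch of how a severity-weighted bugs-per-10-commits figure could be computed for one release. The severity labels, the weights, and the function name are illustrative assumptions; the report's actual weighting scheme is not described here.

```python
# Hypothetical severity weights; the real analysis may use different
# categories and values.
SEVERITY_WEIGHTS = {"low": 1, "medium": 2, "high": 4}


def weighted_bugs_per_10_commits(bug_severities, commit_count):
    """Return the severity-weighted bug count normalized to 10 commits."""
    if commit_count <= 0:
        raise ValueError("commit_count must be positive")
    weighted_total = sum(SEVERITY_WEIGHTS[s] for s in bug_severities)
    return 10 * weighted_total / commit_count
```

Normalizing by commits rather than by lines of code keeps large and small releases comparable without rewarding verbose changes.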
The core methodological challenge was sample size. With only a handful of post-Claude releases available, naive approaches like bugs-per-line comparisons or linear regression would drown in noise. The author worked with his wife, who holds a Master's degree in Statistics from Penn State, to settle on a cleaner framing: where do the post-Claude releases fall in the historical distribution, and how likely is it that a random draw from that distribution produces releases as buggy or worse?
That framing matters for anyone running similar analyses on their own codebase. When you have only a few post-AI-assist releases to compare, you almost certainly do not have enough data to fit a regression model. A permutation test against the historical baseline is a more honest tool: it makes no assumption about the shape of the distribution and asks only how unusual the new releases are relative to the old ones.
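A one-sided exact permutation test of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the author's script: it pools historical and post-change releases, uses the sum of per-release rates as the test statistic, and enumerates every possible group of the same size. The function name and that choice of statistic are assumptions.

```python
from itertools import combinations
from math import comb


def exact_permutation_pvalue(historical, post):
    """Fraction of same-size groups, drawn from all releases, whose total
    rate is at least as high as the observed post-change group's total."""
    pooled = list(historical) + list(post)
    k = len(post)
    observed = sum(post)
    eps = 1e-12  # tolerance so float rounding does not drop exact ties
    hits = sum(
        1 for group in combinations(pooled, k) if sum(group) >= observed - eps
    )
    return hits / comb(len(pooled), k)
```

Because every grouping is enumerated, the p-value is exact rather than simulated. That is feasible precisely because the post-change group is small: the number of combinations grows quickly with its size.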
On reproducibility, the author was deliberate. Every number, statistic, and graph in the report is templated directly from the Python script that ran the analysis rather than written by hand. That rules out copy-paste errors and removes the risk of a language model hallucinating a figure into the prose. The scripts, themselves written by a language model, fetch the data, load it into DuckDB, build views, and run the statistics. The full pipeline can be rerun from scratch from the public repository.
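The templating idea can be sketched as follows, assuming a plain-text report and Python's standard `string.Template`. The sentence wording, placeholder names, and function name are hypothetical; the report's actual templating mechanism is not specified here.

```python
from statistics import median
from string import Template

# Hypothetical report sentence; every figure is a placeholder that the
# analysis code fills in, so no number is ever typed by hand.
REPORT = Template(
    "Across $n releases, the median rate was $median_rate "
    "weighted bugs per 10 commits."
)


def render_report(rates):
    """Compute the statistics and substitute them into the report text."""
    return REPORT.substitute(
        n=len(rates),
        median_rate=f"{median(rates):.2f}",
    )
```

`Template.substitute` raises `KeyError` when a placeholder has no value, so a statistic that was never computed fails loudly instead of slipping into the prose as a guess.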
The author also addressed transparency up front. To head off the obvious objection that this is just an AI defending AI, he documented exactly how AI assistance was used: the language model wrote the scripts and the original prose, humans chose the methodology and data sources, and the numbers are generated, not narrated.
For product engineers shipping with coding agents today, there are two concrete takeaways. First, if you want to measure whether your AI-assisted commits are changing defect rates, a permutation test against your historical release distribution is likely your best option until you accumulate enough data for anything more sophisticated. Second, template your statistics directly from your analysis scripts into your reports. Doing so removes a whole class of errors that neither humans nor language models are reliably good at catching.