Five Top LLMs Disagree on Two Thirds of Real Fact-Checks

If you are building a product that leans on a single frontier LLM to label claims as true or false, new data from Lenz Research should give you pause.

A study published at lenz.io ran 1,000 recent claims submitted by real users through five frontier models and asked each one for a verdict under a four-bucket rubric: True, Mostly True, Misleading, or False. These are not benchmark items with public answer keys. They are claims real users submitted to a fact-checking platform, which means there is no canonical label to pattern-match against.

The headline number: on 67% of claims (672 out of 1,000; 95% CI: 64 to 70%), at least one model dissented from the rest of the panel, or no majority formed at all. On 34% of claims (343 out of 1,000; 95% CI: 31 to 37%), the two models furthest apart differed by two or more buckets, for example one model returning True and another returning Misleading. That is not a calibration difference. That is a substantive disagreement about the answer.
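To make those two figures concrete, here is a minimal Python sketch of how a per-claim spread and a confidence interval could be computed. The function names (`max_spread`, `is_unanimous`, `wald_ci`) are hypothetical, the bucket ordering is taken from the rubric as listed, and the normal-approximation interval is an assumption: the study does not say which interval method it used, although this one reproduces the reported ranges after rounding.

```python
from math import sqrt

# Rubric buckets in ordinal order, lowest to highest (assumed from the article).
BUCKETS = ["False", "Misleading", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(BUCKETS)}


def max_spread(verdicts):
    """Bucket distance between the two most-disagreeing models on one claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def is_unanimous(verdicts):
    """True when every model returned the same bucket."""
    return len(set(verdicts)) == 1


def wald_ci(successes, total, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# One claim, five model verdicts: spread is 2 buckets (True vs. Misleading).
panel = ["True", "True", "Mostly True", "Misleading", "True"]
print(max_spread(panel), is_unanimous(panel))

# The article's counts: 672 and 343 out of 1,000.
print(wald_ci(672, 1000))  # roughly 0.64 to 0.70
print(wald_ci(343, 1000))  # roughly 0.31 to 0.37
```

Using the maximum minus the minimum rank is equivalent to checking every pair of models, since the most-disagreeing pair is always the one holding the lowest and highest verdicts.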

The inter-rater reliability score, Krippendorff's alpha (ordinal), came out at 0.639 across five raters on 1,000 items. On that scale, 1 means perfect agreement and 0 means agreement no better than chance. The researchers describe the result as nontrivial but limited agreement.
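For readers who want to compute the same statistic on their own panel output, the following is a sketch of ordinal Krippendorff's alpha in plain Python, following the standard coincidence-matrix formulation. It is not the study's code. The function name and input layout (one list of verdicts per claim) are assumptions made for illustration.

```python
from itertools import permutations

# Rubric buckets in ordinal order, lowest to highest (assumed from the article).
BUCKETS = ["False", "Misleading", "Mostly True", "True"]


def krippendorff_alpha_ordinal(units, categories=BUCKETS):
    """Ordinal Krippendorff's alpha.

    units: one list of labels per claim, one label per rater.
    """
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)

    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (m - 1), where m is the number of ratings on that claim.
    coincidence = [[0.0] * k for _ in range(k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a claim with a single rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidence[index[a]][index[b]] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidence]  # n_c for each category
    n = sum(totals)

    def delta_sq(c, d):
        # Ordinal distance: squared count of ratings lying between c and d,
        # with the two endpoint categories counted at half weight.
        lo, hi = min(c, d), max(c, d)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2.0) ** 2

    pairs = [(c, d) for c in range(k) for d in range(k)]
    observed = sum(coincidence[c][d] * delta_sq(c, d) for c, d in pairs)
    expected = sum(totals[c] * totals[d] * delta_sq(c, d) for c, d in pairs) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Two claims with full agreement on different buckets give alpha = 1.0.
print(krippendorff_alpha_ordinal([["True"] * 5, ["False"] * 5]))
```

The ordinal distance is what separates this from a nominal agreement score: a True versus False split is penalized more heavily than a True versus Mostly True split.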

The fractures are not evenly distributed across the rubric. Among the 328 claims where all five models agreed, only four were unanimous on Misleading and zero were unanimous on Mostly True. The middle of the rubric is where the panel breaks down. Some models concentrate verdicts at the True and False poles. Others spread across the two middle buckets. The panel converges on the extremes and fragments in the gray zone.
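One way to see whether a given model leans toward the poles or the middle is to tabulate the share of its verdicts that land on True or False. The sketch below assumes verdicts are stored as a dictionary keyed by model name; the function name, model names, and sample data are hypothetical and not drawn from the study.

```python
from collections import Counter

# The two extreme buckets of the four-bucket rubric.
POLAR = {"True", "False"}


def polar_share(verdicts_by_model):
    """Map each model name to the fraction of its verdicts that are True or False."""
    shares = {}
    for model, verdicts in verdicts_by_model.items():
        counts = Counter(verdicts)
        polar = sum(counts[label] for label in POLAR)
        shares[model] = polar / len(verdicts)
    return shares


# Illustrative data only: model_a is mostly polar, model_b mostly hedges.
sample = {
    "model_a": ["True", "False", "Misleading", "True"],
    "model_b": ["Mostly True", "Misleading", "Misleading", "False"],
}
print(polar_share(sample))
```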

For product engineers, the practical implication is direct. A single-model verdict on a nuanced claim is not a reliable ground truth. It is one opinion from a panel that disagrees roughly two thirds of the time on this class of content.

What should you do with this today? If your pipeline produces a verdict label from one model and surfaces it as a confident answer, add a confidence signal. On high-stakes claims, run at least one additional model, compare the outputs, and treat any gap of two or more buckets as a flag for human review rather than a final answer. The Lenz data also suggests you should track per-model verdict distributions in your own domain. Some models will skew polar on your content; others will hedge. Knowing which model behaves which way on your specific claim types is more useful than any aggregate benchmark score.
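The two-model check can be sketched in a few lines of Python. The function name `triage` and the three status strings are hypothetical. The two-bucket threshold comes from the recommendation above, while treating a one-bucket gap as "low_confidence" is an added assumption about how a confidence signal might be surfaced.

```python
# Rubric buckets in ordinal order, lowest to highest (assumed from the article).
BUCKETS = ["False", "Misleading", "Mostly True", "True"]
RANK = {label: i for i, label in enumerate(BUCKETS)}

REVIEW_GAP = 2  # gaps of this many buckets or more go to a human


def triage(primary_verdict, secondary_verdict, review_gap=REVIEW_GAP):
    """Compare two model verdicts and return a routing decision."""
    gap = abs(RANK[primary_verdict] - RANK[secondary_verdict])
    if gap >= review_gap:
        return "human_review"    # substantive disagreement, do not auto-publish
    if gap == 1:
        return "low_confidence"  # adjacent buckets, surface with a caveat
    return "agree"               # same bucket from both models


print(triage("True", "Mostly True"))  # adjacent buckets
print(triage("True", "Misleading"))   # two buckets apart
```

Keeping the threshold as a parameter makes it easy to tighten the rule for sensitive claim types, for instance by sending any disagreement at all to review.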