May 25, 2026

eval

LLM Agents Lose 30 Points When Backend Constraints Stack Up

A new study finds that LLM coding agents degrade sharply as structural constraints accumulate in backend code generation tasks. Agents that perform well under loose specs often collapse when architecture, ORM, and framework conventions all apply at once.

LLM coding agents look capable until you give them real production requirements. A new paper from Dente, Satriani, and Papotti names the problem directly: constraint decay.

The finding is simple and uncomfortable. As structural requirements accumulate, agent performance drops substantially. Capable configurations lose 30 points on average in assertion pass rates when moving from baseline tasks to fully specified ones. Weaker configurations approach zero.

The researchers ran 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks. They fixed a unified API contract across all tasks, which isolates the effect of structural complexity from functional variation. Evaluation combined end-to-end behavioral tests with static verifiers, a dual approach that catches problems unit tests alone miss.
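To make the metric concrete, here is a minimal sketch of how an assertion pass rate and a drop in points could be computed when behavioral assertions and static checks are pooled. The `TaskResult` fields, the pooling rule, and every number below are assumptions invented for illustration. They are not taken from the paper, whose exact aggregation may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskResult:
    """Hypothetical per-task tally; field names are illustrative only."""
    behavioral_passed: int  # end-to-end assertions that passed
    behavioral_total: int
    static_passed: int      # static verifier checks that passed
    static_total: int


def assertion_pass_rate(results: Iterable[TaskResult]) -> float:
    """Return the pooled pass rate as a percentage (0.0 if nothing was checked)."""
    results = list(results)
    passed = sum(r.behavioral_passed + r.static_passed for r in results)
    total = sum(r.behavioral_total + r.static_total for r in results)
    return 100.0 * passed / total if total else 0.0


# Invented numbers: a loosely specified task has no static checks to fail,
# while a fully specified one adds structural checks on top.
baseline = [TaskResult(behavioral_passed=18, behavioral_total=20,
                       static_passed=0, static_total=0)]
constrained = [TaskResult(behavioral_passed=12, behavioral_total=20,
                          static_passed=6, static_total=10)]

drop_in_points = assertion_pass_rate(baseline) - assertion_pass_rate(constrained)
```

With these made-up tallies the drop is an absolute difference between two percentages, which is the reading of "points" this sketch assumes.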

Framework choice matters more than many teams expect. Agents succeed in minimal, explicit frameworks like Flask and perform substantially worse on average in convention-heavy environments like FastAPI and Django. The conventions that make those frameworks productive for humans appear to create ambiguity that agents handle poorly.

When the researchers traced errors to their root causes, data-layer defects dominated. Incorrect query composition and ORM runtime violations were the leading failure modes. This is exactly the category of bug that passes a surface-level functional check but breaks under load or schema change.
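One familiar shape of query-composition defect is the N+1 pattern: code that returns the right data but issues one query per parent row. The sketch below is an illustration of the category, not a failure case from the paper. It uses Python's standard-library `sqlite3` as a stand-in for a real ORM, and the `author`/`post` schema and all function names are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE post (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES author(id),
            title TEXT NOT NULL
        );
        INSERT INTO author (id, name) VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
        INSERT INTO post (author_id, title)
            VALUES (1, 'a1'), (1, 'a2'), (2, 'b1'), (3, 'c1');
    """)
    conn.commit()
    return conn


def titles_n_plus_one(conn):
    """Functionally correct, but runs one extra query for every author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM author ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM post WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(conn):
    """Same output for this data (every author has a post), one query."""
    result = {}
    rows = conn.execute(
        "SELECT a.name, p.title FROM author a "
        "JOIN post p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


def count_statements(conn, fn):
    """Run fn(conn) and count the SQL statements SQLite executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        value = fn(conn)
    finally:
        conn.set_trace_callback(None)
    return value, len(statements)
```

A behavioral test comparing the two return values sees no difference. Only a check on the number of statements issued, or a static rule against queries inside loops, separates them, and the gap grows with the number of authors.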

Existing benchmarks have not been measuring this. Most reward functionally correct solutions and ignore structural constraints like architectural patterns, database integration, and object-relational mappings. A solution can score well on current benchmarks while being entirely unfit for a production codebase.

The practical implication is direct. If you are using a coding agent to generate or extend backend services, do not treat a passing test suite as sufficient validation. The agent may have satisfied your functional contract while ignoring your structural one. Add static verification for ORM usage, query composition, and framework-specific conventions as a separate gate in your review or CI process. The more constraints your codebase carries, the less you can trust the agent to track all of them simultaneously.
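As a starting point for such a gate, here is a minimal sketch of a static check built on Python's standard-library `ast` module. It flags `.execute(...)` calls whose SQL is assembled with an f-string, `+`, or `%`. The rule, the class and function names, and the sample source are all assumptions chosen for illustration. The paper's verifiers are not described here, and a real gate would need rules specific to your ORM and framework.

```python
import ast


class RawSqlChecker(ast.NodeVisitor):
    """Flag .execute(...) calls whose first argument is built dynamically."""

    def __init__(self):
        self.findings = []  # list of (line number, message)

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == "execute" and node.args:
            first = node.args[0]
            if isinstance(first, ast.JoinedStr):
                # f"..." literal passed directly as the SQL text
                self.findings.append((node.lineno, "f-string passed to execute()"))
            elif isinstance(first, ast.BinOp) and isinstance(first.op, (ast.Add, ast.Mod)):
                # "..." + value  or  "..." % value
                self.findings.append(
                    (node.lineno, "string-built SQL passed to execute()")
                )
        self.generic_visit(node)  # keep walking nested calls


def check_source(source, filename="<generated>"):
    """Parse source code and return a list of findings."""
    checker = RawSqlChecker()
    checker.visit(ast.parse(source, filename=filename))
    return checker.findings


# Hypothetical agent output: the first function interpolates user input
# into SQL, the second uses a bound parameter.
SAMPLE = '''
def find_user(cur, name):
    cur.execute(f"SELECT id FROM users WHERE name = '{name}'")

def find_user_ok(cur, name):
    cur.execute("SELECT id FROM users WHERE name = ?", (name,))
'''

findings = check_source(SAMPLE)
```

In CI, a wrapper script could run `check_source` over each generated file and exit non-zero when the list is non-empty, so the structural gate fails independently of the test suite. The check is deliberately narrow: it misses SQL built in a variable before the call, and it says nothing about ORM-level conventions.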