Six AI Coding Tools Face a Real Architectural CAD Challenge

ModelRift published a hands-on benchmark comparing six AI coding tools on a single architectural task: build the Pantheon in OpenSCAD. The tools tested were Codex 5.5 High, Claude Sonnet, Claude Opus, Cursor Composer, Google Antigravity 2.0 (Gemini 3.5 Flash High), and ModelRift (Gemini Flash 3.0).

The choice of the Pantheon was deliberate. Basic OpenSCAD prompts, such as producing a cube with a hole, test almost nothing useful, because every current coding LLM handles difference(), cube(), and cylinder() without trouble. The Pantheon sits in a more interesting middle ground: a large radial rotunda, a dome with a central oculus, a rectangular portico, columns, stepped bases, and a triangular pediment. That combination exercises Boolean operations, radial symmetry, and extrusions together, which is exactly the kind of constructive geometry where OpenSCAD is strong.
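To make that combination concrete, here is a minimal sketch of the kind of code the task calls for. It is not taken from the benchmark or from any tool's output. It is a small Python helper (pantheon_scad, a hypothetical name) that emits OpenSCAD source for a very rough massing model: a hollow drum and dome with an oculus cut by difference(), niches placed by a radial loop, a row of columns, and a pediment made with linear_extrude(). All dimensions are illustrative assumptions.

```python
# Sketch only: emits OpenSCAD text for a rough Pantheon-like massing model.
# Names and dimensions are illustrative assumptions, not the benchmark's code.
SCAD_BODY = """
$fn = 96;

module rotunda() {
  difference() {
    union() {
      cylinder(h = drum_h, r = drum_r);               // drum
      translate([0, 0, drum_h]) sphere(r = drum_r);   // dome (assumes drum_h >= drum_r)
    }
    // hollow interior of drum and dome
    translate([0, 0, -1]) cylinder(h = drum_h + 1, r = drum_r - wall);
    translate([0, 0, drum_h]) sphere(r = drum_r - wall);
    // oculus: a vertical hole through the top of the dome
    translate([0, 0, drum_h]) cylinder(h = drum_r + 1, r = oculus_r);
    // radial symmetry: niches carved into the inner wall
    for (a = [0 : 360 / n_niches : 359])
      rotate([0, 0, a]) translate([drum_r - wall, 0, 2])
        cylinder(h = 8, r = wall / 2);
  }
}

module portico() {
  spacing = 4;
  width = spacing * n_columns;
  // a single row of columns in front of the drum
  for (i = [0 : n_columns - 1])
    translate([drum_r + 6, (i - (n_columns - 1) / 2) * spacing, 0])
      cylinder(h = col_h, r = 1);
  // triangular pediment: a 2D polygon extruded, then stood upright
  translate([drum_r + 6, 0, col_h])
    rotate([90, 0, 90])
      linear_extrude(height = 3, center = true)
        polygon([[-width / 2, 0], [width / 2, 0], [0, 5]]);
}

rotunda();
portico();
"""


def pantheon_scad(drum_r=20, drum_h=20, wall=3, oculus_r=4,
                  n_columns=8, n_niches=8, col_h=12):
    """Return OpenSCAD source: parameter assignments followed by the fixed body."""
    params = {
        "drum_r": drum_r, "drum_h": drum_h, "wall": wall, "oculus_r": oculus_r,
        "n_columns": n_columns, "n_niches": n_niches, "col_h": col_h,
    }
    header = "\n".join(f"{name} = {value};" for name, value in params.items())
    return header + "\n" + SCAD_BODY
```

Writing the returned string to a .scad file gives something OpenSCAD can open. Even this crude version has to place the portico relative to the drum, which is the spatial relationship the benchmark is probing.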

It is also a recognizable building. A weak result still looks vaguely like a domed structure. A better result has to get the spatial relationship between the round drum, the rectangular portico, the dome rings, and the front facade roughly correct. That makes visual inspection meaningful without requiring a formal scoring rubric.

Why OpenSCAD at all? The format is plain-text code with a compact vocabulary. A model can describe architecture as nested transformations, Boolean operations, and named modules. That is much closer to how language models work with structure than asking them to operate a 3D application through UI actions. ModelRift built its platform around OpenSCAD for exactly this reason.

The workflow used the OpenSCAD CLI to render previews and iterate. Each tool received the same prompt and the same reference images. The results were compared visually, with thumbnails labeled by client and model.
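The article does not give the exact commands used, so the following is only a sketch of what a CLI render step might look like, assuming an openscad binary on the PATH. The helper names render_command and render_preview are hypothetical; the -o, --imgsize, --autocenter, and --viewall options are standard OpenSCAD command-line options.

```python
import shutil
import subprocess
from pathlib import Path


def render_command(scad_path, png_path, size=(800, 600)):
    """Build the argument list for rendering a .scad file to a PNG preview."""
    return [
        "openscad",
        "-o", str(png_path),                  # output file; format follows the extension
        f"--imgsize={size[0]},{size[1]}",     # preview size in pixels
        "--autocenter",                       # center the model in the view
        "--viewall",                          # zoom so the whole model is visible
        str(scad_path),
    ]


def render_preview(scad_path, png_path, size=(800, 600)):
    """Run OpenSCAD; return True if it exited cleanly and wrote the image."""
    if shutil.which("openscad") is None:      # assumption: binary is named 'openscad'
        return False
    result = subprocess.run(
        render_command(scad_path, png_path, size),
        capture_output=True, text=True,
    )
    return result.returncode == 0 and Path(png_path).exists()
```

A failed render is useful feedback in itself: the captured stderr can be handed back to the model as input for the next iteration.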

The benchmark is practical, not academic. ModelRift generates OpenSCAD for every 3D model on its platform, so the ability of an LLM to handle spatial geometry directly affects what the team can ship. Tracking how models improve on this kind of task is an operational concern, not just a research curiosity.

For developers building CAD-adjacent tools or geometry pipelines on top of LLMs, the concrete takeaway is this: run your own version of this test before committing to a model. General code-generation benchmarks will not tell you how a model handles constructive solid geometry, radial symmetry, or the spatial reasoning needed to assemble multi-part architectural structures. Pick a recognizable target with distinct geometric features, render the output programmatically, and compare the renders side by side. That process will surface capability gaps that no leaderboard score will show you.
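For the visual comparison step, one low-effort option is a static HTML page of labeled thumbnails. The sketch below assumes the renders already exist as PNG files; the function name contact_sheet and the file paths are hypothetical, and this is not a description of how ModelRift produced its own comparison.

```python
from html import escape


def contact_sheet(entries, thumb_width=320):
    """Return an HTML page showing one labeled thumbnail per (tool, model, png_path)."""
    figures = []
    for tool, model, png_path in entries:
        caption = f"{escape(tool)} / {escape(model)}"
        figures.append(
            f'<figure><img src="{escape(png_path)}" width="{thumb_width}">'
            f"<figcaption>{caption}</figcaption></figure>"
        )
    head = "<!doctype html>\n<title>OpenSCAD render comparison</title>\n"
    return head + "\n".join(figures) + "\n"


if __name__ == "__main__":
    # Paths are placeholders; point them at your own rendered previews.
    page = contact_sheet([
        ("Claude Opus", "Claude Opus", "renders/claude_opus.png"),
        ("ModelRift", "Gemini Flash 3.0", "renders/modelrift.png"),
    ])
    with open("comparison.html", "w", encoding="utf-8") as fh:
        fh.write(page)
```

Keeping the tool and model names in the caption matters when the same model is reached through different clients, since the client's prompting and iteration loop can change the result.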