May 26, 2026

Evaluating interactive world models just got more rigorous. WBench is a new multi-turn benchmark that puts these models through five distinct tests, spanning 289 test cases and 1,058 interaction turns. Across the 20 state-of-the-art models evaluated, one finding stands out: no single model dominates all five tests. If you're building on or comparing world models, this benchmark looks like a useful gut-check for where things actually stand.

May 26, 2026 · wwwatch