May 26, 2026 · wwwatch

Evaluating interactive world models just got more rigorous. WBench is a new multi-turn benchmark that scores these models on five dimensions, using 289 test cases and 1,058 interaction turns. Across the 20 state-of-the-art models tested, no single model dominates every dimension. If you're building on or comparing world models, this benchmark looks like a useful gut-check for where things actually stand.

tool

cua-driver 0.2.0 Ships a Universal Binary for Mac Builders

cua-driver v0.2.0 lands with a universal binary covering Apple Silicon and Intel, plus a one-line install script. Here is what product engineers need to know to get running today.

framework

ByteDance Rewrites DeerFlow From Scratch as a Super Agent Harness

DeerFlow 2.0 is a ground-up rewrite from ByteDance that orchestrates sub-agents, memory, and sandboxes with extensible skills. It hit the number one spot on GitHub Trending after launch.

framework

LlamaFactory v0.9.4 Raises the Floor on Fine-Tuning Infrastructure

LlamaFactory v0.9.4 drops support for Python 3.9 and 3.10, migrates to uv, and ships OFT, Megatron-LM, KTransformers, and over 20 new model integrations. Here is what changes for teams running fine-tuning pipelines today.

eval

WBench Puts Interactive World Models Through Five Rigorous Tests

WBench is a new multi-turn benchmark that evaluates interactive world models across five dimensions using 289 test cases and 1,058 interaction turns. Testing 20 state-of-the-art models, it finds no single model dominates across all dimensions.