cua-driver v0.2.0 lands with a universal binary covering Apple Silicon and Intel, plus a one-line install script. Here is what product engineers need to know to get running today.
May 26, 2026
Evaluating interactive world models just got more rigorous. WBench is a new multi-turn benchmark that tests these models across five dimensions, using 289 test cases and 1,058 interaction turns. After running 20 state-of-the-art models through it, the finding is clear: no single model dominates across all dimensions. If you're building on or comparing world models, this benchmark looks like a useful gut-check for where things actually stand.
DeerFlow 2.0 is a ground-up rewrite from ByteDance that orchestrates sub-agents, memory, and sandboxes, and can be extended with skills. It reached the number one spot on GitHub Trending after launch.
LlamaFactory v0.9.4 drops support for Python 3.9 and 3.10, migrates to uv, and adds OFT, Megatron-LM, KTransformers, and more than 20 new model integrations. Here is what changes for teams running fine-tuning pipelines today.