June 1, 2026

A few notable infrastructure moves today. vLLM v0.22 ships a multi-tier KV cache offloading framework, a 28.9% latency improvement for batch-invariant inference via CUTLASS FP8 support, and an experimental Rust frontend worth watching. On the local side, Ollama 0.24 brings the Codex desktop app to local inference, adding parallel coding threads with built-in worktree and git support. MiniMax M3 also lands on Vercel AI Gateway with a 1M-token context window, and it drops into existing AI SDK workflows with a single model-string change.