June 1, 2026 · wwwatch

Three notable releases today. vLLM v0.22 ships a multi-tier KV cache offloading framework alongside a 28.9% latency improvement for batch-invariant inference via Cutlass FP8 support, plus an experimental Rust frontend worth watching. On the local side, Ollama 0.24 brings the Codex desktop app to local inference, adding parallel coding threads with built-in worktree and git support. MiniMax M3 also lands on Vercel AI Gateway, offering a 1M-token context window that drops into existing AI SDK workflows with a single model string change.

coding_agent

Ollama 0.24 Brings the Codex Desktop App to Local Inference

Ollama 0.24 ships support for the Codex App, a desktop experience for running parallel coding threads with built-in worktree and git support. Builders can now annotate live local servers, review code, and leave comments without leaving the app.

infra_api

MiniMax M3 Lands on Vercel AI Gateway With 1M-Token Context

MiniMax M3, a multimodal model with a 1M-token context window, is now accessible through Vercel AI Gateway. Builders can drop it into existing AI SDK workflows with a single model string change.

infra_api

vLLM v0.22 Hardens DeepSeek, Adds Rust Frontend and Multi-Tier KV Cache

vLLM v0.22.0 lands major DeepSeek V4 hardening, an experimental Rust frontend, and a new multi-tier KV cache offloading framework. Batch-invariant inference also gets a 28.9% latency improvement via Cutlass FP8 support.