June 1, 2026

vLLM v0.22 Hardens DeepSeek, Adds Rust Frontend and Multi-Tier KV Cache

vLLM v0.22.0 lands major DeepSeek V4 hardening, an experimental Rust frontend, and a new multi-tier KV cache offloading framework. Batch-invariant inference also gets a 28.9% latency improvement via Cutlass FP8 support.

vLLM v0.22.0 shipped this week with 459 commits from 230 contributors, 63 of them new. The headline changes touch inference reliability, serving architecture, and memory management in ways that matter if you are running large models in production.

DeepSeek V4 now gets first-class support. The model was reorganized into a dedicated vllm/models/deepseek_v4/ package and received a full hardening pass. It gained NVFP4 fused MoE support, full and piecewise CUDA graph execution, and multi-token prediction (MTP) speculative decoding. A set of fused kernels (MegaMoE, mhc, Q-norm, indexer, sparse MLA) landed alongside accuracy fixes. If you are serving DeepSeek V4, make this upgrade your first priority.

Batch-invariant inference got measurably faster. Cutlass FP8 support brings a 28.9% end-to-end latency improvement for batch-invariant workloads. The same path now supports compile mode on SM80 hardware (NVIDIA Ampere-class GPUs such as the A100) and an NVFP4 Cutlass linear route. A figure that specific is worth checking with a benchmark run on your own traffic.
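To compare your own numbers against a headline figure like this, it helps to pin down the arithmetic. The sketch below assumes "latency improvement" means the relative reduction in end-to-end latency between two builds, and it uses the median of each sample set by default. The function name and the sample values are illustrative placeholders, not measurements from vLLM.

```python
from statistics import median


def latency_improvement_pct(baseline_s, candidate_s, reducer=median):
    """Percent reduction in latency of the candidate build relative to the baseline.

    Assumption: "improvement" is (baseline - candidate) / baseline * 100,
    computed on a summary statistic (median by default) of each sample list.
    """
    if not baseline_s or not candidate_s:
        raise ValueError("both sample lists must be non-empty")
    base = reducer(baseline_s)
    cand = reducer(candidate_s)
    if base <= 0:
        raise ValueError("baseline latency must be positive")
    return (base - cand) / base * 100.0


# Placeholder latencies in seconds, for illustration only.
previous_build = [2.0, 2.1, 1.9, 2.0, 2.0]
new_build = [1.5, 1.6, 1.4, 1.5, 1.5]
print(f"{latency_improvement_pct(previous_build, new_build):.1f}% lower median latency")
```

Run both builds against the same prompts, batch sizes, and hardware before comparing; a percentage computed across different traffic mixes says little.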

Model Runner V2 is closing in on becoming the default. MRv2 now automatically selects itself for Qwen3 dense models via an oracle, reloads weights in sleep mode, and supports shared KV-cache layers. It falls back to MRv1 automatically when a KV connector is present, so existing connector-based deployments are not broken. The team is clearly treating MRv2 as the forward path.
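The selection behavior described above can be pictured as a small decision function. This is a hypothetical sketch of the rules as stated in this article, not vLLM's oracle: the class, field names, and return values are invented for illustration, and the default of MRv1 for all other models is an assumption based on MRv2 not yet being the default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentInfo:
    # Hypothetical stand-ins; these are not vLLM config attributes.
    model_family: str       # e.g. "qwen3"
    is_dense: bool          # False for mixture-of-experts variants
    has_kv_connector: bool  # True when a KV connector is configured


def pick_model_runner(info: DeploymentInfo) -> str:
    """Return "v2" or "v1" following the rules summarized in the text."""
    # A KV connector forces the fallback so connector-based setups keep working.
    if info.has_kv_connector:
        return "v1"
    # Qwen3 dense models are opted in to the new runner automatically.
    if info.model_family == "qwen3" and info.is_dense:
        return "v2"
    # Assumption: everything else stays on the existing runner for now.
    return "v1"


print(pick_model_runner(DeploymentInfo("qwen3", is_dense=True, has_kv_connector=False)))
```

The practical takeaway is the ordering: the connector check comes first, so adding a KV connector to a Qwen3 dense deployment would move it back to MRv1 under these rules.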

A Rust frontend is now in tree. This is experimental, but the implementation moved into the main repository along with a data-parallel supervisor for DP serving. If your team cares about frontend latency or wants to contribute at the serving layer, this is the place to watch.

Multi-tier KV cache offloading extends beyond CPU RAM. The new framework adds a Python filesystem secondary tier and Mooncake disk offloading, and it supports DeepSeek V4. For teams running memory-constrained deployments or long-context workloads, this opens options that did not exist in the previous release.
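For readers new to tiered caching, the general idea can be shown with a toy example: keep hot entries in a fast tier, spill the least recently used ones to a slower tier, and promote them back on access. The sketch below is a generic illustration in plain Python and does not reflect vLLM's offloading framework or its APIs. The class name, the pickle-on-disk format, and the assumption that keys are filesystem-safe strings are all hypothetical.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Toy LRU cache: hot entries in memory, evicted entries spilled to disk."""

    def __init__(self, capacity: int, spill_dir: Path):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.spill_dir = Path(spill_dir)
        self.spill_dir.mkdir(parents=True, exist_ok=True)
        self.hot = OrderedDict()  # primary tier, ordered oldest to newest

    def _path(self, key: str) -> Path:
        # Assumption: keys are filesystem-safe strings (e.g. hex block hashes).
        return self.spill_dir / f"{key}.pkl"

    def put(self, key: str, value) -> None:
        self.hot[key] = value
        self.hot.move_to_end(key)
        # Spill least recently used entries once the primary tier is full.
        while len(self.hot) > self.capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self._path(old_key).write_bytes(pickle.dumps(old_value))

    def get(self, key: str):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if path.exists():
            # Secondary-tier hit: load, remove the spilled copy, and promote.
            value = pickle.loads(path.read_bytes())
            path.unlink()
            self.put(key, value)
            return value
        return None  # miss in both tiers


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = TwoTierCache(capacity=2, spill_dir=Path(tmp))
        for name in ("a", "b", "c"):
            cache.put(name, name.upper())
        print(list(cache.hot))  # "a" has been spilled to disk
        print(cache.get("a"))   # loaded from disk and promoted
```

A production system has to deal with concerns this toy ignores, such as asynchronous transfers, block granularity, and concurrent access, which is why a staging test on real traffic matters more than any simplified model.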

New model architectures in this release include MiniCPM-V 4.6, InternS2 Preview, and OpenVLA. Speculative decoding gains a custom callable proposer backend, post-norm EAGLE-3 support, and peagle speculators.

What to do today: If you serve DeepSeek V4, pull v0.22.0 and benchmark against your previous build. The CUDA graph and MTP additions alone warrant a fresh latency measurement. If you are evaluating multi-tier offloading for long-context or memory-constrained setups, the new framework gives you a disk offload path worth testing in staging before your next capacity crunch.