Transformers v5.9.0 Brings Three New Architectures and One Breaking Change

Transformers v5.9.0 lands with three new model architectures, a breaking input change for vision models, and expanded audio tooling. Here is what matters if you are building on top of the library today.

Three new architectures to know

Cohere2Moe brings Command A+ into the library. It is a Mixture-of-Experts model with a hybrid attention pattern that combines sliding-window layers with full-attention layers. It uses both shared experts, which process every token, and routed experts, which are selected per token, and it supports a very large context window. If you need long-context processing with MoE efficiency, this is the one to reach for first. The model is covered in the library documentation.
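To make the hybrid attention pattern concrete, here is a minimal sketch of the two mask types and one way they could alternate across layers. This is not the Cohere2Moe implementation: the function names, the window size, and the "every Nth layer is full attention" ratio are all assumptions chosen for illustration.

```python
def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal sliding window: query i attends only to the most recent
    `window` keys, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def layer_types(num_layers, full_every):
    """Hypothetical interleaving: every `full_every`-th layer uses full
    attention and the rest use sliding-window attention. The real ratio
    is defined by the model config, not by this sketch."""
    return [
        "full" if (layer + 1) % full_every == 0 else "sliding"
        for layer in range(num_layers)
    ]


if __name__ == "__main__":
    for row in sliding_window_mask(seq_len=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(layer_types(num_layers=8, full_every=4))
```

The appeal of this kind of design is that sliding-window layers keep attention cost bounded by the window size, while the occasional full-attention layer lets information travel across the whole context.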

HRM-Text is the more architecturally unusual addition. It runs a hierarchical recurrent forward pass across two transformer stacks, one for slow, abstract planning and one for fast, detailed computation, nested inside a recurrence loop. It uses PrefixLM attention, where instruction tokens attend bidirectionally and response tokens attend causally. It also adds per-head sigmoid output gates and parameterless RMSNorm. HRM-Text ships as a base language model with no instruction tuning and no chat template, so treat it as raw material for further fine-tuning rather than a ready-made assistant. The paper and the model documentation cover the details.
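Two of these ingredients are easy to show in isolation. The sketch below builds a PrefixLM attention mask and applies an RMSNorm with no learned scale. It is a plain-Python illustration under stated assumptions, not the HRM-Text code: the function names and the `eps` value are hypothetical.

```python
import math


def prefix_lm_mask(prefix_len, seq_len):
    """mask[i][j] is True when query position i may attend to key position j.

    Positions below `prefix_len` (the instruction) are visible to every
    query, so the prefix attends to itself bidirectionally. Later
    positions (the response) are visible only causally (j <= i).
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def rms_norm(x, eps=1e-6):
    """Parameterless RMSNorm: divide by the root mean square of the vector,
    with no learned gain or bias. `eps` is an assumed stabilizer value."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


if __name__ == "__main__":
    # Two instruction tokens followed by two response tokens.
    for row in prefix_lm_mask(prefix_len=2, seq_len=4):
        print("".join("x" if allowed else "." for allowed in row))
    print(rms_norm([3.0, 4.0]))
```

In the printed mask, the first two rows are identical because both instruction tokens see the whole prefix, while each response row extends one position further to the right.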

Parakeet TDT, a speech recognition architecture (TDT stands for Token-and-Duration Transducer), rounds out the new additions on the audio side.

One breaking change you cannot ignore

If your pipeline touches SAM3, EdgeTAM, or SAM3-Lite-Text, update your inputs before upgrading. The text_embeds argument now expects the full text embeddings rather than only the pooler output. This aligns these models with the rest of the library, but inference code that still passes pooler outputs can break silently after the upgrade. Check your embedding extraction step first.
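One cheap way to audit an existing pipeline is to check the shape of whatever you pass as text_embeds. The helper below is a hypothetical guard, not part of the library. It assumes that a pooled output is 2-D, (batch, hidden), and that full text embeddings are 3-D, (batch, seq_len, hidden). Verify that assumption against the model documentation before relying on it.

```python
def check_text_embeds_shape(shape):
    """Raise if `shape` looks like a pooled output instead of full embeddings.

    ASSUMPTION: pooled outputs are (batch, hidden) and full text embeddings
    are (batch, seq_len, hidden). This is an illustrative guard only.
    """
    shape = tuple(shape)
    if len(shape) == 2:
        raise ValueError(
            f"text_embeds has shape {shape}, which looks like a pooled output; "
            "expected full text embeddings of shape (batch, seq_len, hidden)."
        )
    if len(shape) != 3:
        raise ValueError(f"text_embeds has unexpected shape {shape}.")
    return shape


if __name__ == "__main__":
    print(check_text_embeds_shape((2, 16, 512)))  # passes through unchanged
    try:
        check_text_embeds_shape((2, 512))
    except ValueError as err:
        print(f"caught: {err}")
```

With a tensor in hand, you would call it as `check_text_embeds_shape(text_embeds.shape)` just before the model call, which turns a silent mismatch into an explicit failure.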

Audio improvements across the board

AudioFlamingoNext checkpoints are now supported. Audio and vision encoders have been refactored to use standalone pure functions, which improves compilability and matters most if you run torch.compile in production. The release also adds clearer error messages for the case where you accidentally try to load audio from a video file, a common stumbling block that previously surfaced as a cryptic failure. New documentation covers audio and video processors for anyone setting up these pipelines from scratch.
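If you are pinned to an older version for now, you can get a similar early warning in your own loading code. The sketch below is a hypothetical pre-check based on file extensions. It does not reproduce the library's error handling, and both the helper name and the extension list are assumptions.

```python
from pathlib import Path

# ASSUMPTION: a small, non-exhaustive set of common video container extensions.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}


def ensure_audio_path(path):
    """Return `path` as a Path, or raise a readable error if it looks like video."""
    candidate = Path(path)
    if candidate.suffix.lower() in VIDEO_EXTENSIONS:
        raise ValueError(
            f"{str(candidate)!r} looks like a video file. Extract the audio "
            "track first, or use a video processor instead of an audio loader."
        )
    return candidate


if __name__ == "__main__":
    print(ensure_audio_path("interview.wav"))
    try:
        ensure_audio_path("interview.MP4")
    except ValueError as err:
        print(f"caught: {err}")
```

An extension check is only a heuristic, since a container can be misnamed, but it catches the most common version of this mistake before any decoding starts.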

What to do today

If you are running SAM3, EdgeTAM, or SAM3-Lite-Text in any production or staging pipeline, pin your current version and audit your text_embeds inputs before pulling v5.9.0. If you are experimenting with long-context MoE inference, Cohere2Moe is ready to test now. If you are researching hierarchical or recurrent language model architectures, HRM-Text gives you a documented, library-integrated baseline to build from.
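A small version guard can back up the pin by failing fast if an environment drifts onto the new release before the audit is done. This is a minimal sketch with hypothetical helper names. It compares only the numeric major, minor, and patch fields, so a pre-release such as "5.9.0rc1" is treated like "5.9.0".

```python
import re

# First release in which text_embeds expects full text embeddings (per this article).
BREAKING_VERSION = (5, 9, 0)


def parse_version(version):
    """Parse 'X.Y.Z' into a tuple of three ints, ignoring suffixes like 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_text_embeds_audit(installed_version):
    """True if the installed version is at or past the breaking release."""
    return parse_version(installed_version) >= BREAKING_VERSION


if __name__ == "__main__":
    for candidate in ["5.8.1", "5.9.0", "5.10.0.dev0"]:
        print(candidate, needs_text_embeds_audit(candidate))
```

In a real service you would pass in the result of `importlib.metadata.version("transformers")`. If you need strict pre-release ordering, `packaging.version.Version` is the more robust choice.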