Transformers v5.9.0 Brings Three New Models and One Breaking Change

Transformers v5.9.0 ships three new model architectures, one breaking change that requires immediate attention, and a round of audio improvements. Here is what matters for builders.

Three new architectures

Cohere2Moe brings Command A+ into the library. It is a Mixture-of-Experts model with a hybrid attention pattern that combines sliding-window and full-attention layers. It supports a very large context window and uses both shared and routed experts. If you are building long-context retrieval or multi-document reasoning pipelines, it is worth evaluating.
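
To make the hybrid pattern concrete, here is a minimal sketch of how sliding-window and full-attention layers differ in what each token can see. It is an illustration, not the Cohere2Moe implementation: the function names, the window size, the causal masking, and the "every Nth layer is full attention" interleaving are all assumptions, since the release notes do not spell out the real layout.

```python
def causal_mask(seq_len, window=None):
    """Build a seq_len x seq_len mask; mask[q][k] is True when query q may attend to key k.

    window=None gives full causal attention. An integer window restricts each
    query to the most recent `window` positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # never look at future tokens
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, seq_len, window, full_every):
    """Hypothetical interleaving: every `full_every`-th layer uses full attention,
    all other layers use the sliding window. The real layer pattern may differ."""
    masks = []
    for layer in range(num_layers):
        is_full = (layer + 1) % full_every == 0
        masks.append(causal_mask(seq_len, None if is_full else window))
    return masks


if __name__ == "__main__":
    # Layers 0-2 use the window, layer 3 uses full attention.
    for i, mask in enumerate(layer_masks(num_layers=4, seq_len=4, window=2, full_every=4)):
        print(i, mask[-1])  # what the last token can see in each layer
```

The practical point is cost: a sliding-window layer only attends over a fixed number of recent tokens per query, while the occasional full-attention layer keeps a path to distant context.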

Parakeet TDT lands as a new audio model. The release notes are short on details, so check the pull request directly for specifics.

HRM-Text is the most architecturally interesting addition. It is a hierarchical recurrent language model with two transformer stacks running in a nested recurrence: one stack handles slow, abstract planning, and the other handles fast, detailed computation. It uses PrefixLM attention, where instruction tokens attend bidirectionally and response tokens attend causally. It also adds per-head sigmoid output gates and parameterless RMSNorm. Importantly, this is a base model: there is no instruction tuning and no chat template, so you would need to supply both yourself. A paper and documentation are both available.
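
Three of those ingredients are easy to sketch in a few lines. The snippet below is a toy illustration in plain Python, not the HRM-Text code: the function names are hypothetical, the eps value is an assumption, and treating the output gate as one scalar per head is one plausible reading of "per-head sigmoid output gates" rather than something the release notes confirm.

```python
import math


def prefix_lm_mask(prefix_len, total_len):
    """mask[q][k] is True when query position q may attend to key position k.

    Positions below prefix_len are the instruction (prefix); the rest are the response.
    """
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            # Prefix keys are visible to every query, so the prefix is bidirectional
            # and the response sees the whole prefix. Response keys are visible
            # only to queries at or after them (causal).
            row.append(k < prefix_len or k <= q)
        mask.append(row)
    return mask


def rms_norm(x, eps=1e-6):
    """Parameterless RMSNorm: rescale by the root mean square, with no learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def gated_head_output(head_out, gate_logit):
    """Scale one attention head's output by sigmoid(gate_logit).

    Assumption: a single scalar gate per head; how the logit is produced is not
    covered here.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * v for v in head_out]


if __name__ == "__main__":
    for row in prefix_lm_mask(prefix_len=2, total_len=4):
        print(row)
    print(rms_norm([3.0, 4.0]))
    print(gated_head_output([2.0, 4.0], gate_logit=0.0))
```

Note that the mask depends on where the instruction ends, so any prompt format you build on top of the base model has to track that boundary explicitly.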

One breaking change you cannot ignore

If you are using SAM3, EdgeTAM, or SAM3-Lite-Text, the text_embeds input contract has changed. These models now expect full text embeddings instead of just pooler outputs. This aligns them with other models in the library, but existing inference code that still passes pooler outputs may fail outright or break silently, depending on your setup. Audit your input pipelines before upgrading.
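
One cheap way to surface the silent case is a shape guard in front of the model call. The sketch below is only an illustration: the helper name is hypothetical, and it assumes full text embeddings are rank 3 (batch, sequence, hidden) while pooler outputs are rank 2 (batch, hidden). The release notes do not state the exact shapes, so confirm them against the model documentation before relying on this.

```python
def check_text_embeds_shape(shape):
    """Raise ValueError if `shape` looks like a pooled embedding.

    Assumption: full text embeddings are (batch, seq_len, hidden) and pooler
    outputs are (batch, hidden). Verify against the model docs.
    """
    shape = tuple(shape)
    if len(shape) == 2:
        raise ValueError(
            f"text_embeds has shape {shape}, which looks like a pooler output; "
            "full per-token text embeddings are expected"
        )
    if len(shape) != 3:
        raise ValueError(f"unexpected text_embeds shape {shape}")
    return shape


if __name__ == "__main__":
    print(check_text_embeds_shape((1, 32, 1024)))  # passes through unchanged
    try:
        check_text_embeds_shape((1, 1024))
    except ValueError as err:
        print("caught:", err)
```

In a real pipeline you would call it as check_text_embeds_shape(text_embeds.shape) just before passing text_embeds to the model, so a stale pooled input fails immediately instead of producing quietly wrong results.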

Audio gets broader and more robust

AudioFlamingoNext model checkpoints are now supported. Dynamic vision and audio tensors have been extracted into standalone pure functions, which improves compilability of audio and vision encoders. Error messaging is now friendlier when you try to load audio from a video file. New documentation covers audio and video processors.

None of these audio changes are breaking, but the refactor into standalone pure functions could affect custom integrations that hook into encoder internals.

What to do today

If you maintain pipelines using SAM3, EdgeTAM, or SAM3-Lite-Text, update your text_embeds inputs before pulling v5.9.0. If you are exploring long-context MoE models, pull the Cohere2Moe docs and test the model against your context window requirements. If you are researching hierarchical reasoning architectures, HRM-Text is a clean base to experiment with; just plan for the absence of instruction tuning.