OpenCompass 0.5.2 lands with a broad benchmark expansion. Fourteen new benchmarks are now supported out of the box, and several infrastructure fixes tighten the evaluation pipeline for teams running large evaluation workloads.
The headline addition is breadth. On the scientific side, the release adds SciReasoner, Biology Instructions, Mol Instructions, and CMPhysBench. For math and reasoning, HMMT2025, AMO-Bench, IMO-Bench, and ProcessBench are now available. Instruction-following gets IFBench. Code evaluation picks up LCB-pro and a buffer-related fix to the existing LiveCodeBench runner. Rounding out the list are ATLAS, OpenSWI, ARC_AGI_2, and PI-LLM.
That range matters. Scientific reasoning and molecular instruction benchmarks are not commonly bundled in a single eval framework. If your team is building or fine-tuning models with domain-specific knowledge, you now have a faster path to structured comparisons without stitching together separate harnesses.
On the model side, the release adds evaluation examples for Intern-S1-Pro and inference support for the TeleChat API. Neither requires changes to your existing configs for other models.
A quieter but useful addition: support for monitoring multi-dimensional evaluation metrics, including output length, logprobs, and finish reasons. For product engineers debugging model behavior in evaluation runs, having finish reasons surfaced directly in the metrics layer makes it easier to tell a truncated response from a completed one without digging through raw outputs.
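To make the value of those three signals concrete, here is a minimal sketch of how they might be aggregated over a run. The record layout and field names below are hypothetical and are not OpenCompass's actual output schema; the only outside fact assumed is that OpenAI-style APIs report a finish reason of "length" when generation stops at the token limit.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-sample records. Field names are illustrative only,
# not the schema OpenCompass writes.
records = [
    {"output": "The answer is 42.", "finish_reason": "stop",
     "logprobs": [-0.1, -0.3, -0.2]},
    {"output": "Let me think step by step and", "finish_reason": "length",
     "logprobs": [-0.5, -1.5]},
    {"output": "x = 7", "finish_reason": "stop",
     "logprobs": [-0.2, -0.2]},
]


def summarize(records):
    """Aggregate finish reasons, output length, and logprobs over a run."""
    if not records:
        raise ValueError("no records to summarize")
    finish_counts = Counter(r["finish_reason"] for r in records)
    return {
        "finish_reasons": dict(finish_counts),
        # Share of samples cut off at the token limit.
        "truncated_fraction": finish_counts.get("length", 0) / len(records),
        "mean_output_chars": mean(len(r["output"]) for r in records),
        # Average of each sample's mean token logprob.
        "mean_logprob": mean(mean(r["logprobs"]) for r in records),
    }


summary = summarize(records)
print(summary)
```

A high truncated fraction usually means the score is being depressed by the generation length limit rather than by the model's reasoning, which is hard to spot from accuracy alone.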
The bug fix list is short but targeted. Three separate fixes address output-completeness issues in OpenAISDKStreaming. The Pyext runtime dependency is removed, which simplifies installation. Pattern matching in Smolinstruct is corrected.
Infrastructure work includes a parametrized timeout in OpenAISDK, a meta logger added to OpenICLInferTask, and an LLM-judge-based config for C-Eval. The CI pipeline was refactored, and unit tests were added for new datasets.
Four new contributors joined the project for this release: @zhuangziGiantfish, @xgao922, @Jensen246, and @ccx06.
What to do today: if you maintain an eval pipeline and need scientific or advanced reasoning benchmarks, pull 0.5.2 and add the relevant configs. If you were blocked by the OpenAISDKStreaming output-completeness bug, the fix is in this release. Check the full changelog before upgrading to catch any config-level changes that affect your existing runs.
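Before adding new benchmark configs, it can help to confirm which version an environment actually has installed. The sketch below uses only the standard library and assumes the installed distribution is named "opencompass"; the helper names are hypothetical, and the version parsing is deliberately simple rather than a full PEP 440 implementation.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.5.2' into (0, 5, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad '0.5' to (0, 5, 0)
        parts.append(0)
    return tuple(parts)


def has_minimum(installed, required="0.5.2"):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


try:
    # Assumption: the distribution name is "opencompass".
    installed = version("opencompass")
    status = "ok" if has_minimum(installed) else "older than 0.5.2"
    print(installed, status)
except PackageNotFoundError:
    print("opencompass is not installed in this environment")
```

Comparing integer tuples rather than raw strings avoids the classic mistake of ranking "0.10.0" below "0.5.2".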