OpenCompass 0.5.2 Adds Fourteen Benchmarks for Scientific AI Evaluation

OpenCompass 0.5.2 substantially expands evaluation coverage. The headline is fourteen new benchmarks spanning scientific reasoning, competition math, instruction following, and code. If you build or benchmark models for technical domains, this release fills several gaps in the previous version's coverage.

The fourteen new benchmarks are SciReasoner, Biology Instructions, Mol Instructions, CMPhysBench, IFBench, LCB-pro, PI-LLM, ProcessBench, ARC_AGI_2, HMMT2025, AMO-Bench, IMO-Bench, ATLAS, and OpenSWI. Together they cover scientific, mathematical, coding, and instruction-following evaluation. The release notes call out IFBench and SciReasoner as highlights, which suggests the maintainers consider them the most significant additions.

On the model and API side, the release adds evaluation examples for Intern-S1-Pro and inference support for the TeleChat API. If your team uses either of these, you can now plug them directly into an OpenCompass evaluation pipeline without custom wiring.

The release also adds monitoring of multi-dimensional evaluation metrics: output length, logprobs, and finish reasons. These signals complement accuracy scores. Tracking finish reasons, for example, lets you catch truncation or unexpected stops that a top-line score would hide entirely.
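To make the value of finish-reason tracking concrete, here is a minimal Python sketch of the kind of summary these metrics allow. It does not use the OpenCompass API: the record layout, the field names, and the summarize helper are assumptions made for illustration, and the "length" and "stop" labels follow the OpenAI-style convention in which "length" marks a response cut off by the token limit.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-sample records. The field names are assumptions for
# illustration, not the schema OpenCompass actually writes.
records = [
    {"prediction": "The answer is 42.", "finish_reason": "stop"},
    {"prediction": "Let x be the number of", "finish_reason": "length"},
    {"prediction": "B", "finish_reason": "stop"},
    {"prediction": "First, note that the", "finish_reason": "length"},
]


def summarize(records):
    """Tally finish reasons and report the average output length in characters."""
    reasons = Counter(r["finish_reason"] for r in records)
    avg_chars = mean(len(r["prediction"]) for r in records)
    # Share of responses that stopped because they hit the length limit.
    truncated_share = reasons.get("length", 0) / len(records)
    return {
        "finish_reasons": dict(reasons),
        "avg_chars": avg_chars,
        "truncated_share": truncated_share,
    }


summary = summarize(records)
print(summary)
```

A high truncated share is a hint that the output length limit, rather than the model's reasoning, may be driving low scores on a benchmark.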

An LLM-judge-based config for C-Eval is also included. Using a judge model instead of exact-match scoring gives you more flexibility when responses are open-ended or do not follow a strict answer format on this widely used Chinese-language benchmark.
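The difference between the two scoring styles can be sketched in a few lines of Python. This is not the C-Eval config shipped in the release: exact_match, build_judge_prompt, judge_score, and fake_judge are hypothetical names, the prompt wording is illustrative, and fake_judge is a hard-coded stub standing in for a real judge model call.

```python
from typing import Callable


def exact_match(prediction: str, reference: str) -> bool:
    """Strict scoring: the response must equal the reference label."""
    return prediction.strip() == reference.strip()


def build_judge_prompt(question: str, prediction: str, reference: str) -> str:
    """Assemble a grading prompt. The wording here is illustrative only."""
    return (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {prediction}\n"
        "Reply with CORRECT or INCORRECT."
    )


def judge_score(question: str, prediction: str, reference: str,
                judge: Callable[[str], str]) -> bool:
    """Score a response by asking a judge callable for a verdict."""
    verdict = judge(build_judge_prompt(question, prediction, reference))
    return verdict.strip().upper() == "CORRECT"


def fake_judge(prompt: str) -> str:
    """Stub judge for this sketch; a real setup would call a judge model."""
    return "CORRECT" if "choice is B" in prompt else "INCORRECT"


question = "Which gas do plants absorb during photosynthesis? A. Oxygen B. Carbon dioxide"
prediction = "The correct choice is B, carbon dioxide."
reference = "B"

strict = exact_match(prediction, reference)
judged = judge_score(question, prediction, reference, fake_judge)
print(strict, judged)
```

The verbose response fails strict matching even though it picks the right option, which is the kind of case a judge-based config is meant to handle. The trade-off is that scores then depend on the judge model's own reliability.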

Several bug fixes address known pain points. Three separate fixes resolve output-completeness problems in OpenAISDKStreaming. A buffer-related error in LiveCodeBench evaluation is resolved, and pattern matching in Smolinstruct is corrected. Pyext has been removed as a runtime requirement, which simplifies dependency management.

Infrastructure improvements include a parametrized timeout in OpenAISDK (helpful when evaluating slower models), a meta logger added to OpenICLInferTask for better observability, and updated CI with new dataset coverage and a unified test suite.

Four new contributors made their first commits in this release, which is a healthy sign for a project that depends on community-maintained benchmark configs.

What to do today: if you are running evals on any science or math-heavy model, pull 0.5.2 and wire in SciReasoner or CMPhysBench to get a clearer picture of reasoning quality. If you were previously blocked by the OpenAISDKStreaming output issues, the fixes in this release make it worth retrying. Check the full changelog for PR-level detail before updating your evaluation configs.