OpenCompass 0.5.2 Adds 14 Benchmarks and New Model Support

OpenCompass 0.5.2 lands with a significant expansion of evaluation coverage. If you are building or benchmarking LLMs for scientific, mathematical, or instruction-following tasks, this release directly changes what you can measure.

Fourteen new benchmarks in one release. The list spans science, competition math, instruction following, and several other areas. On the science side: SciReasoner, Biology Instructions, Mol Instructions, and CMPhysBench. For math competition coverage: HMMT2025, AMO-Bench, and IMO-Bench. For instruction-following: IFBench and PI-LLM. Rounding out the set are ATLAS, OpenSWI, ARC_AGI_2, ProcessBench, and LCB_pro. That is a lot of new evaluation signal you can collect without writing custom evaluation configs.

New model and API support. Intern-S1-Pro evaluation examples are now included. TeleChat API inference is also supported. If either is in your stack, you can now plug it into OpenCompass pipelines directly.

Multi-dimensional metric monitoring. The release adds support for tracking output length, logprobs, and finish reasons alongside standard accuracy metrics. This matters for builders who care about more than raw benchmark scores. Knowing why a model stopped generating, or how confident it was, is operationally useful during evaluation runs. A cluster of wrong answers that all stopped at the length limit, for instance, can point to truncation rather than a reasoning failure.
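
To make the idea concrete, here is a minimal sketch of reading these signals next to accuracy. The record layout and field names are hypothetical and are not taken from OpenCompass output files; the "length" value follows the OpenAI-style convention for a response cut off at the token limit.

```python
from collections import Counter
from statistics import mean

# Hypothetical per-sample records; field names are illustrative only
# and do not reflect the format OpenCompass writes.
records = [
    {"correct": True,  "output_tokens": 212,  "mean_logprob": -0.21, "finish_reason": "stop"},
    {"correct": False, "output_tokens": 4096, "mean_logprob": -0.94, "finish_reason": "length"},
    {"correct": True,  "output_tokens": 187,  "mean_logprob": -0.18, "finish_reason": "stop"},
    {"correct": False, "output_tokens": 4096, "mean_logprob": -1.10, "finish_reason": "length"},
]

def summarize(records):
    """Report accuracy next to length, confidence, and stop-reason statistics."""
    finish_counts = Counter(r["finish_reason"] for r in records)
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "mean_output_tokens": mean(r["output_tokens"] for r in records),
        "mean_logprob": mean(r["mean_logprob"] for r in records),
        "finish_reasons": dict(finish_counts),
        # Share of samples that hit the token limit before finishing.
        "truncated_fraction": finish_counts["length"] / len(records),
    }

summary = summarize(records)
print(summary)
```

In this toy data every wrong answer is also a truncated one, which is the kind of pattern an accuracy number alone would hide.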

LLM-judge config for C-Eval. A new LLM-judge-based config for C-Eval is included. This expands how you can score C-Eval outputs beyond pattern matching.
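
As an illustration of where pattern matching runs out, the sketch below uses a simple rule-based extractor for A-D multiple-choice answers. The regex and the function name are assumptions made for this example, not the postprocessor OpenCompass ships.

```python
import re

# Illustrative rule-based extractor for an A-D multiple-choice answer.
# This is a stand-in, not OpenCompass's own postprocessing code.
ANSWER_PATTERN = re.compile(r"answer is\s*\(?([ABCD])\b", re.IGNORECASE)

def extract_choice(response):
    """Return the chosen letter, or None when the pattern finds nothing."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None

responses = [
    "The answer is B.",
    "After eliminating the others, only the second option holds.",
]
for text in responses:
    print(repr(extract_choice(text)), "<-", text)
```

The second response states a choice in words the pattern cannot parse, so a rule-based scorer gets nothing to score. A judge model that reads the whole response is one way to handle that kind of case.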

Bug fixes that affect reliability. OpenAISDKStreaming had issues with output completeness, which three separate fixes now address. A buffer-related error in LiveCodeBench evaluation is resolved. A pattern match bug in Smolinstruct is corrected. Pyext has been removed from runtime requirements, which simplifies dependency management.
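
The release notes do not describe the cause of the streaming issue, so the following is only a generic sketch of a completeness check for streamed output, not the patch itself. The chunk dictionaries and their keys are hypothetical.

```python
def collect_stream(chunks):
    """Join streamed text pieces and return them with the final finish reason."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        if chunk.get("text"):
            parts.append(chunk["text"])
        # Keep the last non-empty finish reason the stream reports.
        if chunk.get("finish_reason"):
            finish_reason = chunk["finish_reason"]
    return "".join(parts), finish_reason

def is_complete(finish_reason):
    # Assumption: only an explicit "stop" counts as a complete answer.
    return finish_reason == "stop"

# Hypothetical streams: one that ends cleanly, one that is cut short.
full = [{"text": "The answer "}, {"text": "is 42."}, {"text": "", "finish_reason": "stop"}]
cut = [{"text": "The answer "}, {"text": "is"}]

for name, stream in (("full", full), ("cut", cut)):
    text, reason = collect_stream(stream)
    print(name, repr(text), reason, is_complete(reason))
```

A check like this can be run over saved predictions before and after upgrading to see whether previously scored outputs were in fact incomplete.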

Infrastructure improvements. The OpenAISDK now supports a parametrized timeout, which gives you more control over long-running inference calls. BigCodeBench gains headers as an input parameter. A meta logger was added to OpenICLInferTask for better observability during inference runs.

Four new contributors joined the project for this release: @zhuangziGiantfish, @xgao922, @Jensen246, and @ccx06.

What to do now. If your evaluation pipeline touches any of the newly supported benchmark domains, pull 0.5.2 and review the new configs. The multi-dimensional metric monitoring (output length, logprobs, finish reasons) is worth enabling even if you are not switching benchmarks. It adds diagnostic depth to existing runs with minimal setup. If you were hitting the OpenAISDKStreaming output completeness bug, this upgrade is not optional.
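
Before reviewing configs, it can help to confirm which release is actually installed. The sketch below assumes the package is installed under the distribution name "opencompass" and that its version string is plain dotted numbers; the helper names are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_release(text):
    """Turn '0.5.2' into (0, 5, 2); assumes a plain dotted numeric version."""
    return tuple(int(part) for part in text.split(".")[:3])

def meets_minimum(installed, minimum="0.5.2"):
    """Compare versions numerically so that 0.10.0 ranks above 0.5.2."""
    return parse_release(installed) >= parse_release(minimum)

try:
    # Assumption: the distribution name is "opencompass".
    installed = version("opencompass")
    status = "ok" if meets_minimum(installed) else "older than 0.5.2"
    print(installed, status)
except PackageNotFoundError:
    print("opencompass is not installed in this environment")
```

Comparing integer tuples rather than raw strings avoids ranking a later release such as 0.10.0 below 0.5.2.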