China-based AI lab DeepSeek has launched its latest open-weight model, DeepSeekMath-V2, which is now available under the Apache 2.0 license.
According to the AI lab, the model delivers strong theorem-proving capabilities in mathematics and has achieved gold-level performance at the International Mathematical Olympiad (IMO) 2025.
The model successfully solved 5 out of the 6 problems at IMO 2025. “Imagine owning the brain of one of the world’s top mathematicians for free,” said Clement Delangue, co-founder and CEO of Hugging Face, in a post on X.
He added that, to his knowledge, no existing chatbot or API currently provides access to an IMO 2025 gold-medalist model.
In July, advanced models from Google DeepMind’s Gemini series and an experimental reasoning model from OpenAI also reached gold-medal level at IMO 2025. Like DeepSeek’s latest model, both systems solved 5 out of the 6 problems, making them the first AI models to achieve gold-level scores.
The IMO is widely considered the world’s most challenging high-school mathematics competition. In the 2025 edition, only 72 of the 630 participating students earned gold medals.
Beyond its IMO 2025 performance, DeepSeekMath-V2 also excelled in China’s most challenging national contest, the China Mathematical Olympiad (CMO), and delivered near-perfect results on the undergraduate-level Putnam exam.
DeepSeek noted that on the 2024 Putnam exam, one of the most prestigious undergraduate mathematics competitions, the model fully solved 11 of the 12 problems and answered the remaining one with only minor errors, earning a score of 118 out of 120 and surpassing the top human score of 90.
DeepSeek contends that while recent AI models perform well on math benchmarks such as AIME and HMMT, they often fall short in demonstrating strong underlying reasoning.
Many mathematical tasks, such as theorem proving, demand rigorous step-by-step reasoning rather than simple numerical answers, making final-answer-based evaluation insufficient.
To overcome this limitation, DeepSeek highlights the importance of models capable of evaluating and improving their own reasoning. The team notes that “self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions.”
For context, test-time compute refers to using substantial computational resources during inference, rather than during training, to allow the model to reason more deeply, explore multiple solution paths, and refine its final answer.
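One common way to spend test-time compute is best-of-n selection: sample several candidate solutions and keep the one a verifier scores highest. The sketch below illustrates the idea only; the function names and interfaces are hypothetical placeholders, not DeepSeek’s API.

```python
# Minimal sketch of best-of-n test-time compute (all interfaces are
# hypothetical placeholders, not DeepSeek's actual code): sample several
# candidate proofs at inference time, score each with a verifier, keep the best.
from typing import Callable, List

def best_of_n(problem: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate proofs and return the one the verifier scores highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda proof: score(problem, proof))
```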
DeepSeek’s method involves training a dedicated verifier that evaluates the quality of proofs rather than final answers. This verifier then guides a separate proof-generation model, which is rewarded only when it identifies and corrects its own mistakes, never for concealing them.
As the paper describes, they “train a proof generator using the verifier as the reward model, encouraging it to detect and fix as many issues as possible in its own proofs before finalising them.”
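A rough sketch of what one rollout in such a loop could look like, under stated assumptions: the draft, revise, and verify functions below are invented stand-ins for the generator’s self-review and the verifier’s whole-proof grading, not the released implementation.

```python
# Illustrative sketch (assumed interfaces, not DeepSeek's code): the proof
# generator drafts a proof, reviews and revises it, and is then rewarded by a
# verifier that grades the rigor of the full proof rather than a final answer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Episode:
    problem: str
    draft: str
    revised: str
    reward: float

def rollout(problem: str,
            draft_fn: Callable[[str], str],
            revise_fn: Callable[[str, str], str],
            verify_fn: Callable[[str, str], float]) -> Episode:
    draft = draft_fn(problem)             # initial proof attempt
    revised = revise_fn(problem, draft)   # self-review: detect and fix issues before finalising
    reward = verify_fn(problem, revised)  # verifier scores the proof's quality, not the answer
    return Episode(problem, draft, revised, reward)
```

Because the reward comes from the verifier’s judgment of the finished proof, fixing a flaw raises the score while papering over it does not.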
To avoid overfitting to its own verification system, DeepSeek continuously increases the difficulty of the checking process.
It achieves this by allocating more computational resources and automatically labeling challenging proofs, allowing the verifier to evolve in tandem with the generator.
According to the team, this approach “allows them to scale verification compute to automatically label new, hard-to-verify proofs, generating training data that further enhances the verifier.”
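A hedged sketch of that co-evolution step, with assumed interfaces rather than the released code: extra verification compute (here, simply many independent verifier passes) is used to label proofs the current verifier finds hard, and those labels become training data for the next verifier round.

```python
# Hedged sketch of scaling verification compute to auto-label hard proofs
# (function names and aggregation are assumptions, not DeepSeek's code).
from statistics import mean
from typing import Callable, List, Tuple

def auto_label(problem: str, proof: str,
               verify_once: Callable[[str, str], float],
               passes: int = 16) -> float:
    """Aggregate many verifier passes into a higher-confidence label."""
    return mean(verify_once(problem, proof) for _ in range(passes))

def build_verifier_data(hard_cases: List[Tuple[str, str]],
                        verify_once: Callable[[str, str], float]) -> List[Tuple[str, str, float]]:
    """Label hard-to-verify proofs with scaled compute for the next verifier training round."""
    return [(p, proof, auto_label(p, proof, verify_once)) for p, proof in hard_cases]
```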
The model’s weights are available for download on Hugging Face. “That’s democratisation of AI and knowledge at its best, literally,” said Delangue.
DeepSeek gained attention after releasing a low-cost, open-source model that competed with US AI systems. Its launch of the DeepSeek-R1 reasoning model raised questions about whether open models could challenge the commercial advantage of closed systems, briefly unsettling investor confidence in AI leaders like NVIDIA.









