[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
By PulseAugur Editorial
Summary from 25 sources
Researchers are developing new benchmarks and evaluation methods for large language models (LLMs) in mathematical reasoning and educational assessment. New datasets like ESTBook and Math-PT aim to go beyond simple accuracy, focusing on pedagogical reasoning and reducing linguistic bias. Other work explores the impact of self-consistency and reasoning effort on automated scoring, with findings suggesting strategic model selection can optimize accuracy and cost. Additionally, frameworks like MaSTer are being created to automatically generate adversarial test cases for evaluating and improving LLM robustness.
AI
IMPACT
New benchmarks and evaluation techniques will drive more robust and reliable LLM development for educational and reasoning tasks.
RANK_REASON
Multiple arXiv papers introducing new benchmarks, evaluation frameworks, and analysis of LLM performance in mathematical reasoning and educational assessment.
arXiv:2605.00238v1 Announce Type: new Abstract: Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance var…
arXiv cs.CL
TIER_1·Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne·
arXiv:2605.00200v1 Announce Type: new Abstract: Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educatio…
arXiv:2604.26954v1 Announce Type: cross Abstract: Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effor…
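The self-consistency studied in this paper is simple to state: sample several reasoning paths from the same model and keep the majority final answer. A minimal sketch of that intra-model majority vote, with illustrative names (not the paper's code):

```python
from collections import Counter

def self_consistency(answers):
    """Intra-model majority voting: given final answers from several
    sampled reasoning paths of one model, return the most common one."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains of thought from one model, reduced to final answers:
final = self_consistency(["42", "41", "42", "42", "7"])  # majority is "42"
```

The paper's finding is that tuning this sampling (and reasoning effort) per model can beat cross-model ensembling on accuracy per dollar.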
arXiv:2505.17056v2 Announce Type: replace-cross Abstract: As large language models (LLMs) are increasingly integrated into educational tools, current evaluations on standardized tests predominantly focus on binary outcome accuracy. Instead, an effective AI tutor must exhibit fait…
arXiv:2604.26607v1 Announce Type: new Abstract: As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles the bottleneck issue by sugges…
arXiv cs.CL
TIER_1·Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita, Miguel Faria, Eliezer de Souza da Silva, André F. T. Martins·
arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a si…
arXiv:2601.21225v2 Announce Type: replace Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed a…
arXiv cs.CL
TIER_1·Navya Gupta, Rishitej Reddy Vyalla, Avinash Anand, Chhavi Kirtani, Erik Cambria, Zhengchen Zhang, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah·
arXiv:2604.24114v1 Announce Type: new Abstract: Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings …
arXiv cs.CL
TIER_1·Martin Balko, Jan Grebík, Pavel Hubáček, Martin Koutecký, Matěj Kripner, Václav Rozhoň, Robert Šámal, Adrián Zámečník·
arXiv:2604.16989v2 Announce Type: replace Abstract: We report new results on eight problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel…
arXiv:2506.05038v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. E…
arXiv cs.AI
TIER_1·Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky·
arXiv:2604.22597v1 Announce Type: new Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models …
arXiv:2604.21935v1 Announce Type: cross Abstract: Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existin…
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual datas…
Today we release Llemma: 7 billion and 34 billion parameter language models for mathematics. The Llemma models were initialized with Code Llama weights, then trained on the Proof-Pile II, a 55 billion token dataset of mathemat…
**Epoch AI** collaborated with over **60 leading mathematicians** to create the **FrontierMath benchmark**, a fresh set of hundreds of original math problems with easy-to-verify answers, aiming to challenge current AI models. The benchmark reveals that all tested models, includin…
#deepseek #llm #grpo GRPO is one of the core advancements used in DeepSeek-R1, but it was already introduced last year in this paper, which combines new RL techniques with iterative data collection to achieve remarkable performance on mathematics benchmarks with just a 7B …
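The core move in GRPO is to drop the value network: for each prompt, a group of completions is sampled, and each completion's advantage is its reward normalized by the group's mean and standard deviation. A minimal sketch of that group-relative advantage, assuming a binary correctness reward (illustrative names, not DeepSeekMath's code):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    completion's reward by the group mean and std. No critic/value
    model is needed, which is the main memory saving over PPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard: uniform groups have std 0
    return [(r - mu) / sigma for r in rewards]

# One prompt, a group of G=4 sampled solutions scored 0/1 for correctness:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # correct ones get positive advantage
```

These advantages then feed a PPO-like clipped policy-gradient objective; because the baseline comes from the group itself, the same sampling budget does double duty as the baseline estimator.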
LLM-as-a-Judge Framework Fixes Math Evaluation Failures Researchers propose an LLM-as-a-judge framework for evaluating math reasoning that beats rule-based symbolic comparison, fixing failures in Lighteval and SimpleRL. This enables more accurate benchmark https://gentic.news/ar…
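The failure mode the judge targets is easy to reproduce: rule-based comparison handles exact numeric equivalence but mis-scores free-form answers that don't parse. A hedged stdlib sketch of the rule-based side (not the authors' implementation, and simpler than real symbolic checkers):

```python
from fractions import Fraction

def rule_based_equal(pred, gold):
    """Naive rule-based answer matching: try to parse both answers as
    exact rationals and compare values; otherwise fall back to a
    string match. Free-form answers that fail to parse get mis-scored,
    which is where an LLM judge can recover equivalence."""
    try:
        return Fraction(pred) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return pred.strip() == gold.strip()

rule_based_equal("1/2", "0.5")   # handled: both parse to the same rational
rule_based_equal("x = 3", "3")   # false negative: "x = 3" doesn't parse
```

The second call is the kind of benign formatting difference that a judge model scores correctly but a rigid comparator rejects.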
Version Sentinel: A Claude Code Plugin That Blocks Hallucinated Package Versions Version Sentinel uses Claude Code's hook system to intercept dependency changes and require version verification, preventing supply-chain risks from hallucinated package versions. https://gentic.new…
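The verification idea can be sketched independently of Claude Code's hook API (not reproduced here): added lines in a diff that pin a dependency are checked against a trusted snapshot of known versions, and unknown pins are flagged before they land. Function names and the registry data below are hypothetical; a real implementation would query the package index:

```python
import re

def flagged_requirements(diff_lines, known_versions):
    """Flag added pinned dependencies (e.g. '+pkg==1.2.3' diff lines)
    whose exact version is absent from a trusted registry snapshot.
    Hypothetical sketch: 'known_versions' stands in for a live index
    lookup, which is what guards against hallucinated versions."""
    flags = []
    for line in diff_lines:
        m = re.match(r"\+([A-Za-z0-9_.-]+)==([\w.]+)$", line)
        if m and m.group(2) not in known_versions.get(m.group(1), set()):
            flags.append(m.group(0)[1:])  # drop the leading diff '+'
    return flags

# Hypothetical snapshot: requests has no 9.0.1 release, so it is flagged.
flagged_requirements(["+requests==9.0.1", "+numpy==1.26.4"],
                     {"requests": {"2.31.0"}, "numpy": {"1.26.4"}})
```

Blocking on the flagged list (rather than silently fixing it) matches the plugin's described behavior of requiring verification before the change proceeds.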