[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
By PulseAugur Editorial
Summary from 25 sources
Researchers are developing new benchmarks and evaluation methods for large language models (LLMs) in mathematical reasoning and educational assessment. New datasets like ESTBook and Math-PT aim to go beyond simple accuracy, focusing on pedagogical reasoning and reducing linguistic bias. Other work explores the impact of self-consistency and reasoning effort on automated scoring, with findings suggesting strategic model selection can optimize accuracy and cost. Additionally, frameworks like MaSTer are being created to automatically generate adversarial test cases for evaluating and improving LLM robustness.
AI
IMPACT
New benchmarks and evaluation techniques will drive more robust and reliable LLM development for educational and reasoning tasks.
RANK_REASON
Multiple arXiv papers introducing new benchmarks, evaluation frameworks, and analysis of LLM performance in mathematical reasoning and educational assessment.
arXiv:2605.00238v1 Announce Type: new Abstract: Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance var…
arXiv cs.CL
TIER_1·Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne·
arXiv:2605.00200v1 Announce Type: new Abstract: Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educatio…
arXiv:2604.26954v1 Announce Type: cross Abstract: Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effor…
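The self-consistency studied in this paper is simple to state: sample several reasoning paths from the same model and keep the majority final answer. A minimal sketch of that intra-model majority vote, with illustrative names (not the paper's code):

```python
from collections import Counter

def self_consistency(answers):
    """Intra-model majority voting: given final answers from several
    sampled reasoning paths of one model, return the most common one."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains of thought from one model, reduced to final answers:
final = self_consistency(["42", "41", "42", "42", "7"])  # majority is "42"
```

The paper's finding is that tuning this sampling (and reasoning effort) per model can beat cross-model ensembling on accuracy per dollar.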
arXiv:2505.17056v2 Announce Type: replace-cross Abstract: As large language models (LLMs) are increasingly integrated into educational tools, current evaluations on standardized tests predominantly focus on binary outcome accuracy. Instead, an effective AI tutor must exhibit fait…
arXiv:2604.26607v1 Announce Type: new Abstract: As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles the bottleneck issue by sugges…
arXiv cs.CL
TIER_1·Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita, Miguel Faria, Eliezer de Souza da Silva, André F. T. Martins·
arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a si…
arXiv:2601.21225v2 Announce Type: replace Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed a…
arXiv cs.CL
TIER_1·Navya Gupta, Rishitej Reddy Vyalla, Avinash Anand, Chhavi Kirtani, Erik Cambria, Zhengchen Zhang, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah·
arXiv:2604.24114v1 Announce Type: new Abstract: Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings …
arXiv cs.CL
TIER_1·Martin Balko, Jan Grebík, Pavel Hubáček, Martin Koutecký, Matěj Kripner, Václav Rozhoň, Robert Šámal, Adrián Zámečník·
arXiv:2604.16989v2 Announce Type: replace Abstract: We report new results on eight problems in mathematics and theoretical computer science, produced with the assistance of Bolzano, an open-source multi-agent LLM system. Bolzano orchestrates rounds of interaction between parallel…
arXiv:2506.05038v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. E…
arXiv cs.AI
TIER_1·Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky·
arXiv:2604.22597v1 Announce Type: new Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models …
arXiv:2604.21935v1 Announce Type: cross Abstract: Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existin…
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual datas…
Today we release Llemma: 7 billion and 34 billion parameter language models for mathematics. The Llemma models were initialized with Code Llama weights, then trained on the Proof-Pile II, a 55 billion token dataset of mathemat…
**Epoch AI** collaborated with over **60 leading mathematicians** to create the **FrontierMath benchmark**, a fresh set of hundreds of original math problems with easy-to-verify answers, aiming to challenge current AI models. The benchmark reveals that all tested models, includin…
#deepseek #llm #grpo GRPO is one of the core advancements used in DeepSeek-R1, but it was already introduced last year in this paper, which combines new RL techniques with iterative data collection to achieve remarkable performance on mathematics benchmarks with just a 7B …
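The core move in GRPO is to drop the value network: for each prompt, a group of completions is sampled, and each completion's advantage is its reward normalized by the group's mean and standard deviation. A minimal sketch of that group-relative advantage, assuming a binary correctness reward (illustrative names, not DeepSeekMath's code):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    completion's reward by the group mean and std. No critic/value
    model is needed, which is the main memory saving over PPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard: uniform groups have std 0
    return [(r - mu) / sigma for r in rewards]

# One prompt, a group of G=4 sampled solutions scored 0/1 for correctness:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # correct ones get positive advantage
```

These advantages then feed a PPO-like clipped policy-gradient objective; because the baseline comes from the group itself, the same sampling budget does double duty as the baseline estimator.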
LLM-as-a-Judge Framework Fixes Math Evaluation Failures Researchers propose an LLM-as-a-judge framework for evaluating math reasoning that beats rule-based symbolic comparison, fixing failures in Lighteval and SimpleRL. This enables more accurate benchmark https://gentic.news/ar…
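The failure mode the judge targets is easy to reproduce: rule-based comparison handles exact numeric equivalence but mis-scores free-form answers that don't parse. A hedged stdlib sketch of the rule-based side (not the authors' implementation, and simpler than real symbolic checkers):

```python
from fractions import Fraction

def rule_based_equal(pred, gold):
    """Naive rule-based answer matching: try to parse both answers as
    exact rationals and compare values; otherwise fall back to a
    string match. Free-form answers that fail to parse get mis-scored,
    which is where an LLM judge can recover equivalence."""
    try:
        return Fraction(pred) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return pred.strip() == gold.strip()

rule_based_equal("1/2", "0.5")   # handled: both parse to the same rational
rule_based_equal("x = 3", "3")   # false negative: "x = 3" doesn't parse
```

The second call is the kind of benign formatting difference that a judge model scores correctly but a rigid comparator rejects.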
Version Sentinel: A Claude Code Plugin That Blocks Hallucinated Package Versions Version Sentinel uses Claude Code's hook system to intercept dependency changes and require version verification, preventing supply-chain risks from hallucinated package versions. https://gentic.new…
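The verification idea can be sketched independently of Claude Code's hook API (not reproduced here): added lines in a diff that pin a dependency are checked against a trusted snapshot of known versions, and unknown pins are flagged before they land. Function names and the registry data below are hypothetical; a real implementation would query the package index:

```python
import re

def flagged_requirements(diff_lines, known_versions):
    """Flag added pinned dependencies (e.g. '+pkg==1.2.3' diff lines)
    whose exact version is absent from a trusted registry snapshot.
    Hypothetical sketch: 'known_versions' stands in for a live index
    lookup, which is what guards against hallucinated versions."""
    flags = []
    for line in diff_lines:
        m = re.match(r"\+([A-Za-z0-9_.-]+)==([\w.]+)$", line)
        if m and m.group(2) not in known_versions.get(m.group(1), set()):
            flags.append(m.group(0)[1:])  # drop the leading diff '+'
    return flags

# Hypothetical snapshot: requests has no 9.0.1 release, so it is flagged.
flagged_requirements(["+requests==9.0.1", "+numpy==1.26.4"],
                     {"requests": {"2.31.0"}, "numpy": {"1.26.4"}})
```

Blocking on the flagged list (rather than silently fixing it) matches the plugin's described behavior of requiring verification before the change proceeds.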