New research boosts LLM reasoning with speculative methods and physical insights
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 211 sources
Recent research explores novel methods to enhance the reasoning capabilities and efficiency of large language models (LLMs). Papers introduce techniques like speculative exploration for Tree-of-Thought reasoning to break synchronization bottlenecks and achieve significant speedups. Other work focuses on improving tool-integrated reasoning by pruning erroneous tool calls at inference time and developing frameworks for robots to perform physical reasoning in latent spaces before acting. Additionally, research investigates the effectiveness of different reasoning protocols, such as debate and voting, for LLMs, finding that while some methods improve safety, they don't always enhance usefulness.
AI
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to …
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer…
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred fo…
As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet…
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronizat…
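For orientation, here is a minimal sketch of plain Tree-of-Thought search (the generic baseline, not the paper's speculative-exploration method). The reward dependency barrier is visible in the inner loop: every candidate thought must be scored before the next tree level can expand. `propose` and `score` are hypothetical stand-ins for an LLM thought proposer and a reward/value model.

```python
# Generic breadth-first Tree-of-Thought sketch; `propose` and `score`
# are hypothetical stand-ins for an LLM thought proposer and a
# reward/value model, not the paper's interface.
def tree_of_thought(problem, propose, score, depth=3, branch=4, beam=2):
    frontier = [("", 0.0)]  # (partial reasoning trace, cumulative value)
    for _ in range(depth):
        candidates = []
        for trace, value in frontier:
            for thought in propose(problem, trace, n=branch):
                new_trace = trace + "\n" + thought
                # Reward dependency barrier: the next level cannot be
                # expanded until every candidate here has been scored.
                candidates.append((new_trace, value + score(problem, new_trace)))
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return max(frontier, key=lambda c: c[1])[0]
```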
Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how …
Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure …
When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched …
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe …
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a di…
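For reference, the "dominant response" this abstract criticizes, permanently evicting low-importance tokens once the cache exceeds a budget, can be sketched as below. The importance scores are assumed to be, e.g., cumulative attention received per token; this is the generic eviction baseline, not the paper's alternative.

```python
import torch

def evict_low_importance(keys, values, importance, budget):
    """Generic KV-cache eviction baseline (the approach the abstract
    reports as catastrophic for reasoning traces): permanently keep
    only the `budget` highest-importance tokens.

    keys, values: [seq_len, num_heads, head_dim] cached tensors
    importance:   [seq_len], e.g. cumulative attention each token received
    """
    if keys.size(0) <= budget:
        return keys, values, importance
    keep = torch.topk(importance, budget).indices.sort().values  # keep order
    return keys[keep], values[keep], importance[keep]
```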
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formal…
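The partial-credit signal described here comes down to a weighted average of per-criterion judge scores. A minimal sketch, assuming a `judge` callable that returns a score in [0, 1] for one criterion (the criteria and interface below are illustrative, not the paper's):

```python
def rubric_reward(response, criteria, judge):
    """Partial-credit reward: score each (description, weight) criterion
    independently with an LLM judge, then combine as a weighted average
    rather than a single binary or holistic outcome."""
    total_weight = sum(weight for _, weight in criteria)
    weighted = sum(weight * judge(response, desc) for desc, weight in criteria)
    return weighted / total_weight

# Hypothetical task-specific, verifiable criteria with weights:
criteria = [
    ("final answer is stated explicitly", 1.0),
    ("each algebraic step is justified", 2.0),
    ("units are carried through correctly", 0.5),
]
```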
Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps…
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses…
Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treatin…
Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how co…
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, w…
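Self-consistency as used here is straightforward to state in code: sample several independent CoT traces, extract each final answer, and return the most common one. `sample_cot` and `extract_answer` are hypothetical helpers standing in for an LLM call and an answer parser.

```python
from collections import Counter

def self_consistency(question, sample_cot, extract_answer, n=16):
    """Majority voting (MV) over n independently sampled
    Chain-of-Thought traces: the standard self-consistency baseline."""
    answers = [extract_answer(sample_cot(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```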
Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as an external evaluator disjointed from the policy…
Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous stat…
Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, an…
arXiv:2605.06638v1 Announce Type: cross Abstract: Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments.…
arXiv:2605.05737v1 Announce Type: cross Abstract: Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across…
arXiv:2605.06188v1 Announce Type: cross Abstract: On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-…
arXiv:2605.05893v1 Announce Type: new Abstract: Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose…
arXiv cs.CL
TIER_1·Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera·
arXiv:2605.05532v1 Announce Type: new Abstract: This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixt…
arXiv cs.LG
TIER_1·Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou·
arXiv:2512.18857v3 Announce Type: replace-cross Abstract: Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelin…
arXiv cs.LG
TIER_1·Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi·
arXiv:2510.02312v2 Announce Type: replace Abstract: Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifac…
arXiv cs.LG
TIER_1·William T. Redman, Erik C. Johnson, Brian Robinson·
arXiv:2605.05495v1 Announce Type: new Abstract: Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible comp…
arXiv:2605.05438v1 Announce Type: new Abstract: Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate th…
arXiv:2605.06326v1 Announce Type: new Abstract: Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong th…
arXiv cs.CL
TIER_1·David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal·
arXiv:2602.11509v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and v…
arXiv:2605.05407v1 Announce Type: new Abstract: Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which of…
arXiv:2605.05561v1 Announce Type: new Abstract: Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of new…
arXiv:2605.05678v1 Announce Type: new Abstract: Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in re…
arXiv cs.AI
TIER_1·Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria·
arXiv:2605.06165v1 Announce Type: new Abstract: As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-w…
arXiv cs.AI
TIER_1·Marc Boubnovski Martell, Josefa Lia Stoisser, Kaspar Märtens, Jialin Yu, Robert Kitchen, Philip Torr, Jesper Ferkinghoff-Borg·
arXiv:2605.06308v1 Announce Type: new Abstract: Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry …
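The black-box baseline named here is concrete: confidence is the vote share of the modal answer over K sampled traces, which costs one full generation per sample (hence linearly expensive in K). A minimal sketch with a hypothetical `sample_answer` helper:

```python
from collections import Counter

def self_consistency_confidence(question, sample_answer, k=10):
    """Black-box confidence baseline: sample K chain-of-thought answers
    through a text-only API and report the agreement fraction of the
    most common answer. Cost grows linearly with K."""
    answers = [sample_answer(question) for _ in range(k)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / k
```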
arXiv:2508.16745v3 Announce Type: replace Abstract: Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation …
arXiv:2605.05566v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequen…
arXiv cs.LG
TIER_1·Aymen Echarghaoui, Dongxia Wu, Emily B. Fox·
arXiv:2605.05386v1 Announce Type: cross Abstract: Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled me…
arXiv:2605.06660v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training a…
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Exis…
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reas…
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. …
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However…
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regulariz…
arXiv:2605.04078v1 Announce Type: new Abstract: Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher-student hie…
arXiv:2605.04352v1 Announce Type: new Abstract: We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a fin…
arXiv cs.LG
TIER_1·Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Karl Audun K. Borgersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Grønningsæter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Z·
arXiv:2507.14874v2 Announce Type: replace Abstract: Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine (TM) both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of dat…
arXiv cs.CL
TIER_1·Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang·
arXiv:2508.04204v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. C…
arXiv cs.AI
TIER_1·Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn·
arXiv:2604.17654v3 Announce Type: replace Abstract: Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for…
arXiv:2605.02173v1 Announce Type: new Abstract: We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures…
arXiv cs.AI
TIER_1·Kei Nishimura-Gasparian, Robert McCarthy, David Lindner·
arXiv:2605.02269v1 Announce Type: new Abstract: Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where …
arXiv cs.AI
TIER_1·Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan·
arXiv:2605.01341v1 Announce Type: cross Abstract: Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and …
arXiv:2509.12464v2 Announce Type: replace Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produce…
arXiv:2601.04809v5 Announce Type: replace Abstract: Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progr…
arXiv cs.AI
TIER_1·Yunjian Zhang, Sudong Wang, Yang Li, Peiran Xu, Conghao Zhou, Xiaoyue Ma, Jianing Li, Yao Zhu·
arXiv:2602.00815v2 Announce Type: replace Abstract: Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reaso…
arXiv cs.AI
TIER_1·Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu·
arXiv:2603.10384v2 Announce Type: replace Abstract: Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematic…
arXiv:2603.17368v2 Announce Type: replace Abstract: Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In t…
arXiv cs.LG
TIER_1·Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki·
arXiv:2510.09472v2 Announce Type: replace-cross Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications such as logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: c…
arXiv:2605.03050v1 Announce Type: new Abstract: Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fa…
arXiv:2605.03314v1 Announce Type: new Abstract: In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a "silence tax": additional deliberation postpones the first task-relevant…
arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: o…
arXiv:2605.03344v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumpti…
Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a…
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG …
arXiv:2509.24276v4 Announce Type: replace Abstract: Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs…
arXiv:2602.13595v2 Announce Type: replace Abstract: Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate tha…
arXiv cs.CL
TIER_1·Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless·
arXiv:2601.20055v2 Announce Type: replace Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers t…
arXiv:2601.08510v3 Announce Type: replace Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question ans…
arXiv cs.CL
TIER_1·Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu·
arXiv:2509.03540v3 Announce Type: replace Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external …
arXiv:2501.19201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces signifi…
arXiv cs.CL
TIER_1·Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego·
arXiv:2501.09775v3 Announce Type: replace Abstract: Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through…
arXiv:2605.02442v1 Announce Type: cross Abstract: In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alon…
arXiv:2605.01111v1 Announce Type: cross Abstract: In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to per…
arXiv:2605.01939v1 Announce Type: new Abstract: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the…
arXiv:2605.01870v1 Announce Type: new Abstract: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their…
arXiv:2605.01704v1 Announce Type: new Abstract: When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents itera…
arXiv:2605.01495v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventional RAG systems under-perform on structured tabular data, largely due to coar…
arXiv cs.CL
TIER_1(AF)·Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon·
arXiv:2605.01399v1 Announce Type: new Abstract: The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to b…
arXiv:2601.05300v2 Announce Type: replace-cross Abstract: Reasoning-oriented language models typically expose explicit reasoning as a long, front-loaded chain of "thinking" tokens before the main output, either always enabled or externally toggled at inference time. Although this…
arXiv:2604.21593v2 Announce Type: replace Abstract: As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the …
arXiv:2605.02122v1 Announce Type: new Abstract: Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator…
arXiv cs.LG
TIER_1·Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti·
arXiv:2504.15077v5 Announce Type: replace Abstract: Large Language Models (LLMs) can translate natural language into SQL, but small models struggle with multi-table and complex queries in Zero-Shot Learning (ZSL) settings. While Supervised Fine-Tuning (SFT) helps, it falls short …
arXiv:2604.05134v2 Announce Type: replace Abstract: We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets influences language model performance in chess. …
arXiv:2605.00063v1 Announce Type: cross Abstract: Reasoning-Intensive Retrieval (RIR) targets retrieval settings where relevance is mediated by latent inferential links between a query and supporting evidence, rather than semantic similarity. Motivated by the emergent reasoning a…
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a "silence tax": additional deliberation postpones the first task-relevant content, while naive early stream…
Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to challenge such erroneous assumptions, and …
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reason…
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended act…
arXiv cs.LG
TIER_1·Arunabh Srivastava, Mohammad A. (Amir) Khojastepour, Srimat Chakradhar, Sennur Ulukus·
arXiv:2605.00798v1 Announce Type: new Abstract: Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language pla…
arXiv:2605.00199v1 Announce Type: cross Abstract: When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning w…
arXiv cs.LG
TIER_1·Yuxuan Gao, Megan Wang, Yi Ling Yu·
arXiv:2605.00300v1 Announce Type: cross Abstract: Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quan…
arXiv:2504.10368v4 Announce Type: replace Abstract: This paper explores the system 1 thinking capability of Large Reasoning Models (LRMs), the intuitive ability to respond efficiently with minimal token usage. While existing LRMs rely on long-chain reasoning and excel at complex …
arXiv:2508.21762v3 Announce Type: replace Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks s…
arXiv cs.CL
TIER_1·Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu·
arXiv:2602.03141v3 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and compu…
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yie…
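One common alternative to the plain majority vote criticized here is to weight each annotator's vote by an estimated reliability. The sketch below shows the generic idea only; the paper's actual aggregation model is not specified in this snippet.

```python
from collections import defaultdict

def reliability_weighted_vote(labels, reliability):
    """Aggregate annotator labels for one item, weighting each vote by
    the annotator's estimated reliability instead of counting votes
    equally (as plain majority vote does).

    labels:      {annotator_id: label}
    reliability: {annotator_id: weight in [0, 1]}
    """
    scores = defaultdict(float)
    for annotator, label in labels.items():
        scores[label] += reliability.get(annotator, 0.5)  # default: uninformed
    return max(scores, key=scores.get)
```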
Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In…
Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enab…
When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to …
Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through co…
arXiv:2604.27644v1 Announce Type: cross Abstract: We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introd…
arXiv cs.AI
TIER_1·Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras·
arXiv:2604.27251v1 Announce Type: cross Abstract: Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoni…
arXiv:2604.27201v1 Announce Type: cross Abstract: Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Ex…
arXiv:2604.27960v1 Announce Type: new Abstract: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While …
arXiv:2604.27472v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot lear…
arXiv:2509.23744v4 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added …
arXiv:2604.27576v1 Announce Type: cross Abstract: We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, an…
arXiv:2603.19044v3 Announce Type: replace Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-lev…
arXiv:2604.27998v1 Announce Type: cross Abstract: Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning met…
arXiv cs.AI
TIER_1·Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu·
arXiv:2512.10941v2 Announce Type: replace-cross Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images ar…
Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving s…
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidenc…
Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and rein…
Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these…
We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in wh…
We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, and preferred interpretations, as well as two-valued…
arXiv:2604.26507v1 Announce Type: new Abstract: Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, showing diminishing returns, and still lack solid reasoning abilities. These limits could…
arXiv cs.CL
TIER_1·Dongxin Guo, Jikun Wu, Siu Ming Yiu·
arXiv:2604.26649v1 Announce Type: cross Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current R…
arXiv:2604.26573v1 Announce Type: new Abstract: Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exp…
Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abd…
Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data…
Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before r…
Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit…
Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will …
Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, showing diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through synergistic combination of…
arXiv:2604.24809v1 Announce Type: new Abstract: We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a …
arXiv:2603.01070v2 Announce Type: replace Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have dem…
arXiv:2604.25444v1 Announce Type: new Abstract: Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment m…
arXiv:2510.16340v2 Announce Type: replace Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This developme…
arXiv cs.CL
TIER_1·Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang·
arXiv:2510.07499v2 Announce Type: replace Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents …
arXiv cs.CL
TIER_1·Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn·
arXiv:2604.25800v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces long…
arXiv cs.CL
TIER_1·Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen·
arXiv:2506.05205v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks it is hard to accurately ga…
arXiv:2604.25907v1 Announce Type: new Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis…
Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ th…
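For reference, the Tsallis q-logarithm mentioned here is the standard one-parameter deformation of the natural log (how the paper builds the loss family $J_Q$ from it is cut off above):

```latex
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad
\lim_{q \to 1} \ln_q(x) = \ln(x)
```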
Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied…
Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by …
arXiv:2601.14952v2 Announce Type: replace Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single l…
arXiv:2604.21764v2 Announce Type: replace Abstract: Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliber…
arXiv:2604.24512v1 Announce Type: new Abstract: As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations has emerged as an architectural bottleneck. We identify and formalize a systemic failure mode t…
arXiv:2604.24443v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning…
arXiv:2510.13117v3 Announce Type: replace-cross Abstract: Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inher…
arXiv:2511.08577v2 Announce Type: replace Abstract: Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine e…
arXiv cs.CL
TIER_1·Yuxuan Jiang, Dawei Li, Francis Ferraro·
arXiv:2505.13975v4 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantia…
arXiv:2604.23460v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. C…
arXiv cs.CL
TIER_1·Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu·
arXiv:2604.22951v1 Announce Type: cross Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help mo…
arXiv cs.CL
TIER_1·Sercan Karakaş, Yusuf Şimşek·
arXiv:2604.24665v1 Announce Type: new Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze …
arXiv:2604.24003v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this ove…
arXiv:2604.23623v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches impr…
arXiv:2604.23877v1 Announce Type: new Abstract: Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze …
arXiv:2604.23398v1 Announce Type: new Abstract: We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers "unknown" when the reasoner-entailed answer is "no" under FunctionalProperty closure or class disjointne…
arXiv:2604.23377v1 Announce Type: new Abstract: Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint sati…
This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly…
As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations has emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autore…
Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental chal…
arXiv:2604.21999v1 Announce Type: cross Abstract: We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are e…
arXiv:2604.22062v1 Announce Type: new Abstract: There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow-minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movi…
arXiv:2604.22709v1 Announce Type: new Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging con…
Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, m…
Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows …
While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance l…
There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow-minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louis…
We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations te…
Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrie…
We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that …
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than mer…
Ahead of AI (Sebastian Raschka)
TIER_1·Sebastian Raschka, PhD·
Welcome to the next stage of large language models (LLMs): reasoning. LLMs have transformed how we process and generate text, but their success has been largely driven by statistical pattern recognition. However, new advances in reasoning methodologies now enable LLMs to tackle m…
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-…
Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy betw…
arXiv:2605.07654v1 Announce Type: new Abstract: Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT…
arXiv:2511.19972v3 Announce Type: replace Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-train…
arXiv cs.CV
TIER_1·Xiaoyu Yang, En Yu, Wei Duan, Jie Lu·
arXiv:2510.04142v2 Announce Type: replace Abstract: This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models of…
arXiv stat.ML
TIER_1·Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini·
arXiv:2604.18419v2 Announce Type: replace-cross Abstract: LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods de…
arXiv:2604.18486v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent Co…
Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters? Introduction: Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of man…
arXiv:2604.26521v1 Announce Type: cross Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro…
arXiv:2604.24339v1 Announce Type: new Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information a…
arXiv:2510.15050v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement …
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these p…
arXiv:2510.07632v2 Announce Type: replace-cross Abstract: Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and…
arXiv cs.CV
TIER_1·Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres·
arXiv:2411.08027v3 Announce Type: replace-cross Abstract: Most learning-based approaches to complex physical reasoning sidestep the crucial problem of parameter identification (e.g., mass, friction) that governs scene dynamics, despite its importance in real-world applications su…
**Reasoning Distillation** has emerged as a key technique, with Berkeley/USC researchers releasing **Sky-T1-32B-Preview**, a fine-tune of **Qwen 2.5 32B** trained on 17k reasoning traces for just **$450**, matching **o1-preview** on benchmarks. **DeepSeek** introduced **R1**, a …
**DeepSeek r1** leads the race for "open o1" models but has yet to release weights, while **Justin Lin** released **QwQ**, a **32B open weight model** that outperforms **GPT-4o** and **Claude 3.5 Sonnet** on benchmarks. QwQ appears to be a fine-tuned version of **Qwen 2.5**, emph…
**OpenAI** has released the **o1** model family, including **o1-preview** and **o1-mini**, focusing on test-time reasoning with extended output token limits over 30k tokens. The models show strong performance, ranking in the 89th percentile on competitive programming, excelling i…
In this article, we will talk about *classical computation*: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler …
Joint research from Zojian Power, Peking University, and CUHK proposes LaST-R1, a new embodied AI paradigm that achieves 99.9% success on the LIBERO benchmark — 22.5% higher than π0.5 in real-world tasks.
The "black box" problem in Large Language Models is often discussed as a philosophical hurdle, but for engineers building high-stakes vertical applications, it is a hard technical bottleneck. In domains like legal tech, medical diagnosis, or financial auditing, a correct answe…
A tutorial explores how to parse, analyse and visualise reasoning traces from the lambda/hermes-agent-reasoning-traces dataset. It covers understanding how autonomous AI agents use tools and generate responses across multi-turn conversations. The guide shows how to prepare data f…
📰 Top 5 Agentic Reasoning Benchmarks for LLMs in 2026 That Predict Real-World Performance As AI agents transition from demos to enterprise use, traditional metrics like MMLU fall short. The most critical benchmarks now measure real-world agentic reasoning—navigating complex tasks…
📰 The 7 Most Important Benchmarks for Agentic Reasoning: The Real Tests for LLMs. The agentic reasoning capabilities of LLMs have moved beyond purely academic interest to become a critical advantage in industrial applications. Data from 2025-2026 show how the 7 key benchmarks that measure these capabilities…
📰 AI Agents in Software Development 2025: 5 New Disciplines That Do Not Replace Developers. AI agents are changing software development not by replacing developers but by introducing new disciplines. Researchers from Chalmers University and the Volvo Group show that developers…
📰 AI Agents Are Not Replacing Developers: What Are the New Professions in 2026? (Chalmers Study). The narrative claiming that AI agents will wipe out software developers is misleading, according to new research from Chalmers University and the Volvo Group. The reality is that the technology…