New research boosts LLM reasoning with speculative methods and physical insights
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 211 sources
Recent research explores novel methods to enhance the reasoning capabilities and efficiency of large language models (LLMs). Papers introduce techniques like speculative exploration for Tree-of-Thought reasoning to break synchronization bottlenecks and achieve significant speedups. Other work focuses on improving tool-integrated reasoning by pruning erroneous tool calls at inference time and developing frameworks for robots to perform physical reasoning in latent spaces before acting. Additionally, research investigates the effectiveness of different reasoning protocols, such as debate and voting, for LLMs, finding that while some methods improve safety, they don't always enhance usefulness.
AI
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to …
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer…
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred fo…
As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet…
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronizat…
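For orientation, here is a minimal sketch of plain Tree-of-Thought search (the generic baseline, not the paper's speculative-exploration method). The reward dependency barrier is visible in the inner loop: every candidate thought must be scored before the next tree level can expand. `propose` and `score` are hypothetical stand-ins for an LLM thought proposer and a reward/value model.

```python
# Generic breadth-first Tree-of-Thought sketch; `propose` and `score`
# are hypothetical stand-ins for an LLM thought proposer and a
# reward/value model, not the paper's interface.
def tree_of_thought(problem, propose, score, depth=3, branch=4, beam=2):
    frontier = [("", 0.0)]  # (partial reasoning trace, cumulative value)
    for _ in range(depth):
        candidates = []
        for trace, value in frontier:
            for thought in propose(problem, trace, n=branch):
                new_trace = trace + "\n" + thought
                # Reward dependency barrier: the next level cannot be
                # expanded until every candidate here has been scored.
                candidates.append((new_trace, value + score(problem, new_trace)))
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return max(frontier, key=lambda c: c[1])[0]
```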
Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how …
Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure …
When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched …
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe …
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a di…
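For reference, the "dominant response" this abstract criticizes, permanently evicting low-importance tokens once the cache exceeds a budget, can be sketched as below. The importance scores are assumed to be, e.g., cumulative attention received per token; this is the generic eviction baseline, not the paper's alternative.

```python
import torch

def evict_low_importance(keys, values, importance, budget):
    """Generic KV-cache eviction baseline (the approach the abstract
    reports as catastrophic for reasoning traces): permanently keep
    only the `budget` highest-importance tokens.

    keys, values: [seq_len, num_heads, head_dim] cached tensors
    importance:   [seq_len], e.g. cumulative attention each token received
    """
    if keys.size(0) <= budget:
        return keys, values, importance
    keep = torch.topk(importance, budget).indices.sort().values  # keep order
    return keys[keep], values[keep], importance[keep]
```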
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formal…
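The partial-credit signal described here comes down to a weighted average of per-criterion judge scores. A minimal sketch, assuming a `judge` callable that returns a score in [0, 1] for one criterion (the criteria and interface below are illustrative, not the paper's):

```python
def rubric_reward(response, criteria, judge):
    """Partial-credit reward: score each (description, weight) criterion
    independently with an LLM judge, then combine as a weighted average
    rather than a single binary or holistic outcome."""
    total_weight = sum(weight for _, weight in criteria)
    weighted = sum(weight * judge(response, desc) for desc, weight in criteria)
    return weighted / total_weight

# Hypothetical task-specific, verifiable criteria with weights:
criteria = [
    ("final answer is stated explicitly", 1.0),
    ("each algebraic step is justified", 2.0),
    ("units are carried through correctly", 0.5),
]
```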
Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps…
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses…
Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treatin…
Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how co…
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, w…
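Self-consistency as used here is straightforward to state in code: sample several independent CoT traces, extract each final answer, and return the most common one. `sample_cot` and `extract_answer` are hypothetical helpers standing in for an LLM call and an answer parser.

```python
from collections import Counter

def self_consistency(question, sample_cot, extract_answer, n=16):
    """Majority voting (MV) over n independently sampled
    Chain-of-Thought traces: the standard self-consistency baseline."""
    answers = [extract_answer(sample_cot(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```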
Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as an external evaluator disjointed from the policy…
Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous stat…
Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, an…
arXiv:2605.06638v1 Announce Type: cross Abstract: Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments.…
arXiv:2605.05737v1 Announce Type: cross Abstract: Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across…
arXiv:2605.06188v1 Announce Type: cross Abstract: On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-…
arXiv:2605.05893v1 Announce Type: new Abstract: Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose…
arXiv cs.CL
TIER_1·Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera·
arXiv:2605.05532v1 Announce Type: new Abstract: This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixt…
arXiv cs.LG
TIER_1·Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou·
arXiv:2512.18857v3 Announce Type: replace-cross Abstract: Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelin…
arXiv cs.LG
TIER_1·Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi·
arXiv:2510.02312v2 Announce Type: replace Abstract: Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifac…
arXiv cs.LG
TIER_1·William T. Redman, Erik C. Johnson, Brian Robinson·
arXiv:2605.05495v1 Announce Type: new Abstract: Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible comp…
arXiv:2605.05438v1 Announce Type: new Abstract: Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate th…
arXiv:2605.06326v1 Announce Type: new Abstract: Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong th…
arXiv cs.CL
TIER_1·David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal·
arXiv:2602.11509v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and v…
arXiv:2605.05407v1 Announce Type: new Abstract: Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which of…
arXiv:2605.05561v1 Announce Type: new Abstract: Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of new…
arXiv:2605.05678v1 Announce Type: new Abstract: Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in re…
arXiv cs.AI
TIER_1·Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria·
arXiv:2605.06165v1 Announce Type: new Abstract: As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-w…
arXiv cs.AI
TIER_1·Marc Boubnovski Martell, Josefa Lia Stoisser, Kaspar Märtens, Jialin Yu, Robert Kitchen, Philip Torr, Jesper Ferkinghoff-Borg·
arXiv:2605.06308v1 Announce Type: new Abstract: Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry …
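The black-box baseline named here is concrete: confidence is the vote share of the modal answer over K sampled traces, which costs one full generation per sample (hence linearly expensive in K). A minimal sketch with a hypothetical `sample_answer` helper:

```python
from collections import Counter

def self_consistency_confidence(question, sample_answer, k=10):
    """Black-box confidence baseline: sample K chain-of-thought answers
    through a text-only API and report the agreement fraction of the
    most common answer. Cost grows linearly with K."""
    answers = [sample_answer(question) for _ in range(k)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / k
```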
arXiv:2508.16745v3 Announce Type: replace Abstract: Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation …
arXiv:2605.05566v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequen…
arXiv cs.LG
TIER_1·Aymen Echarghaoui, Dongxia Wu, Emily B. Fox·
arXiv:2605.05386v1 Announce Type: cross Abstract: Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled me…
arXiv:2605.06660v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training a…
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Exis…
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reas…
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. …
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However…
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regulariz…
arXiv:2605.04078v1 Announce Type: new Abstract: Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher-student hie…
arXiv:2605.04352v1 Announce Type: new Abstract: We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a fin…
arXiv cs.LG
TIER_1·Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Karl Audun K. Borgersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Grønningsæter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Z·
arXiv:2507.14874v2 Announce Type: replace Abstract: Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine (TM) both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of dat…
arXiv cs.CL
TIER_1·Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang·
arXiv:2508.04204v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. C…
arXiv cs.AI
TIER_1·Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn·
arXiv:2604.17654v3 Announce Type: replace Abstract: Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for…
arXiv:2605.02173v1 Announce Type: new Abstract: We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures…
arXiv cs.AI
TIER_1·Kei Nishimura-Gasparian, Robert McCarthy, David Lindner·
arXiv:2605.02269v1 Announce Type: new Abstract: Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where …
arXiv cs.AI
TIER_1·Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan·
arXiv:2605.01341v1 Announce Type: cross Abstract: Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and …
arXiv:2509.12464v2 Announce Type: replace Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produce…
arXiv:2601.04809v5 Announce Type: replace Abstract: Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progr…
arXiv cs.AI
TIER_1·Yunjian Zhang, Sudong Wang, Yang Li, Peiran Xu, Conghao Zhou, Xiaoyue Ma, Jianing Li, Yao Zhu·
arXiv:2602.00815v2 Announce Type: replace Abstract: Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reaso…
arXiv cs.AI
TIER_1·Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu·
arXiv:2603.10384v2 Announce Type: replace Abstract: Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematic…
arXiv:2603.17368v2 Announce Type: replace Abstract: Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In t…
arXiv cs.LG
TIER_1·Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki·
arXiv:2510.09472v2 Announce Type: replace-cross Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications such as logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: c…
arXiv:2605.03050v1 Announce Type: new Abstract: Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fa…
arXiv:2605.03314v1 Announce Type: new Abstract: In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a "silence tax": additional deliberation postpones the first task-relevant…
arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: o…
arXiv:2605.03344v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumpti…
Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a…
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG …
arXiv:2509.24276v4 Announce Type: replace Abstract: Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs…
arXiv:2602.13595v2 Announce Type: replace Abstract: Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate tha…
arXiv cs.CL
TIER_1·Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless·
arXiv:2601.20055v2 Announce Type: replace Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers t…
arXiv:2601.08510v3 Announce Type: replace Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question ans…
arXiv cs.CL
TIER_1·Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu·
arXiv:2509.03540v3 Announce Type: replace Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external …
arXiv:2501.19201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces signifi…
arXiv cs.CL
TIER_1·Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego·
arXiv:2501.09775v3 Announce Type: replace Abstract: Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through…
arXiv:2605.02442v1 Announce Type: cross Abstract: In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alon…
arXiv:2605.01111v1 Announce Type: cross Abstract: In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to per…
arXiv:2605.01939v1 Announce Type: new Abstract: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the…
arXiv:2605.01870v1 Announce Type: new Abstract: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their…
arXiv:2605.01704v1 Announce Type: new Abstract: When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents itera…
arXiv:2605.01495v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventional RAG systems under-perform on structured tabular data, largely due to coar…
arXiv cs.CL
TIER_1(AF)·Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon·
arXiv:2605.01399v1 Announce Type: new Abstract: The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to b…
arXiv:2601.05300v2 Announce Type: replace-cross Abstract: Reasoning-oriented language models typically expose explicit reasoning as a long, front-loaded chain of "thinking" tokens before the main output, either always enabled or externally toggled at inference time. Although this…
arXiv:2604.21593v2 Announce Type: replace Abstract: As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the …
arXiv:2605.02122v1 Announce Type: new Abstract: Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator…
arXiv cs.LG
TIER_1·Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti·
arXiv:2504.15077v5 Announce Type: replace Abstract: Large Language Models (LLMs) can translate natural language into SQL, but small models struggle with multi-table and complex queries in Zero-Shot Learning (ZSL) settings. While Supervised Fine-Tuning (SFT) helps, it falls short …
arXiv:2604.05134v2 Announce Type: replace Abstract: We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets influences language model performance in chess. …
arXiv:2605.00063v1 Announce Type: cross Abstract: Reasoning-Intensive Retrieval (RIR) targets retrieval settings where relevance is mediated by latent inferential links between a query and supporting evidence, rather than semantic similarity. Motivated by the emergent reasoning a…
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a "silence tax": additional deliberation postpones the first task-relevant content, while naive early stream…
Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to challenge such erroneous assumptions, and …
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reason…
Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended act…
arXiv cs.LG
TIER_1·Arunabh Srivastava, Mohammad A. (Amir) Khojastepour, Srimat Chakradhar, Sennur Ulukus·
arXiv:2605.00798v1 Announce Type: new Abstract: Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language pla…
arXiv:2605.00199v1 Announce Type: cross Abstract: When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning w…
arXiv cs.LG
TIER_1·Yuxuan Gao, Megan Wang, Yi Ling Yu·
arXiv:2605.00300v1 Announce Type: cross Abstract: Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quan…
arXiv:2504.10368v4 Announce Type: replace Abstract: This paper explores the system 1 thinking capability of Large Reasoning Models (LRMs), the intuitive ability to respond efficiently with minimal token usage. While existing LRMs rely on long-chain reasoning and excel at complex …
arXiv:2508.21762v3 Announce Type: replace Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks s…
arXiv cs.CL
TIER_1·Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu·
arXiv:2602.03141v3 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and compu…
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yie…
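One common alternative to the plain majority vote criticized here is to weight each annotator's vote by an estimated reliability. The sketch below shows the generic idea only; the paper's actual aggregation model is not specified in this snippet.

```python
from collections import defaultdict

def reliability_weighted_vote(labels, reliability):
    """Aggregate annotator labels for one item, weighting each vote by
    the annotator's estimated reliability instead of counting votes
    equally (as plain majority vote does).

    labels:      {annotator_id: label}
    reliability: {annotator_id: weight in [0, 1]}
    """
    scores = defaultdict(float)
    for annotator, label in labels.items():
        scores[label] += reliability.get(annotator, 0.5)  # default: uninformed
    return max(scores, key=scores.get)
```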
Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In…
Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enab…
When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to …
Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through co…
arXiv:2604.27644v1 Announce Type: cross Abstract: We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introd…
arXiv cs.AI
TIER_1·Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras·
arXiv:2604.27251v1 Announce Type: cross Abstract: Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoni…
arXiv:2604.27201v1 Announce Type: cross Abstract: Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Ex…
arXiv:2604.27960v1 Announce Type: new Abstract: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While …
arXiv:2604.27472v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot lear…
arXiv:2509.23744v4 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added …
arXiv:2604.27576v1 Announce Type: cross Abstract: We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, an…
arXiv:2603.19044v3 Announce Type: replace Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-lev…
arXiv:2604.27998v1 Announce Type: cross Abstract: Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning met…
arXiv cs.AI
TIER_1·Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu·
arXiv:2512.10941v2 Announce Type: replace-cross Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images ar…
Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving s…
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidenc…
Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and rein…
Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these…
We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in wh…
We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, and preferred interpretations, as well as two-valued…
arXiv:2604.26507v1 Announce Type: new Abstract: Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, showing diminishing returns, and still lack solid reasoning abilities. These limits could…
arXiv cs.CL
TIER_1·Dongxin Guo, Jikun Wu, Siu Ming Yiu·
arXiv:2604.26649v1 Announce Type: cross Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current R…
arXiv:2604.26573v1 Announce Type: new Abstract: Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exp…
Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abd…
Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data…
Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before r…
Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit…
Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will …
Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, showing diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through synergistic combination of…
arXiv:2604.24809v1 Announce Type: new Abstract: We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a …
arXiv:2603.01070v2 Announce Type: replace Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have dem…
arXiv:2604.25444v1 Announce Type: new Abstract: Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment m…
arXiv:2510.16340v2 Announce Type: replace Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This developme…
arXiv cs.CL
TIER_1·Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang·
arXiv:2510.07499v2 Announce Type: replace Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents …
arXiv cs.CL
TIER_1·Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn·
arXiv:2604.25800v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces long…
arXiv cs.CL
TIER_1·Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen·
arXiv:2506.05205v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks it is hard to accurately ga…
arXiv:2604.25907v1 Announce Type: new Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis…
Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ th…
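For reference, the Tsallis q-logarithm mentioned here is the standard one-parameter deformation of the natural log (how the paper builds the loss family $J_Q$ from it is cut off above):

```latex
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad
\lim_{q \to 1} \ln_q(x) = \ln(x)
```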
Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied…
Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by …
arXiv:2601.14952v2 Announce Type: replace Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single l…
arXiv:2604.21764v2 Announce Type: replace Abstract: Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliber…
arXiv:2604.24512v1 Announce Type: new Abstract: As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations has emerged as an architectural bottleneck. We identify and formalize a systemic failure mode t…
arXiv:2604.24443v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning…
arXiv:2510.13117v3 Announce Type: replace-cross Abstract: Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inher…
arXiv:2511.08577v2 Announce Type: replace Abstract: Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine e…
arXiv cs.CL
TIER_1·Yuxuan Jiang, Dawei Li, Francis Ferraro·
arXiv:2505.13975v4 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantia…
arXiv:2604.23460v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. C…
arXiv cs.CL
TIER_1·Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu·
arXiv:2604.22951v1 Announce Type: cross Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help mo…
arXiv cs.CL
TIER_1·Sercan Karakaş, Yusuf Şimşek·
arXiv:2604.24665v1 Announce Type: new Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze …
arXiv:2604.24003v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this ove…
arXiv:2604.23623v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches impr…
arXiv:2604.23877v1 Announce Type: new Abstract: Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze …
arXiv:2604.23398v1 Announce Type: new Abstract: We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers "unknown" when the reasoner-entailed answer is "no" under FunctionalProperty closure or class disjointne…
arXiv:2604.23377v1 Announce Type: new Abstract: Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint sati…
This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly…
As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations has emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autore…
Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental chal…
arXiv:2604.21999v1 Announce Type: cross Abstract: We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are e…
arXiv:2604.22062v1 Announce Type: new Abstract: There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow-minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movi…
arXiv:2604.22709v1 Announce Type: new Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging con…
Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, m…
Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows …
While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance l…
There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow-minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louis…
We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations te…
Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrie…
We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that …
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than mer…
Ahead of AI (Sebastian Raschka)
TIER_1·Sebastian Raschka, PhD·
Welcome to the next stage of large language models (LLMs): reasoning. LLMs have transformed how we process and generate text, but their success has been largely driven by statistical pattern recognition. However, new advances in reasoning methodologies now enable LLMs to tackle m…
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-…
Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy betw…
arXiv:2605.07654v1 Announce Type: new Abstract: Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT…
arXiv:2511.19972v3 Announce Type: replace Abstract: Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-train…
arXiv cs.CV
TIER_1·Xiaoyu Yang, En Yu, Wei Duan, Jie Lu·
arXiv:2510.04142v2 Announce Type: replace Abstract: This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models of…
arXiv stat.ML
TIER_1·Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini·
arXiv:2604.18419v2 Announce Type: replace-cross Abstract: LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods de…
arXiv:2604.18486v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent Co…
Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters? Introduction: Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of man…
arXiv:2604.26521v1 Announce Type: cross Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro…
arXiv:2604.24339v1 Announce Type: new Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information a…
arXiv:2510.15050v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement …
Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these p…
arXiv:2510.07632v2 Announce Type: replace-cross Abstract: Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and…
arXiv cs.CV
TIER_1·Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres·
arXiv:2411.08027v3 Announce Type: replace-cross Abstract: Most learning-based approaches to complex physical reasoning sidestep the crucial problem of parameter identification (e.g., mass, friction) that governs scene dynamics, despite its importance in real-world applications su…
**Reasoning Distillation** has emerged as a key technique, with Berkeley/USC researchers releasing **Sky-T1-32B-Preview**, a fine-tune of **Qwen 2.5 32B** trained on 17k reasoning traces for just **$450**, matching **o1-preview** on benchmarks. **DeepSeek** introduced **R1**, a …
**DeepSeek r1** leads the race for "open o1" models but has yet to release weights, while **Justin Lin** released **QwQ**, a **32B open weight model** that outperforms **GPT-4o** and **Claude 3.5 Sonnet** on benchmarks. QwQ appears to be a fine-tuned version of **Qwen 2.5**, emph…
**OpenAI** has released the **o1** model family, including **o1-preview** and **o1-mini**, focusing on test-time reasoning with extended output token limits over 30k tokens. The models show strong performance, ranking in the 89th percentile on competitive programming, excelling i…
In this article, we will talk about *classical computation*: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler …
Joint research from Zojian Power, Peking University, and CUHK proposes LaST-R1, a new embodied AI paradigm that achieves 99.9% success on the LIBERO benchmark — 22.5% higher than π0.5 in real-world tasks.
The "black box" problem in Large Language Models is often discussed as a philosophical hurdle, but for engineers building high-stakes vertical applications, it is a hard technical bottleneck. In domains like legal tech, medical diagnosis, or financial auditing, a correct answe…
A tutorial explores how to parse, analyse and visualise reasoning traces from the lambda/hermes-agent-reasoning-traces dataset. It covers understanding how autonomous AI agents use tools and generate responses across multi-turn conversations. The guide shows how to prepare data f…
📰 Top 5 Agentic Reasoning Benchmarks for LLMs in 2026 That Predict Real-World Performance As AI agents transition from demos to enterprise use, traditional metrics like MMLU fall short. The most critical benchmarks now measure real-world agentic reasoning—navigating complex tasks…
📰 The 7 Most Important Benchmarks for Agentic Reasoning: The Real Tests for LLMs. The agentic reasoning capabilities of LLMs have moved beyond purely academic interest to become a critical advantage in industrial applications. Data from 2025-2026 show how the 7 key benchmarks that measure these capabilities…
📰 AI Agents in Software Development 2025: 5 New Disciplines That Do Not Replace Developers. AI agents are changing software development not by replacing developers but by introducing new disciplines. Researchers from Chalmers University and the Volvo Group show that developers…
📰 AI Agents Are Not Replacing Developers: What Are the New Professions in 2026? (Chalmers Study). The narrative claiming that AI agents will wipe out software developers is misleading, according to new research from Chalmers University and the Volvo Group. The reality is that the technology…