PulseAugur

Google advances LLM alignment and accuracy, Hugging Face explores multi-LLM collaboration

Google Research has introduced a new framework for evaluating the alignment of behavioral dispositions in large language models, adapting established psychological assessments into situational judgment tests. The approach quantifies model tendencies against human social inclinations, identifying deviations from human consensus. Separately, Google Research developed SLED (Self Logits Evolution Decoding), a decoding method that improves LLM factuality by using all model layers rather than only the final one, without requiring external data or fine-tuning.

Summary written by gemini-2.5-flash-lite from 222 sources.

IMPACT New methods for evaluating LLM alignment and improving factuality could lead to more reliable and trustworthy AI systems in various applications.

RANK_REASON The cluster contains two research papers from Google Research detailing new methods for evaluating LLM alignment and improving LLM factuality.

Read on Google AI / Research →
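
The SLED summary above is high-level; as a rough illustration of the core idea (our sketch, not the paper's algorithm), early-layer hidden states can be read out through the model's shared unembedding matrix and blended with the final-layer distribution. Every name and the blending rule below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_blended_next_token_dist(hidden_states, W_unembed, alpha=0.1):
    """Blend the final layer's next-token distribution with a consensus of
    distributions read out from earlier layers via the shared unembedding.

    hidden_states: (num_layers, d_model) last-position hidden state per layer
    W_unembed:     (d_model, vocab_size) unembedding matrix
    alpha:         assumed blending weight; SLED's actual logits-evolution
                   step is more involved than this simple average.
    """
    layer_logits = hidden_states @ W_unembed            # (num_layers, vocab)
    final_dist = softmax(layer_logits[-1])
    early_consensus = softmax(layer_logits[:-1]).mean(axis=0)
    return (1 - alpha) * final_dist + alpha * early_consensus
```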


COVERAGE [222]

  1. Google AI / Research TIER_1 ·

    Evaluating alignment of behavioral dispositions in LLMs

    Generative AI

  2. Google AI / Research TIER_1 ·

    Making LLMs more accurate by using all of their layers

    Algorithms & Theory

  3. Hugging Face Blog TIER_1 ·

    Consilium: When Multiple LLMs Collaborate

  4. Hugging Face Blog TIER_1 ·

    Mastering Long Contexts in LLMs with KVPress

  5. Hugging Face Blog TIER_1 ·

    Judge Arena: Benchmarking LLMs as Evaluators

  6. Hugging Face Blog TIER_1 ·

    Expert Support case study: Bolstering a RAG app with LLM-as-a-Judge

  7. Hugging Face Blog TIER_1 ·

    CodeGemma - an official Google release for code LLMs

  8. Hugging Face Blog TIER_1 ·

    Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

  9. Hugging Face Blog TIER_1 ·

    Open-source LLMs as LangChain Agents

  10. Hugging Face Blog TIER_1 ·

    Introducing Agents.js: Give tools to your LLMs using JavaScript

  11. arXiv cs.AI TIER_1 · Devvrit Khatri ·

    Learning, Fast and Slow: Towards LLMs That Adapt Continually

    Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context lea…

  12. arXiv cs.AI TIER_1 · Xiaoxing Ma ·

    Uncertainty Quantification for LLM-based Code Generation

    Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt pro…
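
    For readers new to prediction sets, the standard split-conformal construction the abstract builds on looks roughly like this (a textbook baseline under exchangeability assumptions, not the paper's adaptation to code generation; the score used here is a placeholder):

    ```python
    import numpy as np

    def conformal_threshold(cal_scores, alpha=0.1):
        """Split conformal prediction: take the ceil((n+1)(1-alpha))/n empirical
        quantile of calibration nonconformity scores (higher = less conforming)."""
        n = len(cal_scores)
        q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return np.quantile(cal_scores, q, method="higher")

    def prediction_set(candidate_scores, tau):
        # Keep every candidate whose nonconformity score falls within tau.
        return [i for i, s in enumerate(candidate_scores) if s <= tau]

    # Toy usage with an assumed score: 1 - model probability of correctness.
    cal_scores = 1 - np.random.beta(8, 2, size=500)
    tau = conformal_threshold(cal_scores, alpha=0.1)
    ```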

  13. arXiv cs.CL TIER_1 · Fuli Feng ·

    SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

    Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing…

  14. arXiv cs.CL TIER_1 · Dayiheng Liu ·

    On Predicting the Post-training Potential of Pre-trained LLMs

    The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to…

  15. arXiv cs.CL TIER_1 · Xiangdong Su ·

    Training-Inference Consistent Segmented Execution for Long-Context LLMs

    Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve eff…

  16. arXiv cs.LG TIER_1 · Marco Cuturi ·

    DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures

    Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristic…

  17. arXiv cs.AI TIER_1 · Jens Albrecht ·

    LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

    We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for …

  18. arXiv cs.CL TIER_1 · Martin Vechev ·

    Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems.…

  19. Hugging Face Daily Papers TIER_1 ·

    Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

    Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…

  20. arXiv cs.AI TIER_1 · Jing Li ·

    Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

    Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…

  21. arXiv cs.CL TIER_1 · Deep Shah ·

    The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods

    Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax op…
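
    The snippet cuts off mid-definition, but the failure mode it names is easy to picture: constrained decoding renormalizes over only the literal label tokens, discarding probability mass on synonymous surface forms. A minimal sketch of one neighborhood-aggregation fix (our guess at the flavor of the approach; the neighborhoods and scoring below are assumptions, not the paper's method):

    ```python
    # Hypothetical surface-form neighborhoods for a binary labeling task.
    NEIGHBORHOODS = {
        "yes": ["yes", "Yes", "yeah", "correct", "true"],
        "no":  ["no", "No", "nope", "incorrect", "false"],
    }

    def neighborhood_vote(token_probs):
        """token_probs: dict mapping next-token string -> probability.
        Sum mass over each label's semantic neighborhood before normalizing,
        instead of renormalizing over the two literal labels alone."""
        mass = {label: sum(token_probs.get(t, 0.0) for t in forms)
                for label, forms in NEIGHBORHOODS.items()}
        total = sum(mass.values()) or 1.0
        return {label: m / total for label, m in mass.items()}
    ```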

  22. arXiv cs.CL TIER_1 · Yanran Li ·

    Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

    Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilis…

  23. arXiv cs.CL TIER_1 · Ash Lewis ·

    GLiGuard: Schema-Conditioned Classification for LLM Safeguard

    Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fun…

  24. arXiv cs.CL TIER_1 · Ning Xu ·

    Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

    Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals an…

  25. arXiv cs.AI TIER_1 · James Z. Wang ·

    Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

    Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been show…

  26. arXiv cs.AI TIER_1 · Abbas Rahimi ·

    POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

    Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification …

  27. arXiv cs.AI TIER_1 · Mark James Carman ·

    Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared …

  28. arXiv cs.LG TIER_1 · Nan Jiang ·

    Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existi…

  29. arXiv cs.CL TIER_1 · Xiaozhang Liu ·

    From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

    Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for sur…

  30. arXiv cs.CL TIER_1 · Ruben Fernandez-Boullon, David N. Olivieri ·

    Patch-Effect Graph Kernels for LLM Interpretability

    arXiv:2605.06480v1 · Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high…

  31. arXiv cs.AI TIER_1 · Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit ·

    Visual Fingerprints for LLM Generation Comparison

    arXiv:2605.06054v1 · Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which …

  32. arXiv cs.AI TIER_1 · Nguyen Viet Tuan Kiet, Bui Dinh Pham, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh ·

    Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

    arXiv:2605.06123v1 · Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search o…

  33. arXiv cs.AI TIER_1 · Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong, Xiaowei Huang ·

    PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

    arXiv:2605.06455v1 · Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored e…

  34. arXiv cs.AI TIER_1 · Kaifeng He, Xiaojun Zhang, Peiliang Cai, Mingwei Liu, Yanlin Wang, Chong Wang, Kaifeng Huang, Bihuan Chen, Xin Peng, Zibin Zheng ·

    Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    arXiv:2605.05267v1 · Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empi…

  35. arXiv cs.AI TIER_1 · Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao ·

    Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

    arXiv:2605.06111v1 · Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified mul…

  36. arXiv cs.AI TIER_1 · Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao ·

    Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

    arXiv:2605.06279v1 · Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version ch…

  37. arXiv cs.CL TIER_1 · Atharva Naik, Yash Mathur, Prakam, Carolyn Rose, David Mortensen ·

    ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

    arXiv:2605.05485v1 · LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic …

  38. arXiv cs.LG TIER_1 · Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng ·

    Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

    arXiv:2605.05957v1 · LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode "correction suppression" and cons…

  39. arXiv cs.LG TIER_1 · Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal ·

    Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    arXiv:2605.05973v1 · Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy proc…

  40. arXiv cs.LG TIER_1 · Florian A. D. Burnat, Brittany I. Davidson ·

    Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    arXiv:2605.06327v1 · Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context…

  41. arXiv cs.LG TIER_1 · Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck ·

    MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

    arXiv:2605.06334v1 · Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is chall…

  42. arXiv cs.LG TIER_1 · Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian ·

    Sample-efficient LLM Optimization with Reset Replay

    arXiv:2508.06412v3 · Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency …

  43. arXiv cs.LG TIER_1 · Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei ·

    LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

    arXiv:2601.20375v2 · Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (…

  44. arXiv cs.LG TIER_1 · Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov ·

    Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

    arXiv:2512.09538v2 · Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring …
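
    Consistency-based UQ typically scores agreement among sampled generations; the title suggests reusing beam hypotheses instead of discarding them. One plausible probability-weighted variant (our illustration; the paper's estimator may differ):

    ```python
    import numpy as np

    def weighted_consistency(beams, beam_logprobs, same_meaning):
        """Probability-weighted pairwise agreement among beam hypotheses.
        beams: output strings kept from beam search; beam_logprobs: their
        sequence log-probabilities; same_meaning(a, b) -> bool is an external
        equivalence check (e.g., an NLI model). High agreement = low uncertainty."""
        lp = np.asarray(beam_logprobs, dtype=float)
        w = np.exp(lp - lp.max())
        w /= w.sum()
        n = len(beams)
        return sum(w[i] * w[j] for i in range(n) for j in range(n)
                   if same_meaning(beams[i], beams[j]))
    ```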

  45. arXiv cs.LG TIER_1 · Andy Zeyi Liu, Elliot Paquette, John Sous ·

    Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    arXiv:2605.05683v1 · Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family…

  46. arXiv cs.LG TIER_1 · Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler ·

    When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    arXiv:2605.06652v1 · Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and …

  47. arXiv cs.LG TIER_1 · Dylan Bouchard ·

    Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

    arXiv:2605.06350v1 · Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical…
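
    The deferral mechanics are simple to state concretely. A minimal two-model cascade, with the confidence measure and threshold as illustrative placeholders rather than the paper's decision-theoretic characterization:

    ```python
    def cascade_answer(query, cheap_model, expensive_model, tau=0.8):
        """Answer with the cheap model; escalate when its confidence is low.

        cheap_model / expensive_model: callables returning (answer, confidence)
        with confidence in [0, 1] (e.g., mean token probability, an assumption).
        tau: deferral threshold, treated here as a plain hyperparameter."""
        answer, conf = cheap_model(query)
        if conf >= tau:
            return answer, "cheap"
        answer, _ = expensive_model(query)
        return answer, "expensive"
    ```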

  48. arXiv cs.LG TIER_1 · Xinrui Chen, Liu Yang, Ou Wu ·

    One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

    arXiv:2605.06166v1 · In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly…

  49. arXiv cs.LG TIER_1 · Jonas Bayer, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Michael Tautschnig, Soonho Kong ·

    Teaching LLMs Program Semantics via Symbolic Execution Traces

    arXiv:2605.06184v1 · We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We fin…

  50. arXiv cs.AI TIER_1 · Michael A. Riegler ·

    When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

    Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-base…

  51. arXiv cs.AI TIER_1 · David N. Olivieri ·

    Patch-Effect Graph Kernels for LLM Interpretability

    Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are diffi…

  52. arXiv cs.AI TIER_1 · Xiaowei Huang ·

    PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

    Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM…

  53. arXiv cs.AI TIER_1 · Dylan Bouchard ·

    Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

    Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the ge…

  54. arXiv cs.CL TIER_1 · Anne-Kathrin Schmuck ·

    MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

    Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans i…

  55. arXiv cs.AI TIER_1 · Brittany I. Davidson ·

    Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in…

  56. arXiv cs.CL TIER_1 · Ge Lei, Samuel J. Cooper ·

    Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    arXiv:2605.04764v1 · Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under …

  57. arXiv cs.CL TIER_1 · Sruly Rosenblat, Tim O'Reilly, Ilan Strauss ·

    Beyond Public Access in LLM Pre-Training Data

    arXiv:2505.00020v2 · Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models show recognition of copyrighted content. Our r…

  58. arXiv cs.LG TIER_1 · Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang ·

    DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

    arXiv:2602.05890v2 · Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods …

  59. arXiv cs.AI TIER_1 · Bo Bai ·

    Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

    arXiv:2511.01202v3 · Despite the unprecedented empirical triumphs of LLMs across diverse real-world applications, the prevailing research paradigm remains overwhelmingly heuristic and experimentally driven, inextricably tethered to astronomica…

  60. arXiv cs.AI TIER_1 · Hongkun Yu ·

    Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

    arXiv:2605.03227v1 · Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically …

  61. arXiv cs.LG TIER_1 · Xiao Wang, Yifei Zhang, YongKang Liu, Xiaocui Yang, Zihan Wang, Shi Feng, Daling Wang ·

    From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

    arXiv:2605.04572v1 · Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain…

  62. arXiv cs.LG TIER_1 · Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan ·

    AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    arXiv:2604.16804v2 · Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires speciali…

  63. arXiv cs.LG TIER_1 · Luze Sun, Alina Oprea, Eric Wong ·

    Syntax- and Compilation-Preserving Evasion of LLM Vulnerability Detectors

    arXiv:2602.00305v2 · LLM-based vulnerability detectors are increasingly deployed in CI/CD security gating, yet their resilience to evasion under syntax- and compilation-preserving edits remains poorly understood. We evaluate five attack varian…

  64. arXiv cs.LG TIER_1 · Jonas Kübler, Kailash Budhathoki, Matthäus Kleindessner, Xiong Zhou, Junming Yin, Ashish Khetan, George Karypis ·

    When LLMs get significantly worse: A statistical approach to detect model degradations

    arXiv:2602.10144v2 · Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization.…
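
    The snippet truncates before the method, but a paired item-level comparison is the natural baseline here. A generic exact McNemar-style test on discordant items (named plainly as our stand-in; the paper's statistic may differ):

    ```python
    from scipy.stats import binomtest

    def degradation_pvalue(correct_old, correct_new):
        """Exact test on discordant items: is the new model version wrong on
        significantly more items that the old version got right than vice versa?"""
        worse  = sum(o and not n for o, n in zip(correct_old, correct_new))
        better = sum(n and not o for o, n in zip(correct_old, correct_new))
        if worse + better == 0:
            return 1.0
        return binomtest(worse, worse + better, p=0.5, alternative="greater").pvalue
    ```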

  65. arXiv cs.CL TIER_1 · Samuel J. Cooper ·

    Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends str…

  66. Hugging Face Daily Papers TIER_1 ·

    From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning

    Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidde…

  67. arXiv cs.LG TIER_1 · Yi Liu ·

    Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

    arXiv:2605.03379v1 · Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeat…

  68. arXiv cs.AI TIER_1 · Youpeng Li, Fuxun Yu, Xinda Wang ·

    From SFT to RL: Demystifying the Post-Training Pipeline for LLM-based Vulnerability Detection

    arXiv:2602.14012v2 · The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systema…

  69. arXiv cs.AI TIER_1 · Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, Mojtaba Shahin, Arif Ali Khan, Qiong Feng ·

    Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

    arXiv:2605.01392v1 · Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human …

  70. arXiv cs.AI TIER_1 · Jia Xiao ·

    NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    arXiv:2605.01847v1 · Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment in…

  71. arXiv cs.LG TIER_1 · Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita ·

    Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    arXiv:2605.03441v1 · Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using for…

  72. arXiv cs.LG TIER_1 · Miaomiao Li, Hao Chen, Yang Wang, Tingyuan Zhu, Weijia Zhang, Kaijie Zhu, Kam-Fai Wong, Jindong Wang ·

    Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks

    arXiv:2502.04419v3 · Generating synthetic datasets via large language models (LLMs) has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases in their training data, leading to a critical challenge: when…

  73. arXiv cs.LG TIER_1 · Hyunji Nam, Haoran Li, Natasha Jaques ·

    Maximizing mutual information between prompts and responses improves LLM personalization with no additional data or human oversight

    arXiv:2603.19294v2 · While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high…

  74. arXiv cs.CL TIER_1 · Haesung Lee, Gyubin Choi, Eun-Ju Lee, So-Min Lee, Youkang Ko, Dogyoon Lim, Sung-Kyoung Jang, Yohan Jo ·

    TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

    arXiv:2605.03792v1 · Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance …

  75. arXiv cs.LG TIER_1 · Shannon K. Gallagher, Swati Rallapalli, Tyler Brooks, Chuck Loughin, Michele Sezgin, Ronald Yurko ·

    Analysis and Explainability of LLMs Via Evolutionary Methods

    arXiv:2605.02930v1 · Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to bet…

  76. arXiv cs.CL TIER_1 · Richard A. A. Jonker, Alexander Christiansen, Alexandros Maniatis, Rúben Garrido, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Sérgio Matos ·

    BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

    arXiv:2605.03618v1 · This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of trai…

  77. arXiv cs.CL TIER_1 · Yohan Jo ·

    TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

    Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial proces…

  78. arXiv cs.CL TIER_1 · Sérgio Matos ·

    BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA

    This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraint…

  79. arXiv cs.CL TIER_1 · Shanu Sushmita ·

    Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quan…

  80. arXiv cs.CL TIER_1 · Yi Liu ·

    Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

    Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls.…
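
    The claim that one-call accuracy alone does not determine the voting benefit can be checked with a short simulation (our illustration of the stated setup, not the paper's analysis):

    ```python
    import numpy as np
    from scipy.stats import binom

    def majority_vote_acc(p, k):
        """P(majority of k conditionally i.i.d. calls is correct) for examples
        with per-call correctness probability p (k odd)."""
        return binom.sf(k // 2, k, p)  # P(correct calls >= k//2 + 1)

    # Two populations with identical one-call accuracy (0.60) but different
    # spreads of per-example p; voting at k=11 helps only the concentrated one.
    uniform = np.full(1000, 0.60)
    bimodal = np.concatenate([np.full(600, 0.90), np.full(400, 0.15)])
    for ps in (uniform, bimodal):
        print(round(ps.mean(), 2), round(majority_vote_acc(ps, 11).mean(), 3))
    ```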

  81. arXiv cs.CL TIER_1 · Ian Rios-Sialer ·

    The Homogenization Problem in LLMs: Towards Meaningful Diversity in AI Safety

    arXiv:2601.06116v3 · Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized …

  82. arXiv cs.CL TIER_1 · Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Goa, Juming Xiong, Zhijun Yin, Bradley A. Malin ·

    CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

    arXiv:2605.01011v1 · Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) fr…

  83. arXiv cs.CL TIER_1 · Koshiro Saito, Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki ·

    LLM Output Detectability and Task Performance Can be Jointly Optimized

    arXiv:2605.01350v1 · Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detect…
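
    For context, the detection side of a standard green-list watermark (in the style of Kirchenbauer et al., not necessarily this paper's scheme) is a one-proportion z-test on how often tokens land in the keyed green list:

    ```python
    import math

    def greenlist_z(tokens, in_greenlist, gamma=0.5):
        """z-statistic for a green-list watermark: compare the observed
        green-token fraction against the no-watermark base rate gamma.
        in_greenlist(prev_token, token) -> bool is the keyed partition check."""
        n = len(tokens) - 1
        hits = sum(in_greenlist(tokens[i], tokens[i + 1]) for i in range(n))
        return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    ```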

  84. arXiv cs.CL TIER_1 · Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Schol ·

    Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

    arXiv:2605.01417v1 · Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavil…

  85. arXiv cs.CL TIER_1 · Sadia Asif, Mohammad Mohammadi Amiri ·

    RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

    arXiv:2605.01913v1 · Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features a…

  86. arXiv cs.CL TIER_1 · Noga Peleg Pelc, Gal A. Kaminka, Yoav Goldberg ·

    A Language for Describing Agentic LLM Contexts

    arXiv:2605.01920v1 · Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The desig…

  87. arXiv cs.CL TIER_1 · Pawel Kaplanski (Kaplanski AI Lab) ·

    Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    arXiv:2605.02236v1 · Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in…

  88. arXiv cs.CL TIER_1 · Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen ·

    CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    arXiv:2603.01865v3 · LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are o…

  89. arXiv cs.CL TIER_1 · Antonio Valerio Miceli Barone, Poon Tsz Nok ·

    Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

    arXiv:2604.17010v2 · We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validati…

  90. arXiv cs.CL TIER_1 · Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin ·

    Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

    arXiv:2506.13727v2 · Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mech…

  91. arXiv cs.LG TIER_1 · Nickil Maveli, Antonio Vergari, Shay B. Cohen ·

    Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility

    arXiv:2601.13398v2 · LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that…

  92. arXiv cs.LG TIER_1 · Jimyung Hong, Jaehyung Kim ·

    Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score

    arXiv:2603.23985v2 · Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions …

  93. arXiv cs.LG TIER_1 · Timothée Chauvin, Clément Lalanne, Erwan Le Merrer, Jean-Michel Loubes, François Taïani, Gilles Tredan ·

    Token-Efficient Change Detection in LLM APIs

    arXiv:2602.11083v2 · Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to a…

  94. arXiv cs.AI TIER_1 · Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi, Abhilasha Ravichander, Muhammad Bilal Zafar ·

    To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    arXiv:2605.00737v1 · Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM d…

  95. arXiv cs.AI TIER_1 · Fazle Rabbi, Lin Ling, Song Wang, Jinqiu Yang ·

    Social Bias in LLM-Generated Code: Benchmark and Mitigation

    arXiv:2605.00382v2 · Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leav…

  96. arXiv cs.AI TIER_1 · Abdurrahman Javat, Allan Kazakov ·

    Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    arXiv:2605.00519v2 · The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This pap…

  97. arXiv cs.AI TIER_1 · Lehan He, Zeren Chen, Zhe Zhang, Xiang Gao, Lu Sheng ·

    Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback

    arXiv:2506.18315v2 · LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are ofte…

  98. arXiv cs.CL TIER_1 · Pawel Kaplanski ·

    Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model f…

  99. arXiv cs.CL TIER_1 · Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang ·

    SCAN: Structured Capability Assessment and Navigation for LLMs

    arXiv:2505.06698v4 · Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model r…

  100. arXiv cs.CL TIER_1 · Ryan Lail, Luke Markham ·

    On Cost-Effective LLM-as-a-Judge Improvement Techniques

    arXiv:2604.13717v2 · Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. H…

  101. arXiv cs.LG TIER_1 · Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang ·

    Rethinking LLM Ensembling from the Perspective of Mixture Models

    arXiv:2605.00419v1 · Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. Th…

  102. arXiv cs.LG TIER_1 · Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan ·

    Generating Statistical Charts with Validation-Driven LLM Workflows

    arXiv:2605.00800v1 · Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely pro…

  103. arXiv cs.CL TIER_1 · Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh ·

    When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    arXiv:2605.00817v1 · Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through…

  104. arXiv cs.CL TIER_1 · Yoav Goldberg ·

    A Language for Describing Agentic LLM Contexts

    Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure pla…

  105. arXiv cs.CL TIER_1 · Mohammad Mohammadi Amiri ·

    RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

    Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within th…

  106. arXiv cs.CL TIER_1 · Mayank Singh ·

    When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedura…

  107. arXiv cs.LG TIER_1 · Blaž Zupan ·

    Generating Statistical Charts with Validation-Driven LLM Workflows

    Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable…

  108. arXiv cs.AI TIER_1 · Muhammad Bilal Zafar ·

    To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, whe…

  109. arXiv cs.AI TIER_1 · Abdurrahman Javat ·

    Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the…

  110. arXiv cs.CL TIER_1 · Xu Yang ·

    Rethinking LLM Ensembling from the Perspective of Mixture Models

    Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large lan…
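
    The conventional recipe the abstract starts from is worth pinning down, since the paper's mixture-model reading departs from it. A minimal sketch of plain distribution averaging (the baseline, not the proposed method):

    ```python
    import numpy as np

    def ensemble_label(distributions, weights=None):
        """Average per-model output distributions over a shared label set and
        pick the most probable label. distributions: (num_models, num_labels)."""
        P = np.asarray(distributions, dtype=float)
        w = np.full(len(P), 1 / len(P)) if weights is None else np.asarray(weights)
        mixture = w @ P
        return int(np.argmax(mixture)), mixture
    ```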

  111. arXiv cs.AI TIER_1 · Jinqiu Yang ·

    Social Bias in LLM-Generated Code: Benchmark and Mitigation

    Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unex…

  112. arXiv cs.AI TIER_1 · Jon-Paul Cacioli ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    arXiv:2604.27405v1 · We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.…

  113. arXiv cs.AI TIER_1 · Ziyao Xu, Cong Wang, Houfeng Wang ·

    Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective

    arXiv:2604.27340v1 · Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sam…

  114. arXiv cs.LG TIER_1 · Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin ·

    REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

    arXiv:2604.27319v1 · Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical …

  115. arXiv cs.LG TIER_1 · Ahan Gupta, Zhihao Wang, Neel Dani, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang ·

    AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

    arXiv:2604.27089v1 · Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy t…

  116. arXiv cs.CL TIER_1 · Solomon Messing ·

    Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

    arXiv:2604.11581v4 · LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phrasing, model temperature, and…

  117. arXiv cs.CL TIER_1 · Jon-Paul Cacioli ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…
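
    The RCI itself is a standard statistic, so the adaptation is easy to picture. Under the usual Jacobson-Truax definition (the mapping of per-item LLM scores onto x1/x2 below is our assumption, not necessarily the paper's exact setup):

    ```python
    import numpy as np

    def reliable_change_index(x1, x2, reliability):
        """Jacobson & Truax (1991): RCI = (x2 - x1) / S_diff, where
        S_diff = sqrt(2 * SE^2) and SE = SD(x1) * sqrt(1 - reliability).
        |RCI| > 1.96 flags change unlikely under measurement error alone."""
        x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
        se = np.std(x1, ddof=1) * np.sqrt(1 - reliability)
        s_diff = np.sqrt(2 * se**2)
        return (x2 - x1) / s_diff
    ```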

  118. Hugging Face Daily Papers TIER_1 ·

    Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

    We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…

  119. arXiv cs.AI TIER_1 · Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas ·

    The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

    arXiv:2508.16131v2 · Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, c…

  120. arXiv cs.AI TIER_1 · Emre Furkan Akyol, Mehmet Dedeler, Eray Tüzün ·

    ImproBR: Bug Report Improver Using LLMs

    arXiv:2604.26142v1 · Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and…

  121. arXiv cs.CL TIER_1 · Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer ·

    Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

    arXiv:2601.03423v3 · Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling…

  122. arXiv cs.CL TIER_1 · Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu ·

    Learning to Ask: When LLM Agents Meet Unclear Instruction

    arXiv:2409.00557v4 · Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of thes…

  123. arXiv cs.CL TIER_1 · Hongyeon Yu, Young-Bum Kim, Yoon Kim ·

    FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

    arXiv:2604.26258v1 · LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building po…

  124. arXiv cs.CL TIER_1 · Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea ·

    One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

    arXiv:2604.25921v1 · Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD…

  125. Hugging Face Daily Papers TIER_1 ·

    AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

    Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy to use abstractions to optimize for long-context …

  126. arXiv cs.LG TIER_1 · Keita Broadwater ·

    Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

    arXiv:2602.11786v2 · Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often expo…

  127. arXiv cs.CL TIER_1 · Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang ·

    Benchmarking and Adapting On-Device LLMs for Clinical Decision Support

    arXiv:2601.03266v2 · Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow…

  128. arXiv cs.CL TIER_1 · Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi ·

    VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

    arXiv:2512.12072v2 · Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper,…

  129. arXiv cs.CL TIER_1 · Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra ·

    Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

    arXiv:2604.25098v1 · While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning method…

  130. arXiv cs.CL TIER_1 · Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    arXiv:2604.25665v1 · Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarizatio…

  131. arXiv cs.CL TIER_1 · Yoon Kim ·

    FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients

    LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. Ho…

  132. arXiv cs.CL TIER_1 · Haihua Chen ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …

  133. Hugging Face Daily Papers TIER_1 ·

    LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

    Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …

  134. arXiv cs.CL TIER_1 · Rohith Reddy Bellibatlu ·

    JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

    arXiv:2604.23478v1 Announce Type: new Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a fra…

  135. arXiv cs.AI TIER_1 · Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens ·

    Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

    arXiv:2511.08484v2 Announce Type: replace Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infre…

  136. arXiv cs.LG TIER_1 · Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo ·

    Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

    arXiv:2509.17314v3 Announce Type: replace-cross Abstract: Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and co…

  137. arXiv cs.LG TIER_1 · Frank Xiao, Santiago Aranguri ·

    Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

    arXiv:2602.11079v3 Announce Type: replace Abstract: We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference…

  138. arXiv cs.LG TIER_1 · Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai ·

    Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs

    arXiv:2602.02556v2 Announce Type: replace Abstract: Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We int…

  139. arXiv cs.LG TIER_1 · Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang ·

    TRINITY: An Evolved LLM Coordinator

    arXiv:2512.04695v3 Announce Type: replace Abstract: Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large langu…

  140. arXiv cs.LG TIER_1 · Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma ·

    Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

    arXiv:2604.23987v1 Announce Type: new Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more …

  141. arXiv cs.LG TIER_1 · Zhengding Hu, Hehua Ouyang, Chang Chen, Zaifeng Pan, Yue Guan, Zhongkai Yu, Zhen Wang, Steven Swanson, Yufei Ding ·

    JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    arXiv:2604.23838v1 Announce Type: new Abstract: We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalan…

  142. arXiv cs.CL TIER_1 · Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi ·

    KLong: Training LLM Agent for Extremely Long-horizon Tasks

    arXiv:2602.17547v3 Announce Type: replace-cross Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. S…

  143. arXiv cs.CL TIER_1 · Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    arXiv:2604.21916v2 Announce Type: replace Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers …

  144. arXiv cs.CL TIER_1 · Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu ·

    What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts

    arXiv:2505.13360v3 Announce Type: replace Abstract: Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), suc…

  145. arXiv cs.CL TIER_1 · Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    arXiv:2604.24544v1 Announce Type: cross Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due t…

  146. arXiv cs.CL TIER_1 · Anshuman Chhabra ·

    Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

    While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing p…

  147. Hugging Face Daily Papers TIER_1 ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…

  148. arXiv cs.CL TIER_1 · Maxim Romanovsky ·

    STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

    The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…

  149. arXiv cs.CL TIER_1 · Sourav Saha, Mandar Mitra, Aditya Dutta ·

    LLMs as Assessors: Right for the Right Reason?

    arXiv:2601.08919v2 Announce Type: replace-cross Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this…

  150. arXiv cs.LG TIER_1 · Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian ·

    CAP: Controllable Alignment Prompting for Unlearning in LLMs

    arXiv:2604.21251v2 Announce Type: replace Abstract: Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-m…

  151. arXiv cs.LG TIER_1 · Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar ·

    Removing Sandbagging in LLMs by Training with Weak Supervision

    arXiv:2604.22082v1 Announce Type: new Abstract: As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap thr…

  152. arXiv cs.AI TIER_1 · Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca ·

    BLAST: Benchmarking LLMs with ASP-based Structured Testing

    arXiv:2604.22306v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has …

  153. arXiv cs.AI TIER_1 · Francesco Ricca ·

    BLAST: Benchmarking LLMs with ASP-based Structured Testing

    Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling decla…

  154. arXiv cs.AI TIER_1 · Vivek Hebbar ·

    Removing Sandbagging in LLMs by Training with Weak Supervision

    As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears ac…

  155. Hugging Face Daily Papers TIER_1 ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…

  156. arXiv cs.CL TIER_1 · Mayur Naik ·

    MathDuels: Evaluating LLMs as Problem Posers and Solvers

    As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…

  157. arXiv cs.LG TIER_1 · Wenhong Tian ·

    CAP: Controllable Alignment Prompting for Unlearning in LLMs

    Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high …

  158. Hugging Face Daily Papers TIER_1 ·

    HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

    Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLMs' performance on thousand-word-scale, open-ended writing is inadequately asses…

  159. Ahead of AI (Sebastian Raschka) TIER_1 · Sebastian Raschka, PhD ·

    Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

    Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples
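
    As a taste of the first approach, a minimal generic multiple-choice scorer (a sketch, not Raschka's code; `ask_model` stands in for any chat-completion call, and `item["answer"]` is assumed to be a gold letter A-D):

    ```python
    # Minimal multiple-choice benchmark scoring: prompt the model to pick a
    # letter, then compare against the gold answer.
    def score_multiple_choice(items, ask_model):
        correct = 0
        for item in items:
            options = "\n".join(f"{letter}. {text}"
                                for letter, text in zip("ABCD", item["choices"]))
            prompt = (f"{item['question']}\n{options}\n"
                      "Answer with a single letter (A-D).")
            reply = ask_model(prompt).strip().upper()
            # Take the first A-D character found, tolerating verbose replies.
            picked = next((ch for ch in reply if ch in "ABCD"), None)
            correct += picked == item["answer"]
        return correct / len(items)
    ```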

  160. Ahead of AI (Sebastian Raschka) TIER_1 · Sebastian Raschka, PhD ·

    Coding LLMs from the Ground Up: A Complete Course

    Why build LLMs from scratch? It's probably the best and most efficient way to learn how LLMs really work. Plus, many readers have told me they had a lot of fun doing it.

  161. LessWrong (AI tag) TIER_1 · Santiago Aranguri ·

    Predicting Rare LLM Failures with 30× Fewer Rollouts

    TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space. Authors: Francisco Pernice (MIT), Santiag…

  162. arXiv stat.ML TIER_1 · James Fiedler ·

    Bias and Uncertainty in LLM-as-a-Judge Estimation

    arXiv:2605.06939v1 Announce Type: cross Abstract: LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed es…

  163. arXiv stat.ML TIER_1 · Nicolas Menet, Andreas Krause, Abbas Rahimi ·

    POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

    arXiv:2605.07775v1 Announce Type: cross Abstract: Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel …
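
    For readers new to the underlying primitive, a minimal Beta-Bernoulli Thompson-sampling loop (generic bandit code under standard assumptions, not the POETS method itself):

    ```python
    import random

    # Generic Beta-Bernoulli Thompson sampling over k arms: sample a plausible
    # success rate per arm from its posterior, play the argmax, update counts.
    def thompson_sampling(pull, k, rounds):
        wins, losses = [1] * k, [1] * k  # Beta(1, 1) priors
        for _ in range(rounds):
            samples = [random.betavariate(wins[i], losses[i]) for i in range(k)]
            arm = samples.index(max(samples))
            if pull(arm):  # pull() returns True on success
                wins[arm] += 1
            else:
                losses[arm] += 1
        return wins, losses
    ```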

  164. arXiv stat.ML TIER_1 · James Fiedler ·

    Bias and Uncertainty in LLM-as-a-Judge Estimation

    LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliabili…
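
    One classical way to see the bias: if the judge's sensitivity and specificity are known from a small labeled set, the naive pass-rate can be corrected with the Rogan-Gladen formula (a textbook misclassification correction shown for illustration, not necessarily the paper's estimator):

    ```python
    # Rogan-Gladen correction applied to a binary LLM judge: the raw pass-rate
    # is biased when the judge has imperfect sensitivity/specificity, both of
    # which can be measured on a small human-labeled calibration set.
    def corrected_pass_rate(naive_rate, sensitivity, specificity):
        return (naive_rate + specificity - 1) / (sensitivity + specificity - 1)

    # Example: judge says 70% pass, but it catches true passes 90% of the
    # time and correctly rejects failures 85% of the time.
    print(corrected_pass_rate(0.70, 0.90, 0.85))  # ~0.733
    ```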

  165. LessWrong (AI tag) TIER_1 · NickyP ·

    Axes of Planning in LLMs + Partial Lit Review

    Epistemic Status: Written over the course of a couple days at Inkhaven (https://inkhaven.blog/). Some of the info is old so some newer papers are excluded. TL;DR: People tal…

  166. arXiv stat.ML TIER_1 · Vaneet Aggarwal ·

    Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level…

  167. arXiv stat.ML TIER_1 · John Sous ·

    Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded Na…

  168. arXiv cs.CV TIER_1 · Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee ·

    From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

    arXiv:2605.00358v1 Announce Type: cross Abstract: LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreadi…

  169. arXiv cs.CV TIER_1 · Wee Sun Lee ·

    From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

    LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used …

  170. LessWrong (AI tag) TIER_1 · Santiago Aranguri ·

    Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

    Introduction. Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoi…

  171. LessWrong (AI tag) TIER_1 · keshavs ·

    Introspection Adapters: Training LLMs to Report Their Learned Behaviors

    Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang. 📄 Paper (https://arxiv.org/pdf/2604.16812), 💻 https…

  172. arXiv cs.CV TIER_1 · Mengyu Wang, Xiaoying Zhi, Zhiyi Li, Robin Schmucker, Shay B. Cohen, Tiejun Ma, Fran Silavong ·

    Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge

    arXiv:2604.22939v1 Announce Type: cross Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this perform…

  173. Smol AINews TIER_1 ·

    Thinking Machines' Tinker: LoRA based LLM fine-tuning API

    **Thinking Machines** raised **$2 billion** before shipping any product; they have now launched their first, **Tinker**, a managed service API for fine-tuning large and mixture-of-experts models like **Qwen-235B-A22B** using **LoRA** for cost-efficient training. The T…

  174. Eugene Yan TIER_1 ·

    AI Engineer 2025 - Improving RecSys & Search with LLM techniques

    Recsys & search are converging with LLMs via semantic IDs, data augmentation, and unified foundation models.

  175. Smol AINews TIER_1 Norsk(NO) ·

    Meta BLT: Tokenizer-free, Byte-level LLM

    **Meta AI** introduces the **Byte Latent Transformer (BLT)**, a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming **Llama 3** on benchmarks including the CUTE benchmark. The model was trained on approximately **1 trill…

  176. Eugene Yan TIER_1 ·

    Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

    Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

  177. Eugene Yan TIER_1 ·

    Task-Specific LLM Evals that Do & Don't Work

    Evals for classification, summarization, translation, copyright regurgitation, and toxicity.

  178. Chip Huyen TIER_1 ·

    Open challenges in LLM research

    [LinkedIn discussion (https://www.linkedin.com/posts/chiphuyen_llm-airesearch-generativeai-activity-7097619722363408385-s5Cp), Twitter thread (https://twitter.com/chipro/status/1691858084824838427)] Never before in my life had I seen so …

  179. Eugene Yan TIER_1 ·

    Patterns for Building LLM-based Systems & Products

    Evals, RAG, fine-tuning, caching, guardrails, defensive UX, and collecting user feedback.

  180. Eugene Yan TIER_1 ·

    Experimenting with LLMs to Research, Reflect, and Plan

    Also, shortcomings in document retrieval and how to overcome them with search & recsys techniques.

  181. Databricks Blog TIER_1 ·

    LLM Vs AI: A Practical Guide to Differences, Use Cases, and Tools

    This guide explains the key differences between large language models and the broader...

  182. AWS Machine Learning Blog TIER_1 · Hemanth Kumar Jayakumar ·

    Reinforcement fine-tuning with LLM-as-a-judge

    In this post, we take a deeper look at how RLAIF, or RL with LLM-as-a-judge, works effectively with Amazon Nova models.

  183. Hamel Husain TIER_1 · Shreya Shankar ·

    LLM Evals: Everything You Need to Know

    This document curates the most common questions Shreya and I received while … (https://bit.ly/evals-ai)

  184. Hamel Husain TIER_1 · Hamel Husain ·

    Using LLM-as-a-Judge For Evaluation: A Complete Guide

    Earlier this year, I wrote Your AI product needs evals (https://hamel.dev/blog/posts/evals/). Many of you …

  185. Hamel Husain TIER_1 · Hamel Husain ·

    An Open Course on LLMs, Led by Practitioners

    Today, we are releasing Mastering LLMs (https://parlance-labs.com/education/), a set of workshops and talk…

  186. Hacker News — AI stories ≥50 points TIER_1 · khurdula ·

    Show HN: A new benchmark for testing LLMs for deterministic outputs
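
    The core measurement such a benchmark needs is simple to sketch (illustrative, not the submitter's harness; `ask_model` is a hypothetical client call):

    ```python
    from collections import Counter

    # Determinism check: call the model n times with identical inputs and
    # report the share of outputs matching the most common completion.
    def determinism_rate(ask_model, prompt, n=10):
        outputs = [ask_model(prompt, temperature=0.0) for _ in range(n)]
        top_count = Counter(outputs).most_common(1)[0][1]
        return top_count / n
    ```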

  187. HN — claude-code stories TIER_1 · mufeedvh ·

    N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

  188. HN — AI infrastructure stories TIER_1 · cgorlla ·

    Launch HN: Mentat (YC F24) – Controlling LLMs with Runtime Intervention

  189. HN — AI infrastructure stories TIER_1 · diptanu ·

    Show HN: Open-source real time data framework for LLM applications

  190. Practical AI TIER_1 · Practical AI LLC ·

    Collaboration & evaluation for LLM apps

    Small changes in prompts can create large changes in the output behavior of generative AI models. Add to that the confusion around proper evaluation of LLM applications, and you have a recipe for confusion and frustration. Raza and the Humanloop team have been diving into thes…

  191. Medium — MLOps tag TIER_1 · Siddhartha Pramanik ·

    Building a Prompt Regression Suite for Our Customer-Facing LLM App

    <div class="medium-feed-item"><p class="medium-feed-link"><a href="https://pub.aimind.so/building-a-prompt-regression-suite-for-our-customer-facing-llm-app-22f0b27b7301?source=rss------mlops-5">Continue reading on AI Mind »</a></p></div>

  192. Towards AI TIER_1 · Akshit Kothari ·

    Decoding LLMs — Part 2: A Step-by-Step Journey Into the Mind of Modern AI

    <div class="medium-feed-item"><p class="medium-feed-image"><a href="https://pub.towardsai.net/decoding-llms-part-2-a-step-by-step-journey-into-the-mind-of-modern-aie-882e9f39e371?source=rss----98111c9905da---4"><img src="https://cdn-images-1.medium.com/max/1656/1*jOp9pKrjWuAYXGvT…

  193. Medium — fine-tuning tag TIER_1 Bahasa(ID) · dita feby indriani ·

    Getting to Know LoRA, QLoRA, and PEFT in LLM Fine-Tuning

    <div class="medium-feed-item"><p class="medium-feed-snippet">Perkembangan Large Language Models (LLM) seperti GPT, LLaMA, dan Mistral membuka banyak peluang dalam pengembangan aplikasi berbasis&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@ditafebyindriani14…

  194. dev.to — MCP tag TIER_1 · Mukunda Rao Katta ·

    Six Reliability Primitives for LLM Agents

    Reliability concerns for LLM agents are typically bundled into one heavy framework that asks you to adopt prompting, tool routing, and runtime governance as a single dependency. Production teams want them à la carte. They want small primitives they can drop in around existing …

  195. Towards AI TIER_1 · Ishwar Ambare ·

    HuggingFace Pipeline & Open-Source LLMs

    Part 3. GenAI Practical Session — Detailed Notes. Source: Lecture Transcript + HuggingFace Pipeline Docs (https://huggingface.co/docs/transformers/pipeline_tutorial) + Hug…

  196. dev.to — MCP tag TIER_1 · Tony Loehr ·

    The 55.6% problem: why frontier LLMs fail at embedded code

    55.6%. That's DeepSeek-R1's pass@1 on EmbedBench when it gets a circuit schematic alongside the task description. 50.0% without the schematic. Best score from the best reasoning model on the first comprehensive benchmark for LLMs in embedded systems dev…
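
    For reference, pass@1 over n generations with c successes is the standard unbiased pass@k estimator from Chen et al. (2021) at k=1 (that EmbedBench uses this exact estimator is an assumption):

    ```python
    from math import comb

    # Unbiased pass@k estimator: probability that at least one of k samples
    # drawn from n generations (c of them correct) passes.
    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # pass@1 reduces to c / n:
    assert abs(pass_at_k(10, 5, 1) - 0.5) < 1e-9
    ```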

  197. Lobsters — AI tag TIER_1 · pipevals.com by gesposito ·

    Pipevals: Evaluation pipelines for every LLM application

    <p><a href="https://lobste.rs/s/iexiw9/pipevals_evaluation_pipelines_for_every">Comments</a></p>

  198. HN — AI startup stories TIER_1 · felix089 ·

    Show HN: FinetuneDB – AI fine-tuning platform to create custom LLMs

  199. dev.to — LLM tag TIER_1 · Prakhar Singh ·

    Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

    If you cannot measure it, you cannot route it. Why offline evaluation is the difference between a code reviewer that improves over time and one the team dismisses within a sprint. Chat evaluations are vibes-based: thumbs-up on "was this helpfu…

  200. dev.to — LLM tag TIER_1 Deutsch(DE) · 丁久 ·

    LLM Fine-Tuning Strategies and Techniques

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tuning-strategies.html). For the full version with working code examples and related articles, visit the original post.

  201. dev.to — LLM tag TIER_1 · 丁久 ·

    Prompt Chaining: Building Multi-Step LLM Workflows

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/ai-prompt-chaining.html). For the full version with working code examples and related articles, visit the original post.
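
    The pattern itself fits in a few lines; a generic two-step sketch (`ask_model` is a hypothetical stand-in, not code from the post):

    ```python
    # Prompt chaining: each step's output becomes the next step's input,
    # keeping every individual prompt small and checkable.
    def summarize_then_translate(ask_model, document: str) -> str:
        summary = ask_model(
            f"Summarize in 3 bullet points:\n\n{document}")
        return ask_model(
            f"Translate these bullet points into French:\n\n{summary}")
    ```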

  202. dev.to — LLM tag TIER_1 · Vikrant Shukla ·

    The Softmax Bottleneck: Why Making LLMs Bigger Doesn't Always Make Them Smarter

    When researchers scale a language model — more parameters, more layers, wider hidden dimensions — there's an implicit assumption: a bigger model can represent more things. More expressiveness, more knowledge, better predictions. Mostly this is true. But there's a structural ce…

  203. dev.to — LLM tag TIER_1 · Adnan Latif ·

    Scaling LLM + Vector DB Systems in Production: Lessons from the Trenches

    <p><a class="article-body-image-wrapper" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3058op9ajg2qx39n30h.png"><img alt="Cover Image" height="533" …

  204. dev.to — LLM tag TIER_1 · 蔡俊鹏 ·

    Run Open-Source LLMs Locally: From Ollama to DeepSeek and Build Your Private AI

    Foreword. In 2026, open-source LLMs aren't lab experiments anymore. Meta's Llama 4, Alibaba's Qwen 3, DeepSeek-R1 from China — they've caught up with or beaten closed-source models on many benchmarks. And thanks to tools like Ollama and llama.cpp, anyone with a mid-r…

  205. dev.to — LLM tag TIER_1 · Vikrant Shukla ·

    Lost in the Middle: Why LLMs Quietly Ignore the Centre of Their Own Context Window

    Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the mid…

  206. dev.to — LLM tag TIER_1 · 丁久 ·

    LLM Evaluation and Benchmarking Guide 2026: Beyond Simple Evals

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/llm-evaluation-benchmarks.html). For the full version with working code examples and related articles, visit the original post.

  207. dev.to — LLM tag TIER_1 · 丁久 ·

    LLM Function Calling: Complete Developer Guide with Code Examples

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/function-calling-guide.html). For the full version with working code examples and related articles, visit the original post.
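
    The core loop is provider-agnostic; a minimal sketch (the `chat` callable and its `tool_call` reply shape are assumptions, not any specific vendor's API):

    ```python
    import json

    def get_weather(city: str) -> str:          # stub tool for the sketch
        return json.dumps({"city": city, "temp_c": 21})

    TOOLS = {"get_weather": get_weather}

    # Core loop: send messages, execute any tool call the model requests,
    # append the result, and repeat until the model answers in plain text.
    def run(chat, messages):
        while True:
            reply = chat(messages)               # hypothetical client call
            call = reply.get("tool_call")        # {"name": ..., "arguments": ...}
            if call is None:
                return reply["content"]
            messages.append({"role": "assistant", "tool_call": call})
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    ```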

  208. dev.to — LLM tag TIER_1 · 丁久 ·

    Fine-Tuning Open Source LLMs: A Developer's Practical Guide (2026)

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tune-open-source-llm.html). For the full version with working code examples and related articles, visit the original post.

  209. dev.to — LLM tag TIER_1 · Alan West ·

    Debugging confidently wrong answers from LLM-powered features

    The bug that took two weeks to surface. A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are t…

  210. dev.to — LLM tag TIER_1 · Nitin Srivastava ·

    Bulletproofing LLM Structured Output in Python: Healing Retries, Cost Caps, and Drift Detection (Runnable Code)

    I shipped a structured-output endpoint to production in March. The schema was clean, JSON mode was on, the model was GPT-4.1, the eval suite was green. Three weeks in, the on-call channel lit up because a downstream billing job had silently skipped 4,200 records over a weekend…
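
    A minimal sketch of the healing-retry piece of that stack (the validator, attempt cap, and prompt wording here are illustrative assumptions, not the post's runnable code):

    ```python
    import json

    # Healing retries: on invalid JSON, feed the error back and retry, with a
    # hard cap on attempts so a bad schema can't burn unbounded tokens.
    def structured_call(ask_model, prompt, validate, max_attempts=3):
        messages = [{"role": "user", "content": prompt}]
        for _ in range(max_attempts):
            raw = ask_model(messages)
            try:
                data = json.loads(raw)
                validate(data)          # e.g. a jsonschema or pydantic check
                return data
            except Exception as err:    # heal: show the model its own error
                messages.append({"role": "assistant", "content": raw})
                messages.append({"role": "user",
                                 "content": f"Invalid output ({err}). "
                                            "Return corrected JSON only."})
        raise RuntimeError("structured output failed after retries")
    ```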

  211. dev.to — LLM tag TIER_1 · BN ·

    Deterministic reliability stack for LLM pipelines

    I have been spending the last few months wiring up a deterministic reliability stack for structured LLM pipelines. Today, LLM Contract Check (locc) and Release Governor went live on PyPI. EGA went live last week. The stack is straightforward: LLM Contract C…

  212. dev.to — LLM tag TIER_1 · Machine coding Master ·

    Stop Guessing Your RAG Quality: Automating Faithfulness Metrics with Spring AI and LLM-as-a-Judge

    Stop Shipping Hallucinations: Automating RAG Faithfulness with Spring AI 1.2. If you’re still "vibe-checking" your RAG outputs in 2026, you’re not an engineer; you’re a gambler. Enterprise-grade AI isn't about getting a cool demo—it's about proving your model isn't h…
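
    The post's implementation is Spring AI (Java); the judge call itself is language-agnostic, sketched here in Python with a hypothetical `ask_judge` function and illustrative prompt wording:

    ```python
    JUDGE_PROMPT = """You are a strict evaluator. Given CONTEXT and ANSWER,
    reply with only FAITHFUL if every claim in ANSWER is supported by CONTEXT,
    otherwise UNFAITHFUL.

    CONTEXT:
    {context}

    ANSWER:
    {answer}"""

    # Gate a RAG response before it ships: block anything the judge flags.
    def is_faithful(ask_judge, context: str, answer: str) -> bool:
        verdict = ask_judge(JUDGE_PROMPT.format(context=context, answer=answer))
        return verdict.strip().upper().startswith("FAITHFUL")
    ```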

  213. dev.to — LLM tag TIER_1 · Rob ·

    Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task

    Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough? Not good enough in the abstract benchmarks-on-a-leaderbo…

  214. dev.to — LLM tag TIER_1 · Rob ·

    Putting the GPU to Work: Running Local LLMs on a Home Lab

    <p><a href="https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment">Yesterday</a> we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…

  215. dev.to — LLM tag TIER_1 · Rob ·

    Putting the GPU to Work: Running Local LLMs on a Home Lab

    <p><a href="https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment">Yesterday</a> we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…

  216. dev.to — LLM tag TIER_1 · Rob ·

    Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task

    Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough? Not good enough in the abstract benchmarks-on-a-leaderbo…

  217. dev.to — LLM tag TIER_1 · Nitin Srivastava ·

    Building a Production LLM Evaluation Harness in Pytest: Cost-Bounded, Flake-Aware, CI-Gated (Runnable Python)

    I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer. The tests were green. But they were green for the wrong reason — every assertion was a single LLM call against a…
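
    The flake-aware piece reduces to voting across repeated calls instead of asserting on one; a minimal pytest-style sketch (names, stub, and thresholds are illustrative, not the author's harness):

    ```python
    # Flake-aware assertion: sample the model several times and require a
    # majority of runs to pass, so one lucky completion can't green-light CI.
    def majority_passes(ask_model, prompt, check, runs=5, threshold=0.6):
        passes = sum(check(ask_model(prompt)) for _ in range(runs))
        return passes / runs >= threshold

    def stub_model(prompt: str) -> str:   # stand-in; swap in a real client
        return "Refunds are accepted within 30 days of purchase."

    def test_refund_policy_answer():
        assert majority_passes(
            stub_model,
            "What is our refund window? Answer in one sentence.",
            check=lambda out: "30 days" in out,
        )
    ```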

  218. dev.to — LLM tag TIER_1 · NaveenKumar Namachivayam ⚡ ·

    Beyond the Hype: A Comprehensive Guide to Benchmarking LLMs with AWS Labs’ LLMeter

    <p id="p-rc_9231198f56807c04-27">In the current AI gold rush, the conversation has shifted from "Can it do the task?" to "How efficiently can it do the task?" For engineers moving Large Language Models (LLMs) into production, the "vibe check" is no longer sufficient. You need har…

  219. dev.to — LLM tag TIER_1 · Gabriel Anhaia ·

    LLM Response Caching: When the 80/20 Hit Rate Saves the Bill

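    The economics in the title come from keying responses on everything that affects the output; a minimal illustrative sketch (not the author's code, and only safe for deterministic settings):

    ```python
    import hashlib, json

    _cache: dict[str, str] = {}

    # Cache key must cover every input that changes the completion: model,
    # params, and the full prompt. With sampling (temperature > 0), a cache
    # would pin one arbitrary answer, so this assumes temperature=0.
    def cached_completion(ask_model, model, prompt, temperature=0.0):
        key = hashlib.sha256(json.dumps(
            [model, prompt, temperature], sort_keys=True).encode()).hexdigest()
        if key not in _cache:
            _cache[key] = ask_model(model=model, prompt=prompt,
                                    temperature=temperature)
        return _cache[key]
    ```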

  220. r/Anthropic TIER_1 · /u/RJSabouhi ·

    Resource: source-boundary failures in LLM evidence use

  221. Mastodon — mastodon.social TIER_1 · aihaberleri ·

    📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability

    📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability Systematic prompting is transforming how developers engineer LLM interactions, with negative constraints, structured JSON outputs, and multi-hypothesis sampling emerging as critical techniq…
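
    A compact illustration of two of the named techniques, negative constraints plus a JSON shape the caller can enforce (the prompt wording and field names are assumptions, not from the article):

    ```python
    import json

    # Negative constraints tell the model what NOT to do; the declared JSON
    # shape gives downstream code a contract it can actually enforce.
    SYSTEM_PROMPT = (
        "Extract the invoice fields.\n"
        "Do NOT guess missing values; use null instead.\n"
        "Do NOT include any text outside the JSON object.\n"
        'Return exactly: {"vendor": string|null, "total": number|null}'
    )

    def parse_reply(raw: str) -> dict:
        data = json.loads(raw)                   # fails loudly on stray prose
        assert set(data) == {"vendor", "total"}  # enforce the declared shape
        return data
    ```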

  222. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 Systematic Prompt Engineering 2026: Negative Constraints, JSON Outputs, and Multi-Hypothesis Methods

    📰 Systematic Prompt Engineering 2026: Negative Constraints, JSON Outputs, and Multi-Hypothesis Methods. For AI developers, systematic prompt engineering is less about just asking questions and more about learning how to shape the answer. Negative constraints, structured JSON outpu…