Google advances LLM alignment and accuracy, Hugging Face explores multi-LLM collaboration
By PulseAugur Editorial ·
Summary by gemini-2.5-flash-lite
from 222 sources
Google Research has introduced a new framework to evaluate the alignment of behavioral dispositions in large language models, adapting established psychological assessments into situational judgment tests. This approach quantifies model tendencies against human social inclinations, identifying deviations from human consensus. Separately, Google Research also developed SLED (Self Logits Evolution Decoding), a novel method that enhances LLM factuality by utilizing all model layers rather than just the final one, without requiring external data or fine-tuning (a minimal sketch of the layer-averaging idea appears after this summary block).
AI
IMPACT
New methods for evaluating LLM alignment and improving factuality could lead to more reliable and trustworthy AI systems in various applications.
RANK_REASON
The cluster contains two research papers from Google Research detailing new methods for evaluating LLM alignment and improving LLM factuality.
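For readers who want the mechanics: SLED operates in logit space across layers. The sketch below is a minimal illustration of the general idea only (project every layer's final hidden state through the shared LM head and nudge the final-layer logits toward the layer average); the exact SLED update rule is more involved, and all names here are illustrative.

```python
# Minimal sketch of decoding with logits from every layer: the general idea
# behind SLED, NOT Google's exact update rule. All names are illustrative.
import numpy as np

def blended_next_token(hidden_states, W_unembed, alpha=0.1):
    """hidden_states: (n_layers, d_model) final-position activations per layer.
    W_unembed: (d_model, vocab_size) shared output projection (LM head)."""
    logits_per_layer = hidden_states @ W_unembed        # (n_layers, vocab_size)
    final_logits = logits_per_layer[-1]
    early_avg = logits_per_layer[:-1].mean(axis=0)      # signal from earlier layers
    blended = (1 - alpha) * final_logits + alpha * early_avg
    probs = np.exp(blended - blended.max())             # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```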
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context lea…
Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt pro…
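The prediction-set machinery this abstract builds on is split conformal prediction. A minimal sketch of the standard construction under the usual exchangeability assumption (not the paper's code-generation adaptation):

```python
# Split conformal prediction: calibrate a nonconformity threshold on held-out
# data, then keep every candidate that scores below it.
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """cal_scores: nonconformity scores (e.g. 1 - model probability of the
    correct output) on a calibration set. Returns the (1 - alpha) threshold."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level)

def prediction_set(candidate_scores, threshold):
    """Indices of candidates within the threshold; covers the true output with
    probability >= 1 - alpha under exchangeability."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]
```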
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing…
The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to…
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve eff…
Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristic…
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for …
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems.…
Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via pr…
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax op…
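The effect the abstract names is easy to reproduce: restricting the softmax to a small label set discards the off-label probability mass, so tiny absolute probabilities are inflated into confident-looking ones. A toy illustration (the numbers are made up, not the paper's):

```python
# Toy illustration of renormalization under constrained decoding: two target
# labels carry ~4% absolute mass each, but renormalizing over just those two
# reports a confident-looking 52/48 split.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.9, 5.0])   # label A, label B, dominant off-label token
full = softmax(logits)               # A ~= 0.045, B ~= 0.041 under the full vocab
restricted = softmax(logits[:2])     # A ~= 0.52, B ~= 0.48 after renormalization
print(full[:2], restricted)
```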
Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilis…
Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fun…
Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals an…
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been show…
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel framework that bridges uncertainty quantification …
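For context, the baseline POETS generalizes is plain Thompson sampling; the sketch below shows the Bernoulli-bandit version only (the abstract is truncated before the ensemble construction, which is not reproduced here).

```python
# Plain Bernoulli Thompson sampling: sample a plausible success rate per arm
# from its Beta posterior and pull the argmax arm.
import numpy as np

def thompson_step(successes, failures, rng=None):
    """successes/failures: per-arm counts defining Beta posteriors."""
    rng = rng or np.random.default_rng()
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))
```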
This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared …
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existi…
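For context, the importance-sampling ratio in question is the probability ratio between the current and behavior policy; PPO-style clipping is one common design choice. A minimal sketch under that assumption (the paper's own ratio design is not reproduced here):

```python
# Clipped importance-sampling surrogate, one standard design for off-policy
# policy gradients; illustrative, not the ratio proposed in the paper.
import numpy as np

def clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))          # pessimistic surrogate
```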
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for sur…
arXiv cs.CL
TIER_1·Ruben Fernandez-Boullon, David N. Olivieri·
arXiv:2605.06480v1 Announce Type: cross Abstract: Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high…
arXiv cs.AI
TIER_1·Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit·
arXiv:2605.06054v1 Announce Type: new Abstract: Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which …
arXiv cs.AI
TIER_1·Nguyen Viet Tuan Kiet, Bui Dinh Pham, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh·
arXiv:2605.06123v1 Announce Type: new Abstract: Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search o…
arXiv:2605.06455v1 Announce Type: new Abstract: Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored e…
arXiv:2605.05267v1 Announce Type: cross Abstract: Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empi…
arXiv cs.AI
TIER_1·Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao·
arXiv:2605.06111v1 Announce Type: cross Abstract: Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified mul…
arXiv:2605.06279v1 Announce Type: cross Abstract: Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version ch…
arXiv:2605.05485v1 Announce Type: new Abstract: LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic …
arXiv cs.LG
TIER_1·Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng·
arXiv:2605.05957v1 Announce Type: new Abstract: LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode "correction suppression" and cons…
arXiv:2605.05973v1 Announce Type: cross Abstract: Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy proc…
arXiv cs.LG
TIER_1·Florian A. D. Burnat, Brittany I. Davidson·
arXiv:2605.06327v1 Announce Type: cross Abstract: Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context…
arXiv:2605.06334v1 Announce Type: cross Abstract: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is chall…
arXiv cs.LG
TIER_1·Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian·
arXiv:2508.06412v3 Announce Type: replace Abstract: Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency …
arXiv cs.LG
TIER_1·Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei·
arXiv:2601.20375v2 Announce Type: replace Abstract: Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (…
arXiv cs.LG
TIER_1·Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov·
arXiv:2512.09538v2 Announce Type: replace-cross Abstract: Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring …
arXiv cs.LG
TIER_1·Andy Zeyi Liu, Elliot Paquette, John Sous·
arXiv:2605.05683v1 Announce Type: cross Abstract: Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family…
arXiv cs.LG
TIER_1·Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler·
arXiv:2605.06652v1 Announce Type: new Abstract: Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and …
arXiv:2605.06350v1 Announce Type: new Abstract: Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical…
arXiv:2605.06166v1 Announce Type: new Abstract: In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly…
arXiv cs.LG
TIER_1·Jonas Bayer, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Michael Tautschnig, Soonho Kong·
arXiv:2605.06184v1 Announce Type: cross Abstract: We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We fin…
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-base…
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are diffi…
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM…
Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the ge…
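The mechanism being analyzed is simple to state in code. A minimal sketch, where `cheap_model` and `expensive_model` are assumed callables returning (answer, confidence) pairs, and `tau` is the deferral threshold the paper studies theoretically rather than treating as a tuned hyperparameter:

```python
# Confidence-thresholded two-model cascade: the cheap model answers unless its
# confidence falls below tau, in which case the query defers to the expensive
# model. Models are assumed callables; no real API is implied.
def cascade(query, cheap_model, expensive_model, tau=0.8):
    answer, confidence = cheap_model(query)   # e.g. max token prob or verbalized score
    if confidence >= tau:
        return answer, "cheap"
    answer, _ = expensive_model(query)
    return answer, "expensive"
```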
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans i…
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in…
arXiv:2605.04764v1 Announce Type: new Abstract: Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under …
arXiv cs.CL
TIER_1·Sruly Rosenblat, Tim O'Reilly, Ilan Strauss·
arXiv:2505.00020v2 Announce Type: replace Abstract: Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models show recognition of copyrighted content. Our r…
arXiv:2602.05890v2 Announce Type: replace Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods …
arXiv:2511.01202v3 Announce Type: replace-cross Abstract: Despite the unprecedented empirical triumphs of LLMs across diverse real-world applications, the prevailing research paradigm remains overwhelmingly heuristic and experimentally driven, inextricably tethered to astronomica…
arXiv:2605.03227v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically …
arXiv:2605.04572v1 Announce Type: cross Abstract: Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain…
arXiv cs.LG
TIER_1·Sumeet Ramesh Motwani, Chuan Du, Aleksander Petrov, Christopher Davis, Philip Torr, Antonio Papania-Davis, Weishi Yan·
arXiv:2604.16804v2 Announce Type: replace Abstract: Optimization problems are central to decision-making in manufacturing, logistics, scheduling, and other industrial settings. Translating complicated descriptions of these problems into solver-ready formulations requires speciali…
arXiv cs.LG
TIER_1·Luze Sun, Alina Oprea, Eric Wong·
arXiv:2602.00305v2 Announce Type: replace-cross Abstract: LLM-based vulnerability detectors are increasingly deployed in CI/CD security gating, yet their resilience to evasion under syntax- and compilation-preserving edits remains poorly understood. We evaluate five attack varian…
arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization.…
Large language models are increasingly used as surrogate models for low-data optimization, but their optimizer-facing prediction and its uncertainty remain poorly understood. We study the surrogate belief elicited from an LLM under sparse observations, showing that it depends str…
Safety alignment of Large Language Models (LLMs) is extremely fragile, as fine-tuning on a small number of benign samples can erase safety behaviors learned from millions of preference examples. Existing studies attempt to explain this phenomenon by comparing parameters and hidde…
arXiv:2605.03379v1 Announce Type: new Abstract: Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeat…
arXiv:2602.14012v2 Announce Type: replace-cross Abstract: The integration of LLMs into vulnerability detection (VD) has shifted the field toward more interpretable and context-aware analysis. While post-training techniques have shown promise in general coding tasks, their systema…
arXiv cs.AI
TIER_1·Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, Mojtaba Shahin, Arif Ali Khan, Qiong Feng·
arXiv:2605.01392v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human …
arXiv:2605.01847v1 Announce Type: new Abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment in…
arXiv cs.LG
TIER_1·Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita·
arXiv:2605.03441v1 Announce Type: cross Abstract: Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using for…
arXiv:2502.04419v3 Announce Type: replace Abstract: Generating synthetic datasets via large language models (LLMs) has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases in their training data, leading to a critical challenge: when…
arXiv:2603.19294v2 Announce Type: replace Abstract: While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new high…
arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance …
arXiv cs.LG
TIER_1·Shannon K. Gallagher, Swati Rallapalli, Tyler Brooks, Chuck Loughin, Michele Sezgin, Ronald Yurko·
arXiv:2605.02930v1 Announce Type: cross Abstract: Evolutionary methods have long been useful for analysis and explanation in genetics, biology, ecology, and related fields. In this work, we extend these methods to neural networks, specifically large language models (LLMs), to bet…
arXiv cs.CL
TIER_1·Richard A. A. Jonker, Alexander Christiansen, Alexandros Maniatis, Rúben Garrido, Rogério Braunschweiger de Freitas Lima, Roman Jurowetzki, Sérgio Matos·
arXiv:2605.03618v1 Announce Type: new Abstract: This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of trai…
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial proces…
This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraint…
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quan…
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls.…
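The standard object of study here is the unbiased pass@k estimator of Chen et al.; the paper's conditional-i.i.d. analysis goes beyond it, but the sketch below fixes what "the binary correctness layer" computes:

```python
# Unbiased pass@k: given n i.i.d. samples of which c are correct, the
# probability that at least one of k randomly drawn samples is correct.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:        # fewer incorrect samples than draws: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```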
arXiv:2601.06116v3 Announce Type: replace-cross Abstract: Generative AI models reproduce the human biases in their training data and further amplify them through mechanisms such as mode collapse. The loss of diversity produces homogenization, which not only harms the minoritized …
arXiv cs.CL
TIER_1·Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Goa, Juming Xiong, Zhijun Yin, Bradley A. Malin·
arXiv:2605.01011v1 Announce Type: new Abstract: Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) fr…
arXiv:2605.01350v1 Announce Type: new Abstract: Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detect…
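For context, the detection statistic in the common green-list watermarking design (e.g. Kirchenbauer et al.) is a one-proportion z-test; the abstract's specific scheme is truncated, so treat this as illustrative only:

```python
# Green-list watermark detection: count how many generated tokens fall on the
# (pseudorandom) green list and z-test against the chance rate gamma.
import math

def watermark_z(green_hits, total_tokens, gamma=0.5):
    """gamma: fraction of the vocabulary placed on the green list per step.
    A large positive z means far more green tokens than chance."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1.0 - gamma)
    return (green_hits - expected) / math.sqrt(variance)
```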
arXiv cs.CL
TIER_1·Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Schol·
arXiv:2605.01417v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavil…
arXiv cs.CL
TIER_1·Sadia Asif, Mohammad Mohammadi Amiri·
arXiv:2605.01913v1 Announce Type: cross Abstract: Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features a…
arXiv cs.CL
TIER_1·Noga Peleg Pelc, Gal A. Kaminka, Yoav Goldberg·
arXiv:2605.01920v1 Announce Type: cross Abstract: Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The desig…
arXiv cs.CL
TIER_1·Pawel Kaplanski (Kaplanski AI Lab)·
arXiv:2605.02236v1 Announce Type: cross Abstract: Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in…
arXiv:2603.01865v3 Announce Type: replace Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be averaged out by increasing the number of scenarios or generations. These biases are o…
arXiv:2604.17010v2 Announce Type: replace Abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validati…
arXiv cs.CL
TIER_1·Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin·
arXiv:2506.13727v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mech…
arXiv cs.LG
TIER_1·Nickil Maveli, Antonio Vergari, Shay B. Cohen·
arXiv:2601.13398v2 Announce Type: replace Abstract: LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that…
arXiv:2603.23985v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by removing entire dimensions …
arXiv:2602.11083v2 Announce Type: replace Abstract: Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to a…
arXiv cs.AI
TIER_1·Qinyuan Wu, Soumi Das, Mahsa Amani, Arijit Nag, Seungeon Lee, Krishna P. Gummadi, Abhilasha Ravichander, Muhammad Bilal Zafar·
arXiv:2605.00737v1 Announce Type: new Abstract: Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM d…
arXiv cs.AI
TIER_1·Fazle Rabbi, Lin Ling, Song Wang, Jinqiu Yang·
arXiv:2605.00382v2 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leav…
arXiv cs.AI
TIER_1·Abdurrahman Javat, Allan Kazakov·
arXiv:2605.00519v2 Announce Type: cross Abstract: The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This pap…
arXiv:2506.18315v2 Announce Type: replace-cross Abstract: LLMs excel at code generation, yet ensuring the functional correctness of their outputs remains a persistent challenge. While recent studies have applied Test-Driven Development (TDD) to refine code, these methods are ofte…
Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model f…
arXiv:2505.06698v4 Announce Type: replace Abstract: Evaluating Large Language Models (LLMs) has become increasingly important, with automatic evaluation benchmarks gaining prominence as alternatives to human evaluation. While existing research has focused on approximating model r…
arXiv cs.CL
TIER_1·Ryan Lail, Luke Markham·
arXiv:2604.13717v2 Announce Type: replace Abstract: Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. H…
arXiv:2605.00419v1 Announce Type: new Abstract: Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. Th…
arXiv cs.LG
TIER_1·Pavlin G. Poličar, Andraž Pevcin, Blaž Zupan·
arXiv:2605.00800v1 Announce Type: new Abstract: Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely pro…
arXiv:2605.00817v1 Announce Type: new Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through…
Large language models are increasingly used within larger systems ("LLM agents"). These make a sequence of LLM calls, each call providing the LLM with a combination of instructions, observations, and interaction history. The design of the encoded information and its structure pla…
Fine-tuning safety-aligned language models for downstream tasks often leads to substantial degradation of refusal behavior, making models vulnerable to adversarial misuse. While prior work has shown that safety-relevant features are encoded in structured representations within th…
Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedura…
Generating diverse, readable statistical charts from tabular data remains challenging for LLMs, as many failures become apparent after rendering and are not detectable from data or code alone. Existing chart datasets also rarely provide fully aligned artifacts, such as executable…
Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether to call or not call a tool, whe…
The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper presents a systematic empirical analysis of the…
Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large lan…
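The conventional baseline this abstract starts from is one line of arithmetic; a minimal sketch:

```python
# Conventional ensembling: average the models' output distributions and take
# the most probable label.
import numpy as np

def ensemble_predict(distributions):
    """distributions: (n_models, n_labels) array of per-model probabilities."""
    avg = distributions.mean(axis=0)
    return int(np.argmax(avg)), avg
```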
Large Language Models (LLMs) are increasingly deployed to generate code for human-centered applications where demographic fairness is critical. However, existing evaluations focus almost exclusively on functional correctness, leaving social bias in LLM-generated code largely unex…
arXiv:2604.27405v1 Announce Type: cross Abstract: We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.…
arXiv:2604.27340v1 Announce Type: new Abstract: Compositional generalization tests are often used to estimate the compositionality of LLMs. However, such tests have the following limitations: (1) they only focus on the output results without considering LLMs' understanding of sam…
arXiv:2604.27319v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical …
arXiv:2604.27089v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate enormous utility in long-context tasks that require processing prompts of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy t…
arXiv:2604.11581v4 Announce Type: replace Abstract: LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet standard confidence intervals ignore variability from prompt phrasing, model temperature, and…
We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). O…
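The RCI itself is a short formula: a version-to-version difference score scaled by the standard error of the difference, with reliability estimated here from the K=10 repeated samples per item. A minimal sketch of the Jacobson and Truax form:

```python
# Reliable Change Index (Jacobson & Truax, 1991): |RCI| > 1.96 indicates a
# change unlikely to be measurement noise at the 5% level.
import math

def rci(score_v1, score_v2, sd, reliability):
    sem = sd * math.sqrt(1.0 - reliability)   # standard error of measurement
    se_diff = math.sqrt(2.0) * sem            # SE of the difference score
    return (score_v2 - score_v1) / se_diff
```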
arXiv:2508.16131v2 Announce Type: replace-cross Abstract: Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, c…
arXiv cs.AI
TIER_1·Emre Furkan Akyol, Mehmet Dedeler, Eray Tüzün·
arXiv:2604.26142v1 Announce Type: cross Abstract: Bug tracking systems play a crucial role in software maintenance, yet developers frequently struggle with low-quality user-submitted reports that omit essential details such as Steps to Reproduce (S2R), Observed Behavior (OB), and…
arXiv cs.CL
TIER_1·Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer·
arXiv:2601.03423v3 Announce Type: replace Abstract: Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling…
arXiv cs.CL
TIER_1·Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu·
arXiv:2409.00557v4 Announce Type: replace Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of thes…
arXiv cs.CL
TIER_1·Hongyeon Yu, Young-Bum Kim, Yoon Kim·
arXiv:2604.26258v1 Announce Type: new Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building po…
arXiv cs.CL
TIER_1·Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea·
arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD…
Large language models (LLMs) demonstrate enormous utility in long-context tasks that require processing prompts of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy-to-use abstractions to optimize for long-context …
arXiv:2602.11786v2 Announce Type: replace Abstract: Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often expo…
arXiv cs.CL
TIER_1·Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang·
arXiv:2601.03266v2 Announce Type: replace Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow…
arXiv:2512.12072v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper,…
arXiv:2604.25098v1 Announce Type: cross Abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning method…
arXiv:2604.25665v1 Announce Type: new Abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarizatio…
LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. Ho…
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven …
arXiv:2604.23478v1 Announce Type: new Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a fra…
arXiv:2511.08484v2 Announce Type: replace Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infre…
arXiv cs.LG
TIER_1·Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo·
arXiv:2509.17314v3 Announce Type: replace-cross Abstract: Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and co…
arXiv:2602.11079v3 Announce Type: replace Abstract: We propose probe-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference…
arXiv:2602.02556v2 Announce Type: replace Abstract: Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We int…
arXiv cs.LG
TIER_1·Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang·
arXiv:2512.04695v3 Announce Type: replace Abstract: Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large langu…
arXiv:2604.23987v1 Announce Type: new Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more …
arXiv:2604.23838v1 Announce Type: new Abstract: We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalan…
arXiv:2602.17547v3 Announce Type: replace-cross Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. S…
arXiv:2604.21916v2 Announce Type: replace Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers …
arXiv cs.CL
TIER_1·Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu·
arXiv:2505.13360v3 Announce Type: replace Abstract: Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), suc…
arXiv:2604.24544v1 Announce Type: cross Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due t…
While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing p…
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and t…
arXiv:2601.08919v2 Announce Type: replace-cross Abstract: A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this…
arXiv:2604.22082v1 Announce Type: new Abstract: As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap thr…
arXiv cs.AI
TIER_1·Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca·
arXiv:2604.22306v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has …
Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling decla…
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears ac…
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a sel…
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high …
Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLMs' performance on thousand-word, open-ended writing is inadequately asses…
Ahead of AI (Sebastian Raschka)
TIER_1·Sebastian Raschka, PhD·
Why build LLMs from scratch? It's probably the best and most efficient way to learn how LLMs really work. Plus, many readers have told me they had a lot of fun doing it.
TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space. Authors: Francisco Pernice (MIT), Santiag…
arXiv:2605.06939v1 Announce Type: cross Abstract: LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed es…
arXiv stat.ML
TIER_1·Nicolas Menet, Andreas Krause, Abbas Rahimi·
arXiv:2605.07775v1 Announce Type: cross Abstract: Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS (Policy Ensembles for Thompson Sampling), a novel …
LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliabili…
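One family of bias-corrected estimators of this kind debiases the judge's mean with a small human-audited subset (prediction-powered-inference style). A minimal sketch under that assumption; the paper's exact estimators may differ:

```python
# Debias an LLM judge's mean score using a small paired human-labeled subset.
# Prediction-powered-inference style; illustrative only.
import numpy as np

def corrected_mean(judge_all, judge_labeled, human_labeled):
    """judge_all: judge scores on the full eval set; judge_labeled and
    human_labeled: paired scores on the audited subset."""
    bias = np.mean(judge_labeled) - np.mean(human_labeled)
    return np.mean(judge_all) - bias
```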
Epistemic Status: Written over the course of a couple days at Inkhaven (https://inkhaven.blog/). Some of the info is old so some newer papers are excluded. TL;DR: People tal…
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level…
Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded Na…
arXiv:2605.00358v1 Announce Type: cross Abstract: LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreadi…
LLM parameter editing methods commonly rely on computing an ideal target hidden-state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although widely used …
Introduction. Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoi…
arXiv:2604.22939v1 Announce Type: cross Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this perform…
**Thinking Machines** recently raised **$2 billion** before shipping a product; they have now launched their first product, **Tinker**, a managed service API for fine-tuning large and mixture-of-experts models like **Qwen-235B-A22B** using **LoRA** for cost-efficient training. The T…
**Meta AI** introduces the **Byte Latent Transformer (BLT)**, a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming **Llama 3** on benchmarks including the CUTE benchmark. The model was trained on approximately **1 trill…
[LinkedIn discussion (https://www.linkedin.com/posts/chiphuyen_llm-airesearch-generativeai-activity-7097619722363408385-s5Cp), Twitter thread (https://twitter.com/chipro/status/1691858084824838427)] Never before in my life had I seen so …
This document curates the most common questions Shreya and I received while (https://bit.ly/evals-ai) …
Earlier this year, I wrote Your AI product needs evals (https://hamel.dev/blog/posts/evals/). Many of you …
Today, we are releasing Mastering LLMs (https://parlance-labs.com/education/), a set of workshops and talk…
Hacker News — AI stories ≥50 points
TIER_1·khurdula·
Small changes in prompts can create large changes in the output behavior of generative AI models. Add to that the confusion around proper evaluation of LLM applications, and you have a recipe for confusion and frustration. Raza and the Humanloop team have been diving into thes…
<div class="medium-feed-item"><p class="medium-feed-link"><a href="https://pub.aimind.so/building-a-prompt-regression-suite-for-our-customer-facing-llm-app-22f0b27b7301?source=rss------mlops-5">Continue reading on AI Mind »</a></p></div>
<div class="medium-feed-item"><p class="medium-feed-snippet">Perkembangan Large Language Models (LLM) seperti GPT, LLaMA, dan Mistral membuka banyak peluang dalam pengembangan aplikasi berbasis…</p><p class="medium-feed-link"><a href="https://medium.com/@ditafebyindriani14…
Reliability concerns for LLM agents are typically bundled into one heavy framework that asks you to adopt prompting, tool routing, and runtime governance as a single dependency. Production teams want them à la carte. They want small primitives they can drop in around existing …
**55.6%.** That's DeepSeek-R1's pass@1 on EmbedBench when it gets a circuit schematic alongside the task description. 50.0% without the schematic. Best score from the best reasoning model on the first comprehensive benchmark for LLMs in embedded systems dev…
Lobsters — AI tag
TIER_1·pipevals.com by gesposito·
"If you cannot measure it, you cannot route it. Why offline evaluation is the difference between a code reviewer that improves over time and one the team dismisses within a sprint." Chat evaluations are vibes-based: thumbs-up on "was this helpfu…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tuning-strategies.html). For the full version with working code examples and related articles, visit the original post.
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/ai-prompt-chaining.html). For the full version with working code examples and related articles, visit the original post.
When researchers scale a language model — more parameters, more layers, wider hidden dimensions — there's an implicit assumption: a bigger model can represent more things. More expressiveness, more knowledge, better predictions. Mostly this is true. But there's a structural ce…
Foreword. In 2026, open-source LLMs aren't lab experiments anymore. Meta's Llama 4, Alibaba's Qwen 3, DeepSeek-R1 from China — they've caught up with or beaten closed-source models on many benchmarks. And thanks to tools like Ollama and llama.cpp, anyone with a mid-r…
Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the mid…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/llm-evaluation-benchmarks.html). For the full version with working code examples and related articles, visit the original post.
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/function-calling-guide.html). For the full version with working code examples and related articles, visit the original post.
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/fine-tune-open-source-llm.html). For the full version with working code examples and related articles, visit the original post.
The bug that took two weeks to surface. A few months back I shipped a feature that used a language model to summarize support tickets and suggest responses. Internal QA loved it. The demo went great. Two weeks after launch, our support lead pinged me on Slack: "Are t…
I shipped a structured-output endpoint to production in March. The schema was clean, JSON mode was on, the model was GPT-4.1, the eval suite was green. Three weeks in, the on-call channel lit up because a downstream billing job had silently skipped 4,200 records over a weekend…
I have been spending the last few months wiring up a deterministic reliability stack for structured LLM pipelines. Today, LLM Contract Check (locc) and Release Governor went live on PyPI. EGA went live last week. The stack is straightforward: LLM Contract C…
Stop Shipping Hallucinations: Automating RAG Faithfulness with Spring AI 1.2. If you're still "vibe-checking" your RAG outputs in 2026, you're not an engineer; you're a gambler. Enterprise-grade AI isn't about getting a cool demo—it's about proving your model isn't h…
Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough? Not good enough in the abstract benchmarks-on-a-leaderbo…
<p><a href="https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment">Yesterday</a> we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…
<p><a href="https://dev.to/posts/from-idea-to-infrastructure-standing-up-a-self-hosted-ai-dev-environment">Yesterday</a> we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is runn…
<p>Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough?</p> <p>Not good enough in the abstract benchmarks-on-a-leaderbo…
I shipped my fourth LLM agent to production last quarter. By month two, the eval suite that "passed in CI" was the reason a regression made it to a customer. The tests were green. But they were green for the wrong reason — every assertion was a single LLM call against a…
dev.to — LLM tag
TIER_1·NaveenKumar Namachivayam ⚡·
<p id="p-rc_9231198f56807c04-27">In the current AI gold rush, the conversation has shifted from "Can it do the task?" to "How efficiently can it do the task?" For engineers moving Large Language Models (LLMs) into production, the "vibe check" is no longer sufficient. You need har…
Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team (https://www.amazon.com/dp/B0GYLHMLMT)
Also by me: Thinking in Go (2-book series) …
📰 Systematic Prompting in 2026: Negative Constraints & Structured JSON for LLM Reliability
Systematic prompting is transforming how developers engineer LLM interactions, with negative constraints, structured JSON outputs, and multi-hypothesis sampling emerging as critical techniq…
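A minimal sketch of the structured-JSON-plus-negative-constraints pattern the item describes; `call_llm` is an assumed stand-in for any chat-completion client, not a real API:

```python
# Pin the output to a schema, forbid extras in the prompt, validate, retry.
import json

PROMPT = ('Return ONLY a JSON object {"sentiment": "pos"|"neg"|"neutral", '
          '"confidence": <float 0-1>}. Do NOT add prose, markdown fences, '
          'or extra keys.\n\nText: ')

def classify(text, call_llm, max_retries=2):
    for _ in range(max_retries + 1):
        raw = call_llm(PROMPT + text)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue                          # constraint violated; retry
        if obj.get("sentiment") in {"pos", "neg", "neutral"}:
            return obj
    return {"sentiment": "neutral", "confidence": 0.0}   # safe fallback
```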