
Pulse

last 48h
[50/1912] 89 sources

What AI is actually talking about — clusters surfacing on Bluesky, Reddit, HN, Mastodon and Lobsters, re-ranked to elevate originality and crush noise.

  1. The first two custom silicon chips designed by Microsoft for its cloud

    Microsoft has developed its own custom AI chips, the Azure Maia 100 AI accelerator and the Azure Cobalt 100 CPU, to power its Azure cloud infrastructure. These in-house designed chips aim to reduce reliance on third-party providers like Nvidia and optimize performance and cost for AI workloads, including training and inference for large language models. The Maia chip is being developed in collaboration with OpenAI, with CEO Sam Altman highlighting its potential to make model training more capable and affordable. AI

    IMPACT Microsoft's custom silicon for Azure aims to reduce AI training costs and improve performance, potentially impacting cloud infrastructure economics.

  2. Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

    Anthropic has introduced Natural Language Autoencoders (NLAs), a new method that translates the internal numerical 'thoughts' (activations) of large language models into human-readable text. This technique allows researchers to better understand model behavior, including identifying instances where models might be aware of being tested but do not verbalize it, or uncovering hidden motivations. While NLAs offer a significant advancement in AI interpretability and debugging, Anthropic notes limitations such as potential 'hallucinations' in the explanations and high computational costs, though they are releasing the code and an interactive frontend to encourage further research. AI

    IMPACT Enables deeper understanding of LLM internal states, potentially improving safety, debugging, and trustworthiness.

  3. The AI industry spent 17x more on Nvidia chips than it brought in in revenue

    The AI sector's expenditure on Nvidia chips significantly outpaced its revenue generation, with a reported 17x difference. This highlights a substantial investment phase in AI infrastructure, potentially indicating a focus on future growth and capability development over immediate profitability. The data suggests a considerable capital outlay is being made to acquire the necessary hardware for training and deploying advanced AI models. AI

    IMPACT Indicates a heavy investment phase in AI infrastructure, potentially signaling future capability advancements.

  4. USAF Test Pilot School, DARPA announce aerospace machine learning breakthrough

    The USAF Test Pilot School and DARPA have announced a breakthrough in aerospace machine learning: the development and successful testing of a new AI system intended to improve decision-making and operational efficiency for military aircraft operating in complex aerial environments. AI

    IMPACT Potential to enhance military aviation capabilities through advanced AI decision-making.

  5. Apple's On-Device and Server Foundation Models

    Apple has detailed its new foundation language models powering Apple Intelligence, including a ~3 billion parameter on-device model and a larger server-based model. These models are designed for multilingual and multimodal tasks, supporting image understanding and tool execution. The company emphasizes its Responsible AI approach, focusing on user privacy through innovations like Private Cloud Compute and on-device processing, ensuring user data is not used for training. AI

    IMPACT Apple's detailed technical report on its foundation models may influence the development of efficient on-device and specialized server-based AI systems.

  6. A Visual Introduction to Machine Learning (2015)

    This collection of resources offers a broad overview of machine learning, from foundational concepts and visual introductions to theoretical underpinnings and practical applications. It includes a visual guide to classification tasks, a discussion on the science and ethics of machine learning benchmarks, and pointers to comprehensive textbooks and course materials. Additionally, it highlights tools for interpretable machine learning and the engineering practices required for deploying models in production. AI

    IMPACT Provides foundational knowledge and practical tools for understanding, developing, and deploying machine learning models.

  7. Vultr Raises $333M at $3.5B Valuation

    Vultr, a cloud computing provider focused on AI workloads, has secured $333 million in funding at a $3.5 billion valuation. The investment round was led by existing investor Thoma Bravo. The company plans to use the funds to expand its global infrastructure and enhance its AI-specific offerings. AI

    IMPACT Expansion of Vultr's infrastructure could lower costs and increase accessibility for AI development and deployment.

  8. Anthropic raising funding valuing it at $60B

    Anthropic is reportedly in talks to raise a significant funding round that would value the AI company at approximately $60 billion. This potential investment comes as the company continues to develop its large language models and compete in the rapidly evolving AI landscape. The substantial valuation underscores the high investor interest in cutting-edge AI development. AI

    IMPACT Confirms continued high investor confidence and capital flow into frontier AI development.

  9. Show HN: Tracecat – Open-source security alert automation / SOAR alternative

    Tracecat is an open-source security automation platform designed for teams and AI agents. It lets users build automations using prompts and a choice of AI models, integrate custom Python scripts, and offers workflow management, case tracking, and over 100 pre-built connectors. It emphasizes security through sandboxing and durable execution via Temporal, and is available for self-hosting with options for an enterprise license or a managed cloud offering. AI

    IMPACT Enhances security operations by enabling AI agents to automate complex tasks and integrate with existing systems.

  10. OpenAI Selects Oracle Cloud Infrastructure to Extend Microsoft Azure AI Platform

    OpenAI has entered into a new agreement to utilize Oracle Cloud Infrastructure (OCI) for its artificial intelligence workloads. This partnership aims to expand OpenAI's existing AI platform, which is primarily hosted on Microsoft Azure. The collaboration will leverage OCI's high-performance computing capabilities to support OpenAI's growing demand for AI training and inference. AI

    IMPACT Expands AI training and inference capacity by diversifying cloud infrastructure providers.

  11. 1-Bit AI Infrastructure

    Researchers have developed a software stack to enable fast and lossless inference of 1-bit Large Language Models (LLMs) like BitNet b1.58 on CPUs. The new infrastructure achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, depending on model size. The goal is to make LLMs more efficient and deployable on a wider range of devices. AI

    IMPACT Enables more efficient and widespread deployment of LLMs on consumer hardware.
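
    The speedups follow from the weight format: b1.58 means every weight is constrained to {-1, 0, +1} (log2 3 ≈ 1.58 bits), so matrix multiplication reduces to additions and subtractions. A minimal NumPy sketch of the absmean ternarization described for BitNet b1.58 (function names are ours, not the paper's):

      import numpy as np

      def ternarize_absmean(w, eps=1e-8):
          # BitNet b1.58-style quantization: scale by the mean absolute
          # weight, then round-and-clip onto the ternary grid {-1, 0, +1}.
          gamma = np.abs(w).mean() + eps
          q = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
          return q, gamma

      def ternary_matmul(x, q, gamma):
          # Multiplications by {-1, 0, +1} become adds/subtracts in real
          # kernels; NumPy still multiplies, but the result is identical.
          return (x @ q.astype(x.dtype)) * gamma

      w = np.random.randn(256, 256).astype(np.float32)
      x = np.random.randn(1, 256).astype(np.float32)
      q, gamma = ternarize_absmean(w)
      print("mean error:", np.abs(x @ w - ternary_matmul(x, q, gamma)).mean())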

  12. Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks

    Researchers have developed a novel analog network of variable resistors, implemented with transistors, capable of performing machine learning tasks without a traditional processor. The system can learn and adapt to new tasks, demonstrating potential for highly energy-efficient computation. While currently a prototype, the technology shows promise for edge devices and could eventually outperform conventional digital processors for specific machine learning workloads. AI

    IMPACT This research could lead to more energy-efficient AI hardware, particularly for edge computing applications.

  13. How To Debug Axolotl

    Hamel Husain has published a guide on debugging the Axolotl project, a tool for fine-tuning large language models. The guide offers practical tips such as simplifying test scenarios, using smaller datasets and models, and clearing caches to expedite the debugging process. It also provides specific configurations for debugging with VSCode, including settings for data preprocessing and remote host development. AI

  14. Agents

    Chip Huyen's latest post, adapted from her book "AI Engineering," explores the concept of intelligent agents, defining them as entities that perceive and act within an environment. These agents leverage the advanced capabilities of foundation models and can be augmented with tools to perform complex tasks. The post also delves into agent planning, tool selection, and methods for evaluating their performance and potential failure modes. AI

  15. Writing Robust Tests for Data & Machine Learning Pipelines

    Eugene Yan's article explores methods for creating more resilient tests for data and machine learning pipelines. The author discusses why existing tests often fail even when new code is correct, attributing this to the brittle nature of tests themselves. Yan proposes strategies to improve pipeline testing by examining different testing scopes like unit and integration tests, and analyzing the impact of new data and logic on test validity. AI
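
    A concrete flavor of the fix: assert on invariants of the output (schema, uniqueness, value ranges) rather than exact values, so that correct new data doesn't break the suite. A small pytest-style sketch; load_users and its columns are hypothetical:

      import pandas as pd

      def load_users():
          # Hypothetical pipeline stage under test.
          return pd.DataFrame({"user_id": [1, 2, 3], "age": [23, 41, 35]})

      def test_users_schema_and_invariants():
          df = load_users()
          # Contract on structure, not on exact row contents.
          assert {"user_id", "age"} <= set(df.columns)
          # Invariants that should hold for any new batch of data.
          assert df["user_id"].is_unique
          assert df["age"].between(0, 120).all()
          assert len(df) > 0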

  16. How to Write Data Labeling/Annotation Guidelines

    Writing effective data labeling guidelines requires careful consideration of several key questions to ensure accuracy and consistency. These guidelines should clearly articulate the task's importance, define its scope and terminology, and provide step-by-step instructions for annotators. Including examples, explanations of user intent, and definitions of terms like 'query' and 'locale' helps calibrate annotators and improve inter-rater reliability. The process also involves explaining how to use annotation tools and platforms, and addressing logistical aspects of the task. AI

  17. Raspberry-LLM - Making My Raspberry Pico a Little Smarter

    Eugene Yan developed Raspberry-LLM, a project that integrates a large language model with a Raspberry Pi Pico, a low-resource microcontroller. This setup allows the device to interact with external data sources like RSS feeds and generate content despite severe memory constraints of only 8 KB. The project required custom solutions for parsing data character by character and managing memory usage, showcasing innovative approaches to running LLM-related tasks in highly constrained environments. AI

  18. Evaluation & Hallucination Detection for Abstractive Summaries

    Evaluating abstractive summarization, which involves rephrasing source material rather than copying sentences, presents challenges, particularly in assessing relevance and factual consistency. While fluency and coherence are largely addressed by modern language models, measuring relevance remains subjective. Detecting factual inconsistencies, or hallucinations, is a key focus, with studies indicating significant error rates in generated summaries, such as up to 30% in CNN/DailyMail datasets. Common evaluation methods include n-gram-based metrics like ROUGE and embedding-based metrics, alongside techniques like natural language inference and question-answering for hallucination detection. AI
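
    Of the metrics above, ROUGE-N is the simplest to pin down: it is n-gram recall of the reference against the candidate summary. A whitespace-tokenized sketch (the official implementation adds stemming and other details):

      from collections import Counter

      def rouge_n(candidate, reference, n=1):
          # Matched reference n-grams / total reference n-grams (recall).
          def ngrams(text):
              toks = text.lower().split()
              return Counter(zip(*[toks[i:] for i in range(n)]))
          cand, ref = ngrams(candidate), ngrams(reference)
          total = sum(ref.values())
          overlap = sum(min(count, cand[g]) for g, count in ref.items())
          return overlap / total if total else 0.0

      print(rouge_n("the cat sat on the mat", "a cat sat on a mat"))  # 4/6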

  19. GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

    Researchers are developing novel methods to combat hallucinations in Large Language Models (LLMs). Several papers propose new frameworks and techniques, including LaaB, which bridges neural features and symbolic judgments, and CuraView, a multi-agent system for medical hallucination detection using GraphRAG. Other approaches focus on neuro-symbolic agents for hallucination-free requirements reuse, adaptive unlearning for surgical hallucination suppression in code generation, and harnessing reasoning trajectories via answer-agreement representation shaping. Additionally, new benchmarks like HalluScan are being created to systematically evaluate detection and mitigation strategies. AI

    IMPACT New research offers diverse strategies to improve LLM factual accuracy, crucial for reliable deployment in sensitive domains like healthcare and code generation.

  20. Language Modeling Reading List (to Start Your Paper Club)

    Eugene Yan has compiled a reading list of fundamental language modeling papers, intended to facilitate group study sessions. The list includes seminal works like "Attention Is All You Need," "BERT," and "GPT-3," each accompanied by a concise summary highlighting its core contribution. Yan also provides guidance on how to approach reading research papers and encourages community contributions to refine the list. AI

  21. How to Generate and Use Synthetic Data for Finetuning

    Synthetic data, generated by models or simulations rather than real-world sources, offers a faster and more cost-effective alternative to human annotation for fine-tuning AI models. This approach can lead to improved model performance and generalization while also mitigating privacy and copyright concerns. Two primary methods for generating synthetic data include distillation from a more capable model and self-improvement techniques where a model refines its own output. These methods can be applied to pretraining, instruction-tuning, and preference-tuning to enhance various aspects of a model's capabilities. AI
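
    A minimal sketch of the distillation route: have a stronger teacher model write tasks and then answer them, keeping the pairs as supervised fine-tuning data. The complete() call stands in for whatever LLM client is used; everything here is illustrative:

      import json

      def complete(prompt):
          # Stub for a call to the stronger teacher model (API client, etc.).
          raise NotImplementedError

      def distill_pairs(topics, n_per_topic=3):
          pairs = []
          for topic in topics:
              for _ in range(n_per_topic):
                  instruction = complete(
                      f"Write one clear, self-contained task about {topic}.")
                  response = complete(instruction)  # teacher answers its own task
                  pairs.append({"instruction": instruction, "response": response})
          return pairs

      # The serialized pairs become the student's fine-tuning set:
      # with open("synthetic_sft.jsonl", "w") as f:
      #     for p in distill_pairs(["unit conversion", "regex debugging"]):
      #         f.write(json.dumps(p) + "\n")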

  22. Netflix PRS 2024 - Applying LLMs to Recommendation Experiences

    Eugene Yan, a Senior Applied Scientist at Amazon, presented at the 2024 Netflix Workshop on Personalization, Recommendation, and Search. His talk focused on the practical challenges encountered when developing and deploying large language model (LLM)-powered recommendation systems at a consumer scale. The workshop featured discussions on various topics including LLM evaluation, generative recommendations, conversational recommendation systems, and personalization strategies from companies like Meta, Google, Airbnb, and Spotify. AI

  23. Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge

    Eugene Yan, a judge at the Weights & Biases LLM-Evaluator Hackathon, shared insights from the event where over 100 participants built creative projects. Teams focused on areas like knowledge graph construction, LLM evaluation on personality traits, and optimizing prompts. Yan discussed key considerations for using LLM evaluators, including scoring methods and performance metrics, and was impressed by the teams' rapid progress over the weekend. AI

  24. NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

    Recent research explores novel methods to enhance the reasoning capabilities and efficiency of large language models (LLMs). Papers introduce techniques like speculative exploration for Tree-of-Thought reasoning to break synchronization bottlenecks and achieve significant speedups. Other work focuses on improving tool-integrated reasoning by pruning erroneous tool calls at inference time and developing frameworks for robots to perform physical reasoning in latent spaces before acting. Additionally, research investigates the effectiveness of different reasoning protocols, such as debate and voting, for LLMs, finding that while some methods improve safety, they don't always enhance usefulness. AI

    IMPACT New methods for efficient reasoning and tool integration could enhance LLM performance and applicability in complex tasks.

  25. Complex Systems are Hard to Control

    Deep learning systems are complex adaptive systems, similar to ecosystems or financial markets, making them difficult to control through traditional engineering approaches. These systems exhibit emergent behaviors and feedback loops, leading to unintended consequences when straightforward attempts are made to guide their actions. The author suggests that safety measures must account for this complex adaptive nature, moving beyond simple reliability and redundancy. AI

  26. Introducing Transluce — A Letter from the Founders

    Transluce, a new independent research lab (announced in a founders' letter on the Bounded Regret blog), is building a suite of AI-driven tools to analyze and understand complex AI systems. These tools aim to provide scalable, open-source methods for inspecting AI behavior and representations, addressing the opacity of current models. Transluce intends to establish industry standards for trustworthy AI by making these analysis technologies publicly available for vetting and improvement, with initial applications on open-weight models and plans to collaborate with major AI labs and governments. AI

  27. MM1: Apple's first Large Multimodal Model

    Researchers have developed Cornserve, an open-source distributed serving system designed to efficiently handle any-to-any multimodal models, which can process and generate combinations of various data types like text, images, and audio. The system improves throughput by up to 3.81x and reduces tail latency by 5.79x by disaggregating model components and scaling them independently. Separately, a new evaluation framework called XTC-Bench has been introduced to assess the cross-task consistency of unified multimodal models, revealing that high performance in individual tasks does not guarantee semantic alignment across them. AI

    IMPACT New systems and evaluation frameworks for multimodal AI aim to improve efficiency and consistency in handling diverse data types.

  28. v0.20.1rc0: Add system_fingerprint field to OpenAI-compatible API responses (#40537)

    Several AI labs have released new open-weight models, including Alibaba's Qwen3.6-27B, which claims to outperform larger models on coding benchmarks, and Xiaomi's MiMo-V2.5 series, featuring enhanced agentic capabilities and multimodality. OpenAI has also open-sourced a privacy filter model for PII detection, targeting infrastructure needs. Additionally, Anthropic has launched Claude Design, a new tool for generating prototypes and presentations powered by Claude Opus 4.7, signaling a move into design tooling. AI

    IMPACT New open-source models and agentic tools are increasing competition and lowering barriers for AI development and deployment.

  29. 🚀 Accelerating LLM Inference with TGI on Intel Gaudi

    Google Research has introduced "speculative cascades," a novel method to enhance Large Language Model (LLM) efficiency by merging speculative decoding with standard cascades. This hybrid approach aims to reduce computational costs and inference latency without compromising output quality. By strategically using smaller models to predict tokens and then verifying them with larger models, speculative cascades offer improved cost-quality trade-offs compared to either technique used in isolation, as demonstrated with Gemma and T5 models. AI

    IMPACT New inference techniques like speculative cascades and KV cache compression could significantly reduce operational costs for LLM deployments.
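
    Plain speculative decoding, one half of the hybrid, is easy to sketch: a small model drafts k tokens, and a single large-model pass keeps the longest prefix the large model agrees with, leaving greedy output unchanged. The model interfaces below are stand-ins:

      def speculative_decode(draft_next, target_greedy, prompt, k=4, max_new=32):
          # draft_next(tokens) -> next token id from the small draft model.
          # target_greedy(tokens) -> for each position p, the large model's
          #     greedy next token given tokens[:p+1] (one forward pass).
          tokens = list(prompt)
          while len(tokens) - len(prompt) < max_new:
              draft = []
              for _ in range(k):                        # 1. draft cheaply
                  draft.append(draft_next(tokens + draft))
              verified = target_greedy(tokens + draft)  # 2. verify in one pass
              n = len(tokens)
              accepted = 0
              for i, tok in enumerate(draft):
                  if verified[n - 1 + i] == tok:
                      accepted += 1
                  else:
                      break
              tokens += draft[:accepted]
              # 3. Take the large model's token at the first disagreement (or
              #    after full acceptance), matching pure greedy decoding.
              tokens.append(verified[n - 1 + accepted])
          return tokens

    The cascade half adds a quality gate on top: easy queries are answered by the small model outright, and only the rest fall through to the large one.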

  30. Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

    Researchers are developing advanced quantization techniques to make large language models (LLMs) more efficient. New methods like AutoRound, LATMiX, and GSQ aim to reduce model size and computational requirements, enabling deployment on less powerful hardware. These approaches focus on optimizing how model weights and activations are represented at lower bit-widths, with some achieving accuracy comparable to higher-precision models. Innovations include novel calibration strategies for post-training quantization and learnable affine transformations to improve robustness. AI

    IMPACT Enables more efficient deployment of LLMs on resource-constrained devices, potentially lowering inference costs and increasing accessibility.
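
    The baseline all of these methods improve on is plain round-to-nearest quantization with a calibrated per-channel scale; AutoRound, for instance, then learns better rounding decisions on a small calibration set. The baseline in NumPy:

      import numpy as np

      def quantize_symmetric(w, bits=4):
          # Per-output-channel symmetric round-to-nearest quantization.
          qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
          scale = np.abs(w).max(axis=1, keepdims=True) / qmax
          q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
          return q, scale

      def dequantize(q, scale):
          return q.astype(np.float32) * scale

      w = np.random.randn(8, 64).astype(np.float32)
      q, scale = quantize_symmetric(w)
      print("mean abs error:", np.abs(w - dequantize(q, scale)).mean())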

  31. A Dive into Vision-Language Models

    Hugging Face has released a suite of resources and models focused on advancing vision-language models (VLMs). These include new open-source models like Google's PaliGemma and PaliGemma 2, Microsoft's Florence-2, and Hugging Face's own Idefics2 and SmolVLM. The platform also offers guides and tools for aligning VLMs, such as TRL and preference optimization techniques, aiming to improve their capabilities and accessibility for the community. AI

    IMPACT Expands the ecosystem of open-source vision-language models and provides tools for their alignment and fine-tuning.

  32. PRX Part 3 — Training a Text-to-Image Model in 24h!

    Researchers have developed ANCHOR, a large-scale dataset of over 70,000 abstractive captions designed to evaluate text-to-image synthesis models on complex, real-world prompts. Analysis using ANCHOR revealed that current models struggle with understanding multiple subjects, contextual reasoning, and nuanced grounding. To address these limitations, the Subject-Aware Fine-tuning (SAFE) method was proposed, which utilizes LLMs to extract key subjects and enhance their representation within the model's embeddings, leading to improved image-caption consistency. AI

    IMPACT New datasets and fine-tuning methods like ANCHOR and SAFE aim to improve text-to-image model performance on complex prompts, addressing current limitations in subject understanding and context.

  33. AI leaderboards are no longer useful. It's time to switch to Pareto curves.

    AI leaderboards for evaluating code generation systems are becoming less useful due to a lack of cost considerations. Researchers argue that current benchmarks often overlook the significant expenses associated with complex AI agents that repeatedly invoke language models. Instead, they propose using Pareto curves to visualize the trade-off between accuracy and cost, as simple baseline agents can sometimes achieve comparable results at a fraction of the price. AI
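
    Computing the frontier from benchmark runs takes a few lines: keep every system that no cheaper system matches on accuracy. The names and numbers below are made up:

      def pareto_frontier(runs):
          # runs: (name, cost_usd, accuracy) tuples. A run is dominated if
          # some run is cheaper (or equal) and at least as accurate.
          frontier, best_acc = [], float("-inf")
          for name, cost, acc in sorted(runs, key=lambda r: (r[1], -r[2])):
              if acc > best_acc:      # beats everything cheaper than it
                  frontier.append((name, cost, acc))
                  best_acc = acc
          return frontier

      runs = [("baseline", 0.8, 0.62), ("retry-5x", 4.1, 0.64),
              ("agent-A", 22.0, 0.71), ("agent-B", 35.0, 0.69)]
      print(pareto_frontier(runs))   # agent-B is dominated by agent-A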

  34. Start reading the AI Snake Oil book online

    The book "AI Snake Oil" by Normal Tech AI, published in September 2024, aims to demystify artificial intelligence by identifying hype and harmful applications. It distinguishes between different AI types, such as predictive and generative AI, and examines their real-world impacts and limitations. The authors explore why AI hype persists and offer a framework for understanding AI's future, building on their previous work. AI

  35. Can AI automate computational reproducibility?

    Researchers have developed AutoReproduce, a multi-agent framework designed to automatically reproduce AI experiments from research papers. This system utilizes a "paper lineage" to mine implicit knowledge from cited literature and employs a sampling-based unit testing strategy to ensure code executability. A new benchmark, CORE-Bench, has also been introduced to evaluate AI's capability in automating computational reproducibility. Initial tests show that while specialized agents like CORE-Agent with GPT-4o achieve 22% accuracy on difficult tasks, there is significant room for improvement in AI's ability to handle complex computational environments. AI

  36. Does the UK’s liver transplant matching algorithm systematically exclude younger patients?

    A recent analysis of the UK's liver transplant matching algorithm suggests it may systematically disadvantage younger patients, contrary to initial expectations. The algorithm calculates a Transplant Benefit Score (TBS) based on predicted patient outcomes with and without a transplant. Researchers question the fundamental use of predictive AI in such critical life-or-death decisions, highlighting potential flaws and the ethical implications of using predictions rather than direct assessments. AI

  37. Some Math behind Neural Tangent Kernel

    Lilian Weng's blog post delves into the mathematical underpinnings of the Neural Tangent Kernel (NTK), a concept used to explain the training dynamics of neural networks. The post focuses on NTK's definition and proofs, particularly how infinitely wide neural networks converge to a global minimum during gradient descent. It reviews foundational mathematical concepts like vector-to-vector derivatives, ordinary differential equations, the Central Limit Theorem, and Taylor expansions, which are essential for understanding NTK. AI
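
    For reference, the kernel itself is just the Gram matrix of parameter gradients: for a network f(x; \theta), the empirical NTK is

      \Theta(x, x') = \nabla_\theta f(x; \theta)^\top \, \nabla_\theta f(x'; \theta)

    Under gradient flow on squared loss, the outputs evolve as \dot{f}(x) = -\sum_i \Theta(x, x_i) (f(x_i) - y_i). The result Weng builds up to is that as width goes to infinity, \Theta stays fixed at its initialization value, so training reduces to kernel regression with a constant kernel, which is what yields the global-convergence guarantee.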

  38. Prompt Engineering

    Prompt engineering, also known as in-context prompting, involves guiding Large Language Models (LLMs) to achieve desired outcomes without altering their underlying weights. This empirical field focuses on autoregressive language models and aims to improve alignment and steerability. Basic techniques include zero-shot learning, where the model is given a task directly, and few-shot learning, which provides examples to better guide the model's understanding and performance. AI
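
    The two basic techniques differ only in whether demonstrations are included. A tiny helper that builds either kind of prompt (the task and labels are illustrative):

      def build_prompt(query, examples=()):
          # Zero-shot when examples is empty; few-shot otherwise.
          lines = ["Classify the sentiment as positive or negative.", ""]
          for text, label in examples:
              lines += [f"Text: {text}", f"Sentiment: {label}", ""]
          lines += [f"Text: {query}", "Sentiment:"]
          return "\n".join(lines)

      print(build_prompt(
          "The plot dragged but the acting was superb.",
          examples=[("I loved this movie.", "positive"),
                    ("Utterly disappointing.", "negative")]))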

  39. LLM Powered Autonomous Agents

    Lilian Weng's blog post details the architecture of LLM-powered autonomous agents, highlighting key components like planning, memory, and tool use. The post explains how agents can break down complex tasks, reflect on past actions for improvement, and utilize external tools or vector stores for information retrieval. Techniques such as Chain of Thought and Tree of Thoughts are discussed for task decomposition, while ReAct is presented as a method for integrating reasoning and action. AI
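
    The ReAct pattern interleaves model-generated Thought/Action steps with tool observations until the model emits a final answer. A bare-bones loop; the llm callable and the Action syntax are stand-ins for whatever convention the prompt establishes:

      import re

      def react_loop(llm, tools, question, max_steps=5):
          # llm(transcript) -> text ending in either
          #   "Action: tool_name[input]"  or  "Final Answer: ..."
          # tools: dict mapping tool name -> callable(str) -> str
          transcript = f"Question: {question}\n"
          for _ in range(max_steps):
              step = llm(transcript)
              transcript += step + "\n"
              if "Final Answer:" in step:
                  return step.split("Final Answer:", 1)[1].strip()
              m = re.search(r"Action: (\w+)\[(.*)\]", step)
              if m:
                  name, arg = m.groups()
                  observation = tools[name](arg)    # act, then feed back
                  transcript += f"Observation: {observation}\n"
          return None  # step budget exhausted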

  40. Adversarial Attacks on LLMs

    Researchers are developing new methods to enhance the safety and robustness of large language models against adversarial attacks. These attacks, often in the form of carefully crafted prompts, aim to bypass built-in safety mechanisms and elicit undesirable outputs. Efforts include creating guardrails like AprielGuard and developing leaderboards to track and improve model security against such vulnerabilities. AI

  41. Diffusion Models for Video Generation

    Researchers are exploring advanced diffusion models for video generation, addressing challenges like temporal consistency and data scarcity. New methods focus on improving parameterization, such as the v-prediction technique, and incorporating conditional sampling for tasks like extending video length or filling missing frames. Efforts are also underway to enhance efficiency and controllability through post-training frameworks, hybrid attention mechanisms, and semantic-visual adaptation, aiming for real-time generation and higher quality outputs. AI

    IMPACT Advances in diffusion models are improving video generation quality, efficiency, and controllability, potentially enabling new applications in content creation and analysis.
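
    For context on the v-prediction parameterization mentioned above: instead of predicting the noise \epsilon or the clean sample x_0, the network predicts the velocity

      v \equiv \alpha_t \, \epsilon - \sigma_t \, x_0

    where \alpha_t and \sigma_t are the signal and noise coefficients of the forward process at timestep t (Salimans & Ho, 2022). The target stays well-conditioned at both very low and very high noise levels, one reason it carries over to video diffusion.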

  42. Extrinsic Hallucinations in LLMs

    Lilian Weng's latest post delves into extrinsic hallucinations in large language models, defining them as generated content that is fabricated and not grounded in provided context or world knowledge. The piece explores how issues in pre-training data and the learning process during fine-tuning can contribute to these factual inaccuracies. Research suggests that while models struggle to learn new information during fine-tuning, attempting to do so can paradoxically increase their tendency to hallucinate. AI

  43. The State Of LLMs 2025: Progress, Problems, and Predictions

    The year 2025 was marked by significant advancements in large language models, particularly in the development of reasoning capabilities. A key breakthrough was DeepSeek's R1 model, which demonstrated that reasoning skills could be effectively trained using reinforcement learning with verifiable rewards (RLVR) and the GRPO algorithm. This approach proved to be more cost-effective than previously thought, with training costs estimated around $5 million. The success of DeepSeek R1 spurred other major LLM developers, both open-weight and proprietary, to release their own reasoning-enhanced models, shifting the focus of LLM development. AI
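
    The "verifiable rewards" in RLVR are programmatic checks rather than a learned reward model, and GRPO turns them into advantages by standardizing within the group of samples drawn for one prompt, dispensing with a value network. A toy sketch (the \boxed{} answer convention varies by setup):

      import re, statistics

      def verifiable_reward(completion, ground_truth):
          # 1.0 iff the extracted final answer matches the reference exactly.
          m = re.search(r"\\boxed\{([^}]*)\}", completion)
          return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

      def grpo_advantages(rewards):
          # Group-relative advantage: standardize rewards across the samples
          # for one prompt (GRPO's replacement for a critic/value model).
          mu = statistics.mean(rewards)
          sigma = statistics.pstdev(rewards) or 1.0
          return [(r - mu) / sigma for r in rewards]

      group = [verifiable_reward(c, "42") for c in
               [r"so the total is \boxed{42}", r"answer: \boxed{41}",
                r"\boxed{42}", "no boxed answer"]]
      print(group, grpo_advantages(group))  # [1, 0, 1, 0] -> [1, -1, 1, -1]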

  44. My Workflow for Understanding LLM Architectures

    OpenAI has introduced the IH-Challenge dataset to train large language models to better prioritize instructions from different sources, such as system messages, developers, and users. This training aims to improve safety steerability and robustness against prompt-injection attacks by teaching models to follow a hierarchy where system instructions are most trusted. The dataset is designed to overcome common pitfalls in reinforcement learning for instruction hierarchy, ensuring models can reliably adhere to safety policies even when faced with conflicting user or tool-generated prompts. AI

    IMPACT Enhances LLM safety and reliability by improving their ability to follow prioritized instructions, reducing risks from prompt injection and policy violations.

  45. In the Arena: How LMSys changed LLM Benchmarking Forever

    The AraGen benchmark, developed by Hugging Face, aims to improve LLM evaluation by addressing limitations of static benchmarks. It introduces a crowdsourced approach similar to LMSys's Chatbot Arena, allowing for more dynamic and user-aligned assessments. This method seeks to capture real-world user preferences and model performance beyond traditional metrics. Additionally, a new open-source OCR model called DharmaOCR has been released, demonstrating strong performance against larger commercial and open-source models. AI

    IMPACT New evaluation methods and specialized open-source models offer improved benchmarking and cost-performance for AI operators.

  46. ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

    OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposure to problems and solutions during training rather than genuine advancements in software engineering skills. OpenAI found that a significant portion of the benchmark's tests incorrectly reject valid solutions, and that many models can reproduce ground-truth solutions verbatim, indicating training data overlap. The company now recommends SWE-bench Pro for evaluations and is developing new, uncontaminated benchmarks. AI

  47. Automating code optimization with LLMs

    Researchers are exploring various methods to enhance Large Language Models (LLMs) for code-related tasks. One study evaluates locally deployed LLMs like LLaMA 3.2 and Mistral for Python bug detection, finding they can identify bugs but struggle with precise localization. Another paper introduces TreeCoder, a framework to optimize LLM code generation by treating decoding strategies and constraints as optimizable components, improving accuracy on benchmarks like MBPP and SQL-Spider. Additionally, a case study at BMW demonstrates how fine-tuning LLMs like Qwen2.5-Coder and DeepSeek-Coder can generate and modify enterprise domain-specific languages across multiple files. Finally, a new approach called CAT uses call-chain awareness to improve LLM-based unit test generation for Java projects, significantly boosting code coverage. AI

    IMPACT Advances in LLM code generation and analysis techniques could lead to more robust and efficient software development tools.

  48. Making LLMs more accurate by using all of their layers

    Google Research has developed a framework to evaluate the alignment of Large Language Models (LLMs) with human behavioral dispositions, using established psychological assessments adapted into situational judgment tests. This approach quantifies model tendencies against human social inclinations, identifying deviations and areas for improvement in realistic scenarios. Separately, Google Research also introduced SLED (Self Logits Evolution Decoding), a novel method that enhances LLM factuality by utilizing all model layers during decoding, thereby reducing hallucinations without external data or fine-tuning. AI

    IMPACT New methods from Google Research offer improved LLM alignment and factuality, potentially increasing trust and reliability in AI applications.
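
    A loose sketch of the mechanism SLED exploits: every decoder layer's hidden state can be projected through the shared LM head (the "logit lens"), and the early-layer consensus blended into the final-layer scores. This is our illustration of the general idea, not Google's actual SLED update rule:

      import torch

      def blend_all_layer_logits(hidden_states, lm_head, weight=0.2):
          # hidden_states: one [hidden_dim] tensor per layer for the current
          # position (last entry = final layer); lm_head: the vocab projection.
          final = lm_head(hidden_states[-1]).log_softmax(-1)
          early = torch.stack([lm_head(h).log_softmax(-1)
                               for h in hidden_states[:-1]]).mean(0)
          return (1 - weight) * final + weight * early  # blended scores

      # Toy shapes only: 12 "layers", hidden size 16, vocab size 100.
      lm_head = torch.nn.Linear(16, 100, bias=False)
      states = [torch.randn(16) for _ in range(12)]
      next_token = blend_all_layer_logits(states, lm_head).argmax().item()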

  49. Towards high-quality (maybe synthetic) datasets

    Google Research has introduced Simula, a framework that treats synthetic data generation as a mechanism design problem. This approach allows for fine-grained control over dataset characteristics like coverage, complexity, and quality, addressing the scarcity of real-world data for specialized AI applications. Separately, Google also presented CTCL, a privacy-preserving synthetic data generation algorithm that avoids the need to fine-tune large language models, making it suitable for resource-constrained environments. AI

    IMPACT New frameworks for synthetic data generation could accelerate AI development in data-scarce domains and improve privacy-preserving techniques.

  50. [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Researchers are developing new benchmarks and evaluation methods for large language models (LLMs) in mathematical reasoning and educational assessment. New datasets like ESTBook and Math-PT aim to go beyond simple accuracy, focusing on pedagogical reasoning and reducing linguistic bias. Other work explores the impact of self-consistency and reasoning effort on automated scoring, with findings suggesting strategic model selection can optimize accuracy and cost. Additionally, frameworks like MaSTer are being created to automatically generate adversarial test cases for evaluating and improving LLM robustness. AI

    IMPACT New benchmarks and evaluation techniques will drive more robust and reliable LLM development for educational and reasoning tasks.