PulseAugur
research · [18 sources]

AI research tackles LLM context, social agents, and evaluation benchmarks

Researchers are developing new methods to evaluate and improve Large Language Models (LLMs). One paper introduces a benchmark for assessing LLMs' contextual understanding and finds that quantized models show performance degradation. Another effort segments human-authored text from LLM-generated content using change point detection, addressing the need to verify authenticity. A third proposes LongSumEval, a framework for evaluating long document summarization that uses question-answering feedback to guide refinement and ensure factual accuracy.
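
The change-point idea behind the co-authorship segmentation work can be illustrated with a minimal sketch. This is not the authors' method, only the generic technique under an assumption: each sentence already carries a hypothetical per-sentence "LLM-likeness" score from some detector, and we look for positions where the mean of that score series shifts, which mark candidate boundaries between human-written and LLM-generated passages.

```python
# Minimal change-point segmentation over per-sentence "LLM-likeness" scores.
# The scores are hypothetical stand-ins for a real detector's output; this is
# an illustrative sketch, not the paper's algorithm.
import numpy as np

def split_cost(scores: np.ndarray) -> float:
    """Sum of squared deviations from the segment mean."""
    return float(np.sum((scores - scores.mean()) ** 2))

def find_change_points(scores: np.ndarray, min_gain: float = 0.5,
                       offset: int = 0) -> list[int]:
    """Recursive binary segmentation: split where the reduction in
    within-segment squared error is largest; stop when the gain is small."""
    n = len(scores)
    if n < 4:
        return []
    base = split_cost(scores)
    best_gain, best_k = 0.0, None
    for k in range(2, n - 1):
        gain = base - split_cost(scores[:k]) - split_cost(scores[k:])
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None or best_gain < min_gain:
        return []
    left = find_change_points(scores[:best_k], min_gain, offset)
    right = find_change_points(scores[best_k:], min_gain, offset + best_k)
    return left + [offset + best_k] + right

# Toy example: the first 6 sentences look human-written (low scores),
# the rest look LLM-generated (high scores).
scores = np.array([0.10, 0.20, 0.15, 0.10, 0.25, 0.20,
                   0.80, 0.90, 0.85, 0.75, 0.90, 0.80])
print(find_change_points(scores))  # -> [6]
```

A real system would use calibrated detector scores and a principled penalty for the number of segments (for example via a change-point library such as ruptures) rather than a hand-tuned gain threshold.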

Summary written by gemini-2.5-flash-lite from 18 sources.

IMPACT Advances in LLM evaluation and refinement are crucial for developing more reliable and trustworthy AI systems across various applications.

RANK_REASON Multiple research papers are presented on evaluating and improving LLM capabilities, including context understanding, text segmentation, and summarization.

Read on Apple Machine Learning Research →

COVERAGE [18]

  1. Apple Machine Learning Research TIER_1 ·

    Can Large Language Models Understand Context?

    Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language …

  2. Hugging Face Blog TIER_1 ·

    Introducing HELMET: Holistically Evaluating Long-context Language Models

  3. arXiv cs.CL TIER_1 · Arnault Chatelain, Étienne Ollion, Qianwen Guan, Diandra Fabre, Lorraine Goeuriot, Emile Chapuis, Abdelkrim Beloued, Marie Candito, Nicolas Hervé, Didier Schwab ·

    BenCSSmark: Making the Social Sciences Count in LLM Research

    arXiv:2605.04886v1 Announce Type: new Abstract: This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks -- standardized tools for assessing com…

  4. arXiv cs.CL TIER_1 · Didier Schwab ·

    BenCSSmark: Making the Social Sciences Count in LLM Research

    This position paper argues that the under-representation of social science tasks in contemporary LLM benchmarks limits advances in both LLM evaluation and social scientific inquiry. Benchmarks -- standardized tools for assessing computational systems -- are pivotal in the develop…

  5. arXiv cs.CL TIER_1 · Mengchu Li, Jin Zhu, Jinglai Li, Chengchun Shi ·

    Segmenting Human-LLM Co-authored Text via Change Point Detection

    arXiv:2605.03723v1 Announce Type: new Abstract: The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification …

  6. arXiv cs.AI TIER_1 (CA) · Önder Gürcan, Moharram Challenger ·

    LLM-enabled Social Agents

    arXiv:2605.02335v1 Announce Type: cross Abstract: Large Language Models (LLMs) have transformed agent-agent and human-agent interaction by enabling software, physical, and simulation agents to communicate and deliberate through natural language. Yet fluent language use does not b…

  7. arXiv cs.CL TIER_1 · Chengchun Shi ·

    Segmenting Human-LLM Co-authored Text via Change Point Detection

    The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insuffic…

  8. arXiv cs.CL TIER_1 · Ewelina Gajewska, Michal Wawer, Katarzyna Budzynska, Jaroslaw A. Chudziak ·

    Who Decides What Is Harmful? Content Moderation Policy Through A Multi-Agent Personalised Inference Framework

    arXiv:2605.01416v1 Announce Type: cross Abstract: The increasing scale and complexity of online platforms raises critical policy questions around harmful content, digital well-being, and user autonomy. Traditional content moderation systems rely on centralised, top-down rules, of…

  9. Hugging Face Daily Papers TIER_1 (CA) ·

    LLM-enabled Social Agents

    Large Language Models (LLMs) have transformed agent-agent and human-agent interaction by enabling software, physical, and simulation agents to communicate and deliberate through natural language. Yet fluent language use does not by itself yield socially intelligible behaviour. Mo…

  10. arXiv cs.CL TIER_1 · Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Haihua Chen, Junhua Ding ·

    LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

    arXiv:2604.25130v1 Announce Type: new Abstract: Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement,…

  11. arXiv cs.CL TIER_1 · Miriam Winkler, Verena Blaschke, Barbara Plank ·

    Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

    arXiv:2603.15130v2 Announce Type: replace Abstract: Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indire…

  12. Hugging Face Daily Papers TIER_1 ·

    LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

    Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications…

  13. arXiv cs.CL TIER_1 · Junhua Ding ·

    LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

    Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications…

  14. arXiv cs.CL TIER_1 · Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam ·

    Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

    arXiv:2604.22294v1 Announce Type: new Abstract: Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections g…

  15. arXiv cs.CL TIER_1 · Youmi Ma, Naoaki Okazaki ·

    From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models

    arXiv:2601.11020v3 Announce Type: replace Abstract: Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improvin…

  16. arXiv cs.CL TIER_1 · Monica S. Lam ·

    Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

    Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documen…

  17. Smol AINews TIER_1 ·

    Context Engineering: Much More than Prompts

    **Context Engineering** emerges as a significant trend in AI, highlighted by experts like **Andrej Karpathy**, **Walden Yan** from **Cognition**, and **Tobi Lutke**. It involves managing an LLM's context window with the right mix of prompts, retrieval, tools, and state to optimiz…

  18. Eugene Yan TIER_1 ·

    Evaluating Long-Context Question & Answer Systems

    Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.