Researchers have introduced ExCyTIn-Bench, a new benchmark designed to evaluate Large Language Model (LLM) agents in the domain of cyber threat investigation. This benchmark utilizes security logs from a controlled Azure tenant, including Microsoft Sentinel data, to construct threat investigation graphs. The system generates questions based on these graphs, providing explainable ground truth answers and allowing for extensibility to new log types. Current evaluations show that even the best-performing models achieve a score of only 0.606, indicating significant room for improvement on this challenging task.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a new evaluation framework for LLM agents in cybersecurity, highlighting current performance limitations and future research directions.
RANK_REASON This is a research paper introducing a new benchmark for evaluating LLM agents on a specific task.