Researchers have introduced ExCyTIn-Bench, a new benchmark designed to evaluate Large Language Model (LLM) agents in the domain of cyber threat investigation. This benchmark utilizes security logs from a controlled Azure tenant, including Microsoft Sentinel data, to construct threat investigation graphs. The system generates questions based on these graphs, providing explainable ground truth answers and allowing for extensibility to new log types. Current evaluations show that even the best-performing models achieve a score of only 0.606, indicating significant room for improvement on this challenging task.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a new evaluation framework for LLM agents in cybersecurity, highlighting current performance limitations and future research directions.
RANK_REASON This is a research paper introducing a new benchmark for evaluating LLM agents on a specific task.