Researchers have developed a new benchmark, LLM-S^3, to evaluate how well large language models can simulate human respondents in surveys. The benchmark comprises 11 real-world datasets spanning a range of sociological domains. Experiments with GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B showed consistent performance trends across models and highlighted how strongly prompt design affects simulation accuracy.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a new benchmark for evaluating LLMs' survey-simulation capabilities, which could improve data-collection methods in the social sciences.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs in survey simulation.