A new study examined the verbal confidence of seven instruction-tuned, open-weight large language models (LLMs) in the 3-9 billion parameter range. The researchers found that these models failed to meet minimal validity criteria for expressing uncertainty: all seven were deemed invalid under numeric confidence elicitation. Attempts to improve confidence reporting via categorical elicitation instead disrupted task performance in most models, driving accuracy below 5%. The study concludes that current verbal confidence elicitation methods are insufficient for capturing internal uncertainty signals in models of this size.
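The two elicitation formats contrasted in the study can be illustrated with a minimal sketch. The prompt wording, function names, and parsing logic below are illustrative assumptions, not the study's actual protocol: numeric elicitation asks for a 0-100 confidence score, while categorical elicitation asks the model to pick from a fixed label set.

```python
import re

def numeric_confidence_prompt(question: str) -> str:
    """Numeric elicitation: ask for an integer confidence from 0 to 100.
    (Illustrative wording; not the prompt used in the study.)"""
    return (
        f"{question}\n"
        "After your answer, state your confidence in it as an integer "
        "from 0 to 100 on a line starting with 'Confidence:'."
    )

def categorical_confidence_prompt(question: str) -> str:
    """Categorical elicitation: ask for one label from a fixed set."""
    return (
        f"{question}\n"
        "After your answer, state your confidence as one of "
        "low, medium, or high on a line starting with 'Confidence:'."
    )

def parse_numeric_confidence(reply: str):
    """Extract a stated 0-100 confidence from a model reply, or return
    None when the model did not follow the requested format."""
    m = re.search(r"Confidence:\s*(\d{1,3})", reply)
    if m:
        value = int(m.group(1))
        if 0 <= value <= 100:
            return value
    return None
```

A parser like `parse_numeric_confidence` also makes the failure mode visible: replies that ignore or mangle the requested format yield `None`, which is one way elicitation attempts can be scored as invalid.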
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights limitations in current LLM confidence reporting, suggesting a need for improved methods before downstream use.
RANK_REASON Academic paper detailing experimental findings on LLM confidence elicitation.