PulseAugur

Study finds 3-9B LLMs fail verbal confidence validity tests, undermining their uncertainty estimates

A new study examined the verbal confidence of seven instruction-tuned, open-weight large language models (LLMs) with 3-9 billion parameters. The researchers found that these models failed to meet minimal validity criteria for expressing uncertainty, with all models judged invalid under numeric confidence elicitation. Attempts to improve confidence reporting via categorical elicitation disrupted task performance in most models, driving accuracy below 5%. The study suggests that current methods of verbal confidence elicitation are insufficient for capturing internal uncertainty signals in models of this size.

Summary written by gemini-2.5-flash-lite from 2 sources.
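For context on what was tested: "verbal confidence elicitation" means prompting the model to state its own confidence alongside its answer, either as a number (numeric elicitation) or as a label such as low/medium/high (categorical elicitation). Below is a minimal sketch of both prompt styles; the `generate` placeholder and the exact prompt wording are illustrative assumptions, not the paper's pre-registered protocol.

```python
# Sketch of the two verbal confidence elicitation styles the study compares.
# `generate` stands in for any chat-completion call to a local 3-9B
# instruction-tuned model; the prompt wording here is illustrative only.
import re

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; wire up to your model of choice."""
    raise NotImplementedError

def numeric_confidence(question: str) -> tuple[str, float | None]:
    """Numeric elicitation: ask for a 0-100 confidence after the answer."""
    prompt = (
        f"{question}\n"
        "Answer, then on a new line write 'Confidence: X', "
        "where X is a number from 0 to 100."
    )
    reply = generate(prompt)
    match = re.search(r"Confidence:\s*(\d{1,3})", reply)
    confidence = float(match.group(1)) / 100 if match else None
    return reply, confidence

# Assumed label-to-probability mapping; the paper may use different bins.
CATEGORIES = {"low": 0.25, "medium": 0.5, "high": 0.9}

def categorical_confidence(question: str) -> tuple[str, float | None]:
    """Categorical elicitation: ask for a low/medium/high label instead."""
    prompt = (
        f"{question}\n"
        "Answer, then on a new line write 'Confidence: low', "
        "'Confidence: medium', or 'Confidence: high'."
    )
    reply = generate(prompt)
    match = re.search(r"Confidence:\s*(low|medium|high)", reply, re.IGNORECASE)
    confidence = CATEGORIES[match.group(1).lower()] if match else None
    return reply, confidence
```

A parse failure (`None` confidence) is one way elicitation can silently break; the summary's "accuracy below 5%" finding points to a harsher failure, where asking small models for a category disrupts the answer itself.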

IMPACT Highlights limitations in current LLM confidence reporting, suggesting a need for improved methods before downstream use.

RANK_REASON Academic paper detailing experimental findings on LLM confidence elicitation.

Read on arXiv cs.CL →

COVERAGE [1]

  1. arXiv cs.CL TIER_1 · Jon-Paul Cacioli ·

    Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

    arXiv:2604.22215v1 · Abstract: Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimi…
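The abstract's "item-level Type-2 discrimination" refers to whether a model's stated confidence separates its correct answers from its incorrect ones. A standard screen for this is the Type-2 AUROC: treat per-item correctness as the label and the verbalized confidence as the score. The sketch below is a generic illustration of that check under those assumptions, not the paper's pre-registered analysis.

```python
# Generic Type-2 discrimination check: does verbalized confidence rank a
# model's correct answers above its incorrect ones? An AUROC of 0.5 means
# the confidence carries no discrimination signal at all.
from sklearn.metrics import roc_auc_score

def type2_auroc(confidences: list[float], correct: list[bool]) -> float:
    """AUROC with per-item correctness as label, confidence as score."""
    return roc_auc_score(correct, confidences)

# Toy data: confidence saturated near the top of the scale, so it barely
# separates right from wrong answers, the failure mode the paper's title
# ("Verbal Confidence Saturation") points at.
confidences = [0.9, 0.9, 0.8, 0.9, 0.9, 0.8]
correct = [True, False, True, True, False, False]
print(f"Type-2 AUROC: {type2_auroc(confidences, correct):.2f}")  # 0.50
```

On this toy data the AUROC is exactly 0.5: the saturated confidences are uninformative about correctness, which is the kind of result a validity screen like the paper's would flag as invalid.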