A new study examined the verbal confidence of seven instruction-tuned, open-weight large language models (LLMs) in the 3-9 billion parameter range. The researchers found that these models failed to meet minimal validity criteria for expressing uncertainty: all seven were deemed invalid under numeric confidence elicitation. Attempts to improve confidence reporting via categorical elicitation instead disrupted task performance in most models, driving accuracy below 5%. The study concludes that current verbal confidence elicitation methods are insufficient for capturing internal uncertainty signals in models of this size.
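The two elicitation formats contrasted in the study can be illustrated with a minimal sketch. The prompt wording, function names, and parsing logic below are illustrative assumptions, not the study's actual protocol: numeric elicitation asks for a 0-100 confidence score, while categorical elicitation asks the model to pick from a fixed label set.

```python
import re

def numeric_confidence_prompt(question: str) -> str:
    """Numeric elicitation: ask for an integer confidence from 0 to 100.
    (Illustrative wording; not the prompt used in the study.)"""
    return (
        f"{question}\n"
        "After your answer, state your confidence in it as an integer "
        "from 0 to 100 on a line starting with 'Confidence:'."
    )

def categorical_confidence_prompt(question: str) -> str:
    """Categorical elicitation: ask for one label from a fixed set."""
    return (
        f"{question}\n"
        "After your answer, state your confidence as one of "
        "low, medium, or high on a line starting with 'Confidence:'."
    )

def parse_numeric_confidence(reply: str):
    """Extract a stated 0-100 confidence from a model reply, or return
    None when the model did not follow the requested format."""
    m = re.search(r"Confidence:\s*(\d{1,3})", reply)
    if m:
        value = int(m.group(1))
        if 0 <= value <= 100:
            return value
    return None
```

A parser like `parse_numeric_confidence` also makes the failure mode visible: replies that ignore or mangle the requested format yield `None`, which is one way elicitation attempts can be scored as invalid.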
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights limitations in current LLM confidence reporting, suggesting a need for improved methods before downstream use.
RANK_REASON Academic paper detailing experimental findings on LLM confidence elicitation.