A new research paper investigates the reliability of large language models (LLMs) for mental health screening, specifically their ability to estimate anxiety and depression scores from speech. The study evaluated three LLMs—Phi-4, Gemma-2-9B, and Llama-3.1-8B—assessing their consistency, robustness to automatic speech recognition (ASR) errors, and faithfulness to evidence. While Phi-4 and Gemma-2-9B demonstrated strong consistency and maintained predictive validity even with ASR errors, Llama-3.1-8B showed significant degradation in consistency when faced with higher word error rates.
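The robustness results above are framed in terms of word error rate (WER), the standard metric for ASR transcript quality. As a point of reference (the paper's exact tooling is not specified here), WER is the word-level edit distance between the reference transcript and the ASR hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word utterance (a hypothetical transcript, not from the study) gives `wer("i feel anxious today", "i fear anxious today")` = 0.25; the study's finding is that Llama-3.1-8B's score consistency degrades as this rate rises, while Phi-4 and Gemma-2-9B remain comparatively stable.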
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights the need for robust LLM evaluation before clinical deployment in sensitive areas like mental health.
RANK_REASON Academic paper evaluating LLM performance on a specific task.