A new research paper investigates the reliability of large language models (LLMs) for mental health screening, specifically their ability to estimate anxiety and depression scores from speech. The study evaluated three LLMs—Phi-4, Gemma-2-9B, and Llama-3.1-8B—assessing their consistency, robustness to automatic speech recognition (ASR) errors, and faithfulness to evidence. While Phi-4 and Gemma-2-9B demonstrated strong consistency and maintained predictive validity even with ASR errors, Llama-3.1-8B showed significant degradation in consistency when faced with higher word error rates.
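The robustness results above are framed in terms of word error rate (WER), the standard metric for ASR transcript quality. As a point of reference (the paper's exact tooling is not specified here), WER is the word-level edit distance between the reference transcript and the ASR hypothesis, normalized by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word utterance (a hypothetical transcript, not from the study) gives `wer("i feel anxious today", "i fear anxious today")` = 0.25; the study's finding is that Llama-3.1-8B's score consistency degrades as this rate rises, while Phi-4 and Gemma-2-9B remain comparatively stable.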
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights the need for robust LLM evaluation before clinical deployment in sensitive areas like mental health.
RANK_REASON Academic paper evaluating LLM performance on a specific task.