LLM judges
PulseAugur coverage of LLM judges: every cluster mentioning LLM judges across labs, papers, and developer communities, ranked by signal.
-
New framework audits LLM judge rubrics for reliability and robustness
Researchers have developed PReMISE, a framework designed to evaluate the effectiveness of rubrics used by Large Language Model (LLM) judges. The framework treats rubrics as measurement specifications, analyzing their st…
-
LLM judges show rationalization bias, new framework reveals
Researchers have developed a causal framework to analyze rationalization bias in large language models (LLMs) when they act as judges for text evaluation. The study introduces new metrics and cue interventions to test i…
-
New framework tackles preference cycles in AI feedback
Researchers have developed a new framework called Topological Consensus Rewards (TCR) to improve the stability of Reinforcement Learning from AI Feedback (RLAIF). This method addresses the issue of preference cycles, wh…
-
New benchmark reveals LLM judges unreliable for research agents
Researchers have developed a new benchmark called REFLECT to evaluate the reliability of Large Language Models (LLMs) when used as judges for deep research agents. These agents automate complex information-seeking tasks…
-
LLM judges evaluate agentic stock predictors, improving accuracy via reinforcement learning
Researchers have developed a novel framework for evaluating agentic stock prediction systems by utilizing large language models as judges. This system breaks down performance into six specific dimensions, including regi…