HumanEval
PulseAugur coverage of HumanEval — every cluster mentioning HumanEval across labs, papers, and developer communities, ranked by signal.
2 days with sentiment data
-
New RL method teaches LLMs to self-correct answers
Researchers have developed SCoRe, a novel two-stage reinforcement learning technique that enables language models to refine their own responses using self-generated data. This method significantly improves performance o…
-
Neuroevolution framework boosts LLM output diversity via prompt embedding evolution
Researchers have developed QD-LLM, a novel framework that uses parameter-efficient neuroevolution to enhance the diversity of outputs from large language models. This method evolves compact prompt embeddings, which act …
-
OpenAI's GPT-5.5 prioritizes reliability for production AI agents over benchmarks
OpenAI has released GPT-5.5, which reportedly excels not in benchmark scores but in practical reliability for complex tasks. The new model demonstrates significantly improved instruction following, reduced hallucination…
-
AI models: Choose benchmarks over hype for true performance
A recent analysis highlights that tech companies often select AI models based on hype rather than performance on relevant benchmarks. The article emphasizes that benchmarks like SWE-bench for coding, Terminal-Bench for …
-
ReCode framework enhances AI code generation by rewarding reasoning processes
Researchers have developed ReCode, a novel reinforcement learning framework designed to improve code generation by focusing on the reasoning process. This framework uses Contrastive Reasoning-Process Reward Learning (CR…
-
MolViBench benchmark evaluates LLMs on molecular coding tasks for drug discovery
Researchers have introduced MolViBench, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in molecular coding tasks. This benchmark addresses the gap left by existing evaluations, w…
-
Vintage AI trained on 1930s data learns to code and fix software bugs
Researchers have fine-tuned a large language model, Talkie-1930-13B, trained only on data predating 1931, to perform software engineering tasks. Despite its limited knowledge base, the model successfully patched a bug i…
-
BoostLoRA method grows adapter rank to surpass full fine-tuning
Researchers have introduced BoostLoRA, a novel parameter-efficient fine-tuning method designed to enhance model expressivity without increasing inference overhead. This technique iteratively trains and merges small adap…
-
Researchers generate verifiable code reasoning data to boost LLM performance
Researchers have developed a new method to generate verifiable Chain-of-Thought (CoT) rationales for code reasoning by instrumenting code to capture execution traces. This pipeline narrates these traces into natural lan…
-
Think-Anywhere lets LLMs reason at any point during code generation
Researchers have introduced "Think-Anywhere," a new reasoning mechanism for large language models that allows them to generate code by thinking at any point during the process, rather than just upfront. This approach ha…
-
Language agents use auction to cut communication costs and boost reasoning
Researchers have developed a new framework called DALA (Dynamic Auction-based Language Agent) to improve communication efficiency in multi-agent systems powered by large language models. This system treats communication…
-
OpenAI launches affordable GPT-4o mini and open-weight gpt-oss models
OpenAI has released GPT-4o mini, a new, highly cost-efficient small model designed to broaden AI accessibility and application development. This model demonstrates superior performance on benchmarks like MMLU, MGSM, and…
-
How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs
Current methods for evaluating large language models, such as MMLU and HumanEval, may be insufficient as they do not capture the nuances of interactive, goal-oriented conversations. A more effective approach would invol…