PulseAugur

HumanEval

PulseAugur coverage of HumanEval — every cluster mentioning HumanEval across labs, papers, and developer communities, ranked by signal.

Total · 30d: 15 (15 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 13 (13 over 90d)
TIER MIX · 90D · [chart]
SENTIMENT · 30D · [chart] (2 days with sentiment data)

RECENT · PAGE 1/1 · 13 TOTAL
  1. RESEARCH · CL_28917 ·

    New RL method teaches LLMs to self-correct answers

    Researchers have developed SCoRe, a novel two-stage reinforcement learning technique that enables language models to refine their own responses using self-generated data. This method significantly improves performance o…

  2. TOOL · CL_27577 ·

    Neuroevolution framework boosts LLM output diversity via prompt embedding evolution

    Researchers have developed QD-LLM, a novel framework that uses parameter-efficient neuroevolution to enhance the diversity of outputs from large language models. This method evolves compact prompt embeddings, which act …
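    Evolving prompt embeddings for diversity is a quality-diversity search. A minimal MAP-Elites-style sketch of that idea, with a toy fitness function and a bucketed behavior descriptor standing in for the paper's LLM-based quality and diversity measures (all names here are illustrative, not QD-LLM's actual API):

    ```python
    import random

    DIM, CELLS, GENS = 8, 10, 300

    def fitness(v):
        # Toy quality score standing in for an LLM output-quality metric.
        return -sum(x * x for x in v)

    def descriptor(v):
        # Toy behavior descriptor standing in for an output-diversity measure:
        # bucket the mean coordinate into one of CELLS niches.
        m = sum(v) / len(v)
        return max(0, min(CELLS - 1, int((m + 1.0) / 2.0 * CELLS)))

    def mutate(v, sigma=0.1):
        return [x + random.gauss(0, sigma) for x in v]

    random.seed(0)
    archive = {}  # niche id -> (embedding, fitness)
    for _ in range(GENS):
        if archive:
            parent, _ = random.choice(list(archive.values()))
            child = mutate(parent)
        else:
            child = [random.uniform(-1.0, 1.0) for _ in range(DIM)]
        d, f = descriptor(child), fitness(child)
        if d not in archive or f > archive[d][1]:
            archive[d] = (child, f)

    print(len(archive))  # distinct behavior niches filled
    ```

    The archive keeps the best embedding per behavior niche, so the population stays diverse by construction rather than collapsing onto a single optimum.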

  3. SIGNIFICANT · CL_22783 ·

    OpenAI's GPT-5.5 prioritizes reliability for production AI agents over benchmarks

    OpenAI has released GPT-5.5, which reportedly excels not in benchmark scores but in practical reliability for complex tasks. The new model demonstrates significantly improved instruction following, reduced hallucination…

  4. COMMENTARY · CL_20705 ·

    AI models: Choose benchmarks over hype for true performance

    A recent analysis highlights that tech companies often select AI models based on hype rather than performance on relevant benchmarks. The article emphasizes that benchmarks like SWE-bench for coding, Terminal-Bench for …

  5. TOOL · CL_18865 ·

    ReCode framework enhances AI code generation by rewarding reasoning processes

    Researchers have developed ReCode, a novel reinforcement learning framework designed to improve code generation by focusing on the reasoning process. This framework uses Contrastive Reasoning-Process Reward Learning (CR…

  6. RESEARCH · CL_15893 ·

    MolViBench benchmark evaluates LLMs on molecular coding tasks for drug discovery

    Researchers have introduced MolViBench, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in molecular coding tasks. This benchmark addresses the gap left by existing evaluations, w…

  7. RESEARCH · CL_13613 ·

    Vintage AI trained on 1930s data learns to code and fix software bugs

    Researchers have fine-tuned a large language model, Talkie-1930-13B, trained only on data predating 1931, to perform software engineering tasks. Despite its limited knowledge base, the model successfully patched a bug i…

  8. RESEARCH · CL_11738 ·

    BoostLoRA method grows adapter rank to surpass full fine-tuning

    Researchers have introduced BoostLoRA, a novel parameter-efficient fine-tuning method designed to enhance model expressivity without increasing inference overhead. This technique iteratively trains and merges small adap…
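    The merge step behind "grow rank without inference overhead" can be sketched generically: each round, a small low-rank adapter is trained and its update folded into the frozen weight, so repeated rounds raise the effective rank while the served model stays a single matrix. This is a plain-Python sketch of that iterative merge pattern, not BoostLoRA's exact algorithm; the random matrices stand in for trained adapters:

    ```python
    import random

    def matmul(A, B):
        # Plain-Python matrix multiply (shapes: m×k @ k×n).
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

    def merge_adapter(W, B, A, scale=1.0):
        # Fold a rank-r update scale * (B @ A) into the frozen weight W.
        # After merging, inference uses W alone: no extra adapter cost.
        BA = matmul(B, A)
        return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

    random.seed(0)
    d, r, rounds = 4, 1, 3
    W = [[0.0] * d for _ in range(d)]
    for _ in range(rounds):
        # Stand-in for one training round: a freshly "trained" rank-1 adapter.
        B = [[random.gauss(0, 0.1)] for _ in range(d)]
        A = [[random.gauss(0, 0.1) for _ in range(d)]]
        W = merge_adapter(W, B, A)

    # Three merged rank-1 updates give W an effective rank of up to 3,
    # though each round only trained r=1 adapter parameters.
    print(len(W), len(W[0]))
    ```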

  9. RESEARCH · CL_07050 ·

    Researchers generate verifiable code reasoning data to boost LLM performance

    Researchers have developed a new method to generate verifiable Chain-of-Thought (CoT) rationales for code reasoning by instrumenting code to capture execution traces. This pipeline narrates these traces into natural lan…
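    The instrumentation step the summary describes can be approximated in Python with `sys.settrace`, which records each executed line and the local variables at that point; the narration step (turning the raw trace into a rationale) is only a print loop here. A minimal sketch, with `trace_lines` and `sum_to` as hypothetical names:

    ```python
    import sys

    def trace_lines(func, *args):
        """Run func, recording (line number, locals snapshot) at each executed line."""
        events = []
        def tracer(frame, event, arg):
            if event == "line" and frame.f_code is func.__code__:
                events.append((frame.f_lineno, dict(frame.f_locals)))
            return tracer
        sys.settrace(tracer)
        try:
            result = func(*args)
        finally:
            sys.settrace(None)
        return result, events

    def sum_to(n):
        total = 0
        for i in range(1, n + 1):
            total += i
        return total

    result, events = trace_lines(sum_to, 3)
    # Narrate the raw trace into step-by-step text (a crude stand-in for
    # the pipeline's rationale-generation stage).
    for lineno, locs in events:
        print(f"line {lineno}: locals = {locs}")
    print("result:", result)
    ```

    Because the trace comes from actually executing the code, every narrated step is verifiable against the recorded program state.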

  10. RESEARCH · CL_06927 ·

    Think Anywhere in Code Generation

    Researchers have introduced "Think-Anywhere," a new reasoning mechanism for large language models that allows them to generate code by thinking at any point during the process, rather than just upfront. This approach ha…

  11. RESEARCH · CL_05211 ·

    Language agents use auction to cut communication costs and boost reasoning

    Researchers have developed a new framework called DALA (Dynamic Auction-based Language Agent) to improve communication efficiency in multi-agent systems powered by large language models. This system treats communication…
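    The core economics can be sketched as a sealed-bid, first-price auction for the right to speak: each agent bids its self-assessed message value, and only the winner broadcasts, so a round costs one message instead of N. A toy sketch assuming random bids in place of DALA's learned valuations (class and function names are illustrative):

    ```python
    import random

    class Agent:
        def __init__(self, name):
            self.name = name

        def bid(self, task):
            # Stand-in for a self-assessed estimate of how useful this agent's
            # next message would be (a real system might query the LLM itself).
            return random.random()

    def auction_round(agents, task):
        # Sealed-bid, first-price auction: only the highest bidder broadcasts,
        # cutting per-round communication from N messages to 1.
        bids = {a.name: a.bid(task) for a in agents}
        winner = max(bids, key=bids.get)
        return winner, bids

    random.seed(1)
    agents = [Agent(f"agent{i}") for i in range(4)]
    winner, bids = auction_round(agents, task="plan next step")
    print("speaker:", winner)
    ```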

  12. FRONTIER RELEASE · CL_01024 ·

    OpenAI launches affordable GPT-4o mini and open-weight gpt-oss models

    OpenAI has released GPT-4o mini, a new, highly cost-efficient small model designed to broaden AI accessibility and application development. This model demonstrates superior performance on benchmarks like MMLU, MGSM, and…

  13. COMMENTARY · CL_01323 ·

    How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

    Current methods for evaluating large language models, such as MMLU and HumanEval, may be insufficient as they do not capture the nuances of interactive, goal-oriented conversations. A more effective approach would invol…