PulseAugur

ProgramBench benchmark finds language models struggle to build software from scratch

Researchers have introduced ProgramBench, a new benchmark that evaluates the end-to-end software development capabilities of language models. Each task asks an AI agent to architect and implement an entire codebase from scratch, given only the target program's documentation. Across 200 tasks, which include reimplementing software such as FFmpeg and SQLite, none of the nine evaluated models fully completed any task, and the best model passed only 3% of tests on average.
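
The headline metric described above is a per-task test pass rate: run the task's reference test suite against whatever codebase the agent produced and count how many tests pass. The paper's actual harness is not shown in the sources; as a rough sketch of how such a metric could be computed (the tests/ directory layout, the use of pytest, and the test_pass_rate helper are all assumptions for illustration), a scorer for one task might look like:

```python
import re
import subprocess
from pathlib import Path


def test_pass_rate(repo_dir: Path) -> float:
    """Run a task's test suite against a model-generated codebase and
    return the fraction of tests that pass (0.0 if nothing runs).

    Illustrative sketch only; not ProgramBench's actual harness.
    """
    proc = subprocess.run(
        ["python", "-m", "pytest", "-q", str(repo_dir / "tests")],
        capture_output=True,
        text=True,
    )
    # Parse pytest's terse summary line, e.g. "6 passed, 194 failed in 12.3s".
    counts = {kind: int(n) for n, kind in
              re.findall(r"(\d+) (passed|failed|error)", proc.stdout)}
    total = sum(counts.values())
    return counts.get("passed", 0) / total if total else 0.0
```

Under this kind of scoring, "passing only 3% of tests on average" means the generated codebases run just well enough for a small slice of each suite to succeed, even though no task is ever fully completed.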

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Highlights current limitations of LLMs on complex, end-to-end software engineering tasks and suggests substantial further research is needed before models can autonomously build complete codebases.

RANK_REASON This is a research paper introducing a new benchmark for evaluating language models in software development.

Read on arXiv cs.AI →

COVERAGE [2]

  1. arXiv cs.AI TIER_1 · John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

    ProgramBench: Can Language Models Rebuild Programs From Scratch?

    arXiv:2605.03546v1 · Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such set…

  2. arXiv cs.AI TIER_1 · Ofir Press

    ProgramBench: Can Language Models Rebuild Programs From Scratch?

    Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software a…