Evolving AI Coding Benchmarks Toward Long-Horizon Development and Collaboration

TL;DR

  • Code evaluation benchmarks like SWE-bench are evolving beyond simple unit tests to long-horizon development tournaments (CodeClash) where agents maintain and improve codebases over multiple rounds, simulating real-world software engineering challenges.
  • The proliferation of SWE-bench variants and specialized benchmarks (Tau-bench, Terminal-bench) indicates a shift towards evaluating AI coding agents on diverse tasks, languages, and environments, moving beyond initial Django-centric evaluations.
  • Including "impossible tasks" in benchmarks like Tau-bench is a feature, not a bug, as it helps flag potential cheating by AI agents and drives research into more robust evaluation methodologies.
  • Terminal-bench unlocks creativity by allowing non-coders to design diverse environments beyond GitHub issues, fostering innovation in AI coding agent task design and evaluation.
  • The academic research community faces a data challenge, needing compelling products or user simulators to gather rich user interaction data comparable to that held by commercial entities like Cognition and Cursor.
  • CodeClash aims to be a testbed for human-AI collaboration, allowing researchers to vary interaction setups (solo agent, multi-agent, human+agent) and measure how collaboration patterns change with model capabilities.
  • SWE-Efficiency benchmarks focus on optimizing code for speed without altering behavior, using techniques like parallelization and SIMD operations, addressing a critical aspect of performance engineering.

Deep Dive

The landscape of evaluating AI coding agents is rapidly expanding beyond simple task completion, driven by the need to assess long-horizon development, complex interactions, and real-world utility. As benchmarks like SWE-bench evolve, they are revealing critical tensions between standardized testing and the dynamic nature of software engineering, necessitating new paradigms for assessing AI capabilities in increasingly sophisticated development environments.

SWE-bench, released in October 2023, lay largely dormant until the launch of Cognition's Devin catalyzed an arms race in AI coding agents and drove widespread adoption of the benchmark. This led to a proliferation of SWE-bench variants, including SWE-bench Verified, SWE-bench Multimodal, and SWE-bench Multilingual, the last of which expanded evaluation to nine languages across 40 repositories, moving beyond the original Django-heavy focus. However, the emergence of independent benchmarks like SWE-bench Pro, which uses the SWE-bench name without direct affiliation, highlights the decentralized nature of this field and the challenge of maintaining a single standard.

A significant shift is occurring with the development of benchmarks like CodeClash, which moves beyond independent unit tests to evaluate long-horizon development. CodeClash pits AI agents against each other in programming tournaments where they maintain and improve their own codebases over multiple rounds. This approach is designed to assess consequential and conditional development, where an agent's actions in one round impact subsequent rounds. Initial arenas include programming games like Halite and are evolving towards economically valuable scenarios, aiming to capture more real-world utility than traditional issue-and-pull-request-based evaluations.
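
The CodeClash harness itself is not reproduced here, but its round-based structure can be sketched in a few lines. Everything below, including the `edit_codebase` and `run_arena` stand-ins and the toy "match" logic, is an illustrative assumption rather than the benchmark's actual code.

```python
# Hypothetical sketch of a round-based tournament in the spirit of CodeClash.
# Names, scoring, and the "arena" are illustrative stand-ins, not the real harness.
from dataclasses import dataclass

@dataclass
class Competitor:
    name: str
    codebase: str      # the agent's evolving repository (a string stub here)
    score: int = 0

def edit_codebase(agent: Competitor, round_log: str) -> str:
    """Stand-in: the agent revises its own codebase, conditioned on earlier rounds."""
    return agent.codebase + f"\n# revision informed by: {round_log!r}"

def run_arena(a: Competitor, b: Competitor) -> str:
    """Stand-in for a competitive arena such as a programming game."""
    winner = a if len(a.codebase) >= len(b.codebase) else b
    winner.score += 1
    return f"{winner.name} won this round"

def tournament(a: Competitor, b: Competitor, rounds: int = 5) -> None:
    log = "round 0: no history yet"
    for _ in range(rounds):
        # Each round's edits depend on what happened before: the consequential,
        # conditional development CodeClash is built to measure.
        a.codebase = edit_codebase(a, log)
        b.codebase = edit_codebase(b, log)
        log = run_arena(a, b)

tournament(Competitor("agent_a", "# bot v0"), Competitor("agent_b", "# bot v0"))
```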

Beyond CodeClash, a "Cambrian explosion" of specialized code evaluation benchmarks is underway. SWE-Efficiency focuses on performance optimization without altering behavior, while others like AlgoTune, SciCode, Tau-bench, SEC-bench, and SRE-bench delve into scientific computing, security, and site reliability engineering. Tau-bench's controversial inclusion of "impossible tasks" is viewed by some as a feature rather than a bug, acting as a flag for cheating when agents report unrealistically high scores. Terminal-bench, in particular, has unlocked creativity by allowing a broader range of users, including non-coders, to design evaluation environments beyond standard GitHub issue formats, signaling a move towards more diverse and user-driven testing grounds.
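
To make the SWE-Efficiency idea above concrete, here is a toy example of the kind of edit it rewards: a Python loop rewritten with NumPy broadcasting so it runs faster while its existing test still passes. The function and test are invented for illustration and are not drawn from the benchmark.

```python
# Toy illustration of a behavior-preserving speedup (the optimization pattern
# SWE-Efficiency rewards); the function and test are invented, not from the benchmark.
import numpy as np

def pairwise_sq_dists_slow(x):
    """Reference implementation: O(n^2) Python loop."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = (x[i] - x[j]) ** 2
    return out

def pairwise_sq_dists_fast(x):
    """Vectorized rewrite: same outputs, much faster via NumPy broadcasting
    (SIMD-backed under the hood)."""
    a = np.asarray(x, dtype=float)
    return ((a[:, None] - a[None, :]) ** 2).tolist()

# The "unit test" that must keep passing: identical behavior, only faster.
data = [0.5 * i for i in range(50)]
assert pairwise_sq_dists_fast(data) == pairwise_sq_dists_slow(data)
```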

This evolution highlights a fundamental tension between long autonomy and interactivity in AI development. While benchmarks increasingly explore extended, unsupervised runs, companies like Cognition emphasize rapid, back-and-forth interaction, arguing that real-world development is often underspecified and requires human-AI dialogue. This raises questions about the practical utility of purely autonomous agents versus those designed for collaborative workflows. The future of code evaluation likely lies in accommodating diverse interaction patterns, with benchmarks serving as testbeds for human-AI collaboration that vary parameters like model capability and collaboration setup to reveal how interaction dynamics change.
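
Read concretely, "varying the collaboration setup" amounts to an experiment grid: the same task suite run under different interaction modes and model tiers, with interaction metrics logged alongside resolution rates. The sketch below is an assumed framing with invented names, not an existing harness.

```python
# Assumed framing of a collaboration-testbed sweep: the same tasks run under
# different interaction modes and model tiers. All names here are invented.
from itertools import product

interaction_modes = ["solo_agent", "multi_agent", "human_plus_agent"]
model_tiers = ["small_model", "frontier_model"]
tasks = ["issue_42", "perf_tune_parser", "add_auth_tests"]   # placeholder task IDs

def run_trial(task: str, mode: str, model: str) -> dict:
    """Stand-in: would launch the benchmark under this configuration and record
    both the outcome and interaction metrics (turns, clarification requests)."""
    return {"task": task, "mode": mode, "model": model, "resolved": None}

results = [run_trial(t, m, k) for t, m, k in product(tasks, interaction_modes, model_tiers)]
print(len(results))   # 3 tasks x 3 modes x 2 tiers = 18 trials
```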

The critical need for rich user interaction data, which companies possess and academics often lack, presents a significant challenge. Academic research must either build compelling products that attract consistent user engagement or develop sophisticated user simulators to gather comparable data. This data is crucial for understanding and benchmarking human-AI interaction, a key frontier in advancing AI's role in software engineering. The ultimate goal is to develop AI systems that can augment human capabilities, enabling tasks that neither humans nor AI could accomplish alone, by fostering deeper codebase understanding and more effective context engineering for large language models.
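
One of the workarounds named above, user simulators, can be sketched concretely: a second model plays the user, answering the agent's clarifying questions from a hidden goal so that interaction data can be generated at scale. The loop below is a minimal sketch; the `chat` stub stands in for whichever model API a researcher would actually call.

```python
# Minimal sketch of an LLM "user simulator" in the spirit of tau-bench-style
# setups; `chat` is a stub standing in for a real model call.
def chat(system_prompt: str, history: list[str]) -> str:
    """Stub: swap in a call to whichever model provider you use."""
    return "simulated reply"

def simulate_session(hidden_user_goal: str, max_turns: int = 6) -> list[str]:
    user_system = (
        "You are simulating a software user. Your true goal is: "
        f"{hidden_user_goal}. Answer the agent's questions, but reveal details "
        "only when asked, the way a real underspecified request unfolds."
    )
    agent_system = "You are a coding agent. Ask clarifying questions, then propose a fix."

    transcript: list[str] = []
    for _ in range(max_turns):
        transcript.append("agent: " + chat(agent_system, transcript))
        transcript.append("user: " + chat(user_system, transcript))
    return transcript   # interaction data researchers could collect at scale

print(len(simulate_session("speed up the CSV export without changing its output")))
```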

Action Items

  • Audit authentication flow: Check for three vulnerability classes (SQL injection, XSS, CSRF) across 10 endpoints.
  • Create runbook template: Define the required sections (setup, common failures, rollback, monitoring) to prevent knowledge silos.
  • Implement mutation testing: Target 3 core modules to identify untested edge cases beyond coverage metrics (a concept sketch follows this list).
  • Profile build pipeline: Identify 5 slowest steps and establish 10-minute CI target to maintain fast feedback.
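
To make the mutation-testing item concrete: the idea is to introduce small deliberate bugs (mutants) and check whether the test suite catches them; a surviving mutant marks behavior the tests never exercise, which coverage numbers alone can miss. The snippet below hand-rolls the idea on an invented toy function rather than using any particular tool.

```python
# Hand-rolled illustration of mutation testing on a toy function; real projects
# would use a dedicated tool, but the principle is the same.
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

def test_suite(fn) -> bool:
    """A deliberately weak test suite: it never checks the upper bound."""
    return fn(5, 0, 10) == 5 and fn(-1, 0, 10) == 0

# Mutant: break the upper-bound handling.
def clamp_mutant(x, lo, hi):
    return max(lo, x)  # 'min(x, hi)' removed

assert test_suite(clamp)          # original passes
assert test_suite(clamp_mutant)   # mutant *also* passes -> it survives,
                                  # revealing an untested edge case (values above hi)
```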

Key Quotes

"We put it out October 2023 and then people didn't really touch it too much and then of course like cognition came on the scene and devin was an amazing release and i think after that it kind of kicked off the arms race."

John Yang explains that SWE-bench, initially released in October 2023, gained significant traction only after the launch of Cognition's Devin, which he describes as an "amazing release." This event, according to Yang, catalyzed an "arms race" in the field of AI coding agents.


"I don't like unit tests as a form of verification and i also think there's an issue with sweetbench where all of the task instances are independent of each other so the moment you have the model kind of submit it it's done you know and that's the end of the story end of the episode you know so with code clash what we're thinking is let's try to really evaluate like long horizon development and uh development on a codebase that is consequential and conditional upon what a model did you know before to that codebase."

John Yang expresses dissatisfaction with unit tests as a verification method and notes that SWE-bench's independent task instances limit evaluation to single submissions. He introduces CodeClash as a solution to evaluate "long horizon development" where a model's current work is dependent on its previous actions within a consequential codebase.


"So for a kind of the initial release for scientific purposes we kind of use existing programming games the current ongoing effort is you know to build economically valuable arenas that's you know the popular word these days so for freelancer is a big one this year yeah gdp val awesome yeah just uh i mean i think the big selling point of terminal bench and sweetbench and these evals is that it was really close to real world utility and so i think it's resolvable for code clash and that's what we're working on."

John Yang clarifies that while CodeClash initially used programming games for scientific purposes, the current focus is on developing "economically valuable arenas." He believes this approach aligns with the goal of real-world utility, a key selling point of benchmarks like Terminal-bench and SWE-bench.


"For sure um so sweet efficiency was wrote by this phd student called jeffrey ma who happened to be my high school classmate and the idea there was like you take a codebase and you just want to you know do modifications that will literally make the code run faster so i think this is like parallelization simd operations stuff like that so so no no behavior change just faster exactly okay if the unit test is passing but i want better runtime yeah yeah."

John Yang highlights SWE-Efficiency, developed by his high school classmate Jeffrey Ma, as a benchmark focused on performance optimization. The goal of SWE-Efficiency, according to Yang, is to modify code to run faster through techniques like parallelization and SIMD operations, without altering its behavior or failing unit tests.


"I think generally we all agree that you know we'll improve on these things over time for evals so i actually really like benchmarks that intentionally i think we should intentionally include impossible tasks as a flag yeah of like hey you're cheating yes it's kind of sad that like uh cog actually is defending it because the master move would be like oh yeah you caught us like that that was you know like everyone reporting above 75 on talbench retail uh you're cheating yeah oh interesting that would be that would be cool yeah."

John Yang expresses a preference for benchmarks that intentionally include impossible tasks, viewing them as a valuable flag to detect cheating. He suggests that acknowledging and flagging such instances, rather than defending them, would be a more effective approach for benchmark integrity.


"I think the vision of like hey i tell it a goal i don't have to be super specific about my tasks i have like a decent verifier that proxies what i want something literally like a codebase that makes the most money in this like setting you know like that's my verifier you know and i walk away for five hours the thing is just running i'm hanging out with you talking to my friends i come back and it gives me like literally a soda codebase on on that you know task i think that would be super cool."

John Yang articulates a compelling vision for long-running AI agents where a user specifies a high-level goal, and the agent autonomously develops a codebase to achieve it. He imagines a scenario where the agent's success is verified by a proxy, such as a codebase that maximizes profit in a given setting, allowing the user to be hands-off during the process.

Resources

External Resources

Games & Competitions

  • Halite (Michael Truell) - Mentioned as a programming competition game that involves controlling fleets of ships.

Articles & Papers

  • "SWE-bench verified" - Mentioned as a benchmark that uses verified issues and pull requests.
  • "Impossible Bench" - Mentioned as a benchmark that modifies SWE-bench verified issues to make them impossible to test model refusals.
  • "Code Wiki" - Mentioned as a resource created by Silas for understanding codebases.

People

  • Ofir Press - Mentioned as a prolific mentor in benchmarking and efficiency.
  • Carlos - Mentioned as a previous guest on the podcast.
  • Walden - Mentioned as someone who emailed about SWE-bench usage.
  • Andy - Mentioned as someone John Yang previously discussed Code Clash with.
  • Jeffrey Ma - Mentioned as the author of SWE-efficiency and a high school classmate of John Yang.
  • Diyi Yang - Mentioned as an advisor at Stanford who focuses on human-AI collaboration.
  • Shunyu Yao - Mentioned as someone John Yang worked with closely at Princeton.
  • Karthik Narasimhan - Mentioned as someone John Yang worked with closely at Princeton and who posted a tweet defending SWE-bench.
  • Silas - Mentioned as the creator of the "Code Wiki".

Organizations & Institutions

  • Cognition - Mentioned in relation to the Devin release and its impact on SWE-bench usage, and for pushing codebase understanding work.
  • Jane Street - Mentioned in relation to the game Halite.
  • Stanford - Mentioned as the institution where D Young is an advisor.
  • Anthropic - Mentioned as a potential source of the "Impossible Bench" benchmark.
  • Google - Mentioned as having developed its own version of a "Code Wiki".

Websites & Online Resources

  • SWE-bench - Mentioned as a project that is one and a half years old and was released in October 2023.
  • Devin - Mentioned as an amazing release that kicked off an "arms race" in AI coding agents.
  • SWE-bench Pro - Mentioned as an independent extension of SWE-bench.
  • SWE-bench Live - Mentioned as an extension of SWE-bench.
  • SWE-efficiency - Mentioned as a project focused on making code run faster through modifications like parallelization and SIMD operations.
  • AlgoTune - Mentioned as a project in line with SWE-Efficiency.
  • SciCode - Mentioned as a project in the scientific coding domain, described as "HumanEval but better."
  • SciCode 2 - Mentioned as an impressive project.
  • METR - Mentioned as an evaluation organization that uses SWE-bench and has interesting data on runtime and completion.
  • Terminal-bench - Mentioned as a project with creative potential for benchmarking, whose 2.0 release was excellent.
  • Critical point - Mentioned as a new benchmark related to physics.
  • SEC-bench - Mentioned as a benchmark related to cybersecurity.
  • SRE-bench - Mentioned as a benchmark for site reliability engineering.
  • Tau-bench - Mentioned in relation to user simulators, and as a benchmark with discussions about impossible tasks and underspecification.
  • Vending bench - Mentioned in relation to user simulator stuff.
  • LMArena - Mentioned as a compelling product for consistent user interaction.

Other Resources

  • Multimodal SWE-bench - Mentioned as an extension of SWE-bench.
  • Multilingual SWE-bench - Mentioned as an extension of SWE-bench, including languages like JavaScript, Rust, Java, C, and Ruby.
  • CodeClash - Mentioned as a project evaluating long-horizon development on consequential and conditional codebases, involving programming tournaments between language models.
  • Unit tests - Mentioned as a form of verification that John Yang dislikes.
  • Long-horizon development - Mentioned as a focus of CodeClash.
  • Consequential and conditional codebase development - Mentioned as a focus of CodeClash.
  • Programming tournament - Mentioned as the format for CodeClash.
  • LLM judge - Mentioned as one of the mechanisms used in CodeClash.
  • Economically valuable arenas - Mentioned as the focus for current efforts in CodeClash.
  • Freelancer - Mentioned as a big economic arena for CodeClash.
  • GDPval - Mentioned as an economic arena for CodeClash.
  • Real-world utility - Mentioned as a selling point of Terminal-bench and SWE-bench.
  • Performance optimization - Mentioned as a focus of SWE-efficiency.
  • Parallelization - Mentioned as a modification technique in SWE-efficiency.
  • SIMD operations - Mentioned as a modification technique in SWE-efficiency.
  • Scientific coding domain - Mentioned as the focus for SciCode.
  • HumanEval - Mentioned as a benchmark that SciCode is compared to.
  • Completions benchmarks - Mentioned as benchmarks that can be done well before graduating to multi-turn benchmarks.
  • Human hours worked - Mentioned as a metric used by METR.
  • Runtime - Mentioned as a metric used by METR.
  • Completion - Mentioned as a metric used by METR.
  • Physics - Mentioned as a domain related to the "Critical point" benchmark.
  • Cybersecurity - Mentioned as a domain related to the "SEC-bench" benchmark.
  • User simulator - Mentioned in relation to Towel and Towel bench.
  • Human-AI collaboration - Mentioned as a focus of D Young at Stanford and a potential area for future work.
  • Environments beyond code - Mentioned as something companies like Mercor are focusing on.
  • Work gym style stuff - Mentioned as a potential area for future development.
  • Long running SWE agent - Mentioned as a compelling personal interest for future development.
  • Interactivity - Mentioned as an emphasis by Cognition, contrasting with long autonomy.
  • Long autonomy - Mentioned as a trend in coding evaluations, with questions about its material impact on the industry.
  • Levels of abstraction - Mentioned as something to enable for the diverse developer ecosystem.
  • General data processing - Mentioned as a task where autonomy might be preferred.
  • JSON parsing - Mentioned as an example of a task for autonomous processing.
  • User interaction data - Mentioned as fascinating data from an academic standpoint.
  • Scaling up human-AI interaction evaluation - Mentioned as a challenge and area for inspiration.
  • Multi-agent collaboration - Mentioned as a potential future framing for CodeClash.
  • Human and agent collaboration - Mentioned as a potential future framing for CodeClash.
  • Model capability - Mentioned as a factor influencing human-AI interaction.
  • Codebase understanding - Mentioned as a focus for Cognition, involving codebase retrieval.
  • Codebase retrieval - Mentioned as part of codebase understanding.
  • Automatic context engineering for an LLM - Mentioned as a research sub-agent being worked on.
  • Codebase understanding benchmark - Mentioned as a difficult benchmark to create due to saturation issues.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.