Prioritizing Systematic Evaluation Over Manual Engineering Construction

Original Title: How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

The biggest barrier to AI adoption in engineering is not model capability. It is the human refusal to shift from doing to defining. Ankur Goyal shows that the most valuable work for senior engineers is no longer writing code, but building the rigorous, automated feedback loops that allow agents to outperform manual effort. This requires moving from vibe-based development to systematic evaluation, effectively turning human expertise into code. For engineering leaders, the competitive advantage lies in treating CI/CD and evaluation pipelines as the primary product, not secondary support. Those who embrace this shift move from manual construction to high-level architectural carving, while those who cling to manual oversight will be outpaced by the volume of rigorous benchmarking their competitors can now execute.

The shift from construction to curation

The most common failure in AI-integrated engineering is treating agents as junior developers rather than as exhaustive researchers. Goyal argues that the agent line, the threshold below which tasks are delegated to agents, is rising. An agent can now perform more rigorous benchmarking than any single human engineer: where a human might run three benchmarks and guess the rest, an agent can run thousands of iterations across different database indexes and execution engines.
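To make the idea of an exhaustive sweep concrete, here is a minimal sketch of a benchmark harness that times every combination of two configuration axes. The axis values ("btree", "hash", "row", "columnar") and the run_workload stand-in are assumptions for illustration only; the episode does not describe a specific harness, and a real one would issue queries against an actual database.

```python
import itertools
import statistics
import time

# Hypothetical configuration axes; placeholders, not real database settings.
INDEX_TYPES = ["btree", "hash"]
ENGINES = ["row", "columnar"]


def run_workload(index_type, engine, rows=500):
    """Stand-in workload; a real harness would run a query here."""
    data = list(range(rows))
    # "hash" uses set membership; anything else falls back to a list scan.
    lookup = set(data) if index_type == "hash" else data
    # "columnar" touches every fourth key to mimic reading fewer values.
    step = 1 if engine == "row" else 4
    return sum(1 for key in range(0, rows, step) if key in lookup)


def benchmark(repeats=5):
    """Time every configuration and return results sorted by median runtime."""
    results = []
    for index_type, engine in itertools.product(INDEX_TYPES, ENGINES):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_workload(index_type, engine)
            timings.append(time.perf_counter() - start)
        results.append({
            "index": index_type,
            "engine": engine,
            "median_s": statistics.median(timings),
        })
    return sorted(results, key=lambda r: r["median_s"])
```

Calling benchmark() returns one entry per configuration, fastest first. The point of the sketch is the shape of the loop: adding a value to either axis widens the sweep without adding human effort, which is the kind of work the text suggests handing to an agent.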

"There is no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who is using an agent."

-- Ankur Goyal

This creates a systemic advantage: you are no longer limited by human attention span, which decays during tedious, high-stakes tasks. By offloading these to background agents, engineering teams can explore architectural shifts that were previously considered too risky or time-consuming to validate.

Evals as the modern PRD

Traditional software development relies on Product Requirement Documents (PRDs) that are often interpreted loosely. In the AI era, this ambiguity is a liability. Goyal posits that an eval is simply a modernized, quantified PRD. By encoding what good looks like into a scoring function, you force clarity upon the system.
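As an illustration of encoding what good looks like into a scoring function, the sketch below scores an output on two criteria: whether it mentions the required terms and whether it stays within a word limit. The Case fields, the 0.5 length penalty, and the function names are assumptions chosen for this example; they are not Braintrust's API or Goyal's own scoring rules.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One eval case. All fields are hypothetical, chosen for illustration."""
    prompt: str
    output: str
    must_include: list[str]
    max_words: int


def score(case: Case) -> float:
    """Return a score in [0, 1] for a single output."""
    text = case.output.lower()
    if case.must_include:
        hits = sum(1 for term in case.must_include if term.lower() in text)
        term_score = hits / len(case.must_include)
    else:
        term_score = 1.0
    # Assumed rule: halve the score when the output exceeds the word limit.
    length_factor = 1.0 if len(case.output.split()) <= case.max_words else 0.5
    return term_score * length_factor


def run_eval(cases: list[Case]) -> float:
    """Average the per-case scores into one number that can be tracked over time."""
    return sum(score(c) for c in cases) / len(cases)
```

Each requirement that a PRD would state in prose becomes a term in the score, so a vague requirement shows up as a line nobody can write. The criteria here are deliberately crude; the value is in having a number that can be compared between runs.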

The hidden consequence of this approach is that it allows for the scaling of individual taste. When a senior engineer or designer encodes their aesthetic or technical standards into an eval, those standards are applied to every single output, rather than just the ones the expert has time to review.

"We are able to have David's palate applied to more things. I think the quality bar that we are able to hit is higher because we are able to get more things to that bar."

-- Ankur Goyal

This creates a compounding feedback loop: the system learns the expert's preferences, the expert refines the system based on its failures, and the quality bar rises across the entire organization simultaneously.

The CI/CD bottleneck

Many teams struggle to integrate AI because their underlying infrastructure, specifically their CI/CD pipelines, is not built to handle the velocity that AI enables. Goyal suggests that if you feel constrained, the solution is not to push harder on feature development, but to pause and invest in CI.

This is an unpopular but durable investment. When CI is robust, it earns the team the right to move faster. Without it, you are simply shipping poor work at a higher frequency. Robust CI also makes carving possible: removing unnecessary features and complexity with confidence, rather than only constructing more code.
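One way to connect evals to CI is a gate that fails the build when the eval score regresses past a tolerance. The sketch below is an assumption about how such a gate could look, not a description of Braintrust's pipeline; the function names, the 0.02 default tolerance, and the exit codes are all hypothetical.

```python
def gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass when the current eval score is within `tolerance` of the baseline."""
    return current >= baseline - tolerance


def main(argv: list[str]) -> int:
    """Return a process exit code: 0 pass, 1 regression, 2 bad usage."""
    if len(argv) != 3:
        print("usage: gate.py CURRENT_SCORE BASELINE_SCORE")
        return 2
    current, baseline = float(argv[1]), float(argv[2])
    if gate(current, baseline):
        print(f"PASS: {current:.3f} vs baseline {baseline:.3f}")
        return 0
    print(f"FAIL: {current:.3f} is below baseline {baseline:.3f}")
    return 1


# In a CI step, this would typically be wired up as:
#     import sys
#     sys.exit(main(sys.argv))
```

A non-zero exit code is what most CI systems treat as a failed step, so a change that lowers the eval score is blocked regardless of whether a human or an agent wrote it.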

Key action items

  • Audit your agent line: Identify three recurring decisions or manual processes you perform weekly. Over the next quarter, build an agent-based workflow to handle these, forcing yourself to define the what rather than the how.
  • Invest in CI/CD infrastructure: Treat your CI/CD pipeline as your most critical product. If your deployment process is slow, AI-generated code will only create technical debt faster. This pays off in 6 to 12 months by enabling consistent, high-velocity shipping.
  • Build your first eval: Stop relying on vibe checks. Encode your current success criteria into a simple scoring function. Even if it feels imperfect, it creates a baseline for improvement that you can iterate on over the next month.
  • Adopt remote/cloud development: If your agent workflows are causing local hardware bottlenecks, move them to remote cloud environments. This is an immediate investment that prevents the noise and context-switching that kill flow states.
  • The reset protocol: When an agent produces 3,000 lines of trash, do not try to patch it. Close the session, manually write the core logic yourself to regain deep context, and then re-introduce the agent to handle the scale. This is a short-term discomfort that yields a higher-quality, more maintainable system.

---
This content is a personally curated review and synopsis derived from the original podcast episode.