Prioritizing Systematic Evaluation Over Manual Engineering Construction

Original Title: How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

How I AI · June 15, 2026

The biggest barrier to AI adoption in engineering is not model capability. It is the human refusal to shift from doing to defining. Ankur Goyal shows that the most valuable work for senior engineers is no longer writing code, but building the rigorous, automated feedback loops that allow agents to outperform manual effort. This requires moving from vibe-based development to systematic evaluation, effectively turning human expertise into code. For engineering leaders, the competitive advantage lies in treating CI/CD and evaluation pipelines as the primary product, not secondary support. Those who embrace this shift move from manual construction to high-level architectural carving, while those who cling to manual oversight will be outpaced by the volume of rigorous benchmarking their competitors can now execute.

The shift from construction to curation

The most common failure in AI-integrated engineering is the attempt to use agents as junior developers rather than as exhaustive researchers. Goyal argues that the agent line, the threshold of tasks we delegate, is rising. The reality is that AI can now perform more rigorous benchmarking than any single human engineer. While a human might run three benchmarks and guess the rest, an agent can run thousands of iterations across different database indexes and execution engines.

"There is no staff engineer who is running as many rigorous benchmarks and trying out different algorithms and analyzing ideas manually than someone who is using an agent."

-- Ankur Goyal

This creates a systemic advantage: you are no longer limited by human attention span, which decays during tedious, high-stakes tasks. By offloading these to background agents, engineering teams can explore architectural shifts that were previously considered too risky or time-consuming to validate.
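The kind of exhaustive sweep described above can be sketched in a few lines. This is a minimal illustration, not anything shown in the episode: it assumes an in-memory SQLite database, a made-up `events` table, and hypothetical index choices, and it times one query under every combination of index and table size. An agent running in the background would simply extend the grid.

```python
import sqlite3
import time
from itertools import product

# Hypothetical index configurations to compare (None means no index).
INDEX_OPTIONS = {
    "no_index": None,
    "index_on_user": "CREATE INDEX idx_user ON events(user_id)",
    "index_on_user_kind": "CREATE INDEX idx_user_kind ON events(user_id, kind)",
}
ROW_COUNTS = [1_000, 10_000]


def run_benchmark(index_sql, n_rows, repeats=50):
    """Build a fresh in-memory table, optionally index it, and time one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT, payload TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        ((i % 100, "click" if i % 2 else "view", "x") for i in range(n_rows)),
    )
    if index_sql:
        conn.execute(index_sql)
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(
            "SELECT COUNT(*) FROM events WHERE user_id = ? AND kind = ?",
            (7, "click"),
        ).fetchone()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed / repeats  # average seconds per query


def sweep():
    """Run every (index, row count) combination and collect the timings."""
    results = []
    for (name, index_sql), n_rows in product(INDEX_OPTIONS.items(), ROW_COUNTS):
        results.append(
            {
                "index": name,
                "rows": n_rows,
                "seconds_per_query": run_benchmark(index_sql, n_rows),
            }
        )
    return results
```

Calling `sweep()` returns one timing record per configuration. The absolute numbers depend on the machine, so the useful signal is the comparison between rows of the result.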

Evals as the modern PRD

Traditional software development relies on Product Requirement Documents (PRDs) that are often interpreted loosely. In the AI era, this ambiguity is a liability. Goyal posits that an eval is simply a modernized, quantified PRD. By encoding what good looks like into a scoring function, you force clarity upon the system.
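As a rough illustration of turning a requirement into a scoring function, the sketch below assumes a hypothetical requirement: a generated summary must stay under a word limit and mention certain terms. The names `EvalCase`, `score_summary`, and `run_eval` are invented for this example and are not Braintrust APIs or anything quoted from the episode.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One output to grade, plus the criteria a good output must meet (hypothetical)."""
    output: str
    required_terms: list[str]
    max_words: int


def score_summary(case: EvalCase) -> float:
    """Return a score in [0, 1]: the fraction of required terms present,
    or 0.0 if the output exceeds the word limit."""
    if len(case.output.split()) > case.max_words:
        return 0.0
    if not case.required_terms:
        return 1.0
    text = case.output.lower()
    hits = sum(1 for term in case.required_terms if term.lower() in text)
    return hits / len(case.required_terms)


def run_eval(cases: list[EvalCase]) -> float:
    """Average score across all cases: the single number to track over time."""
    if not cases:
        raise ValueError("no eval cases provided")
    return sum(score_summary(c) for c in cases) / len(cases)
```

The point of the exercise is that a vague line in a PRD ("summaries should be short and cover the key points") becomes a number that either improves or regresses with each change.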

The hidden consequence of this approach is that it allows for the scaling of individual taste. When a senior engineer or designer encodes their aesthetic or technical standards into an eval, those standards are applied to every single output, rather than just the ones the expert has time to review.

"We are able to have David's palate applied to more things. I think the quality bar that we are able to hit is higher because we are able to get more things to that bar."

-- Ankur Goyal

This creates a compounding feedback loop: the system learns the expert preferences, the expert refines the system based on its failures, and the quality bar rises across the entire organization simultaneously.

The CI/CD bottleneck

Many teams struggle to integrate AI because their underlying infrastructure, specifically their CI/CD pipelines, is not built to handle the velocity that AI enables. Goyal suggests that if you feel constrained, the solution is not to push harder on feature development, but to pause and invest in CI.

This is an unpopular but durable investment. When CI is robust, it earns the team the right to move faster. Without it, you are simply shipping poor work at a higher frequency. Robust CI also makes carving possible: the process of removing unnecessary features and complexity, rather than only constructing more code.
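One way a pipeline might enforce this is to fail the build when an evaluation score regresses. The sketch below is an assumption about how such a gate could look, not a description of Braintrust's pipeline: `check_regression`, the baseline value, and the tolerance are all hypothetical.

```python
def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> int:
    """Return a process exit code for CI: 0 if the current eval score is within
    `tolerance` of the baseline (or better), 1 if it has regressed."""
    if current + tolerance < baseline:
        print(f"FAIL: eval score {current:.3f} fell below baseline {baseline:.3f}")
        return 1
    print(f"PASS: eval score {current:.3f} (baseline {baseline:.3f})")
    return 0


# Hypothetical usage inside a CI step:
#   import sys
#   sys.exit(check_regression(current=0.87, baseline=0.91))
```

A non-zero exit code is what most CI systems treat as a failed step, so the same function can block a merge regardless of whether a human or an agent wrote the change.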

Key action items

Audit your agent line: Identify three recurring decisions or manual processes you perform weekly. Over the next quarter, build an agent-based workflow to handle these, forcing yourself to define the what rather than the how.
Invest in CI/CD infrastructure: Treat your CI/CD pipeline as your most critical product. If your deployment process is slow, AI-generated code will only create technical debt faster. This pays off in 6 to 12 months by enabling consistent, high-velocity shipping.
Build your first eval: Stop relying on vibe checks. Encode your current success criteria into a simple scoring function. Even if it feels imperfect, it creates a baseline for improvement that you can iterate on over the next month.
Adopt remote/cloud development: If your agent workflows are causing local hardware bottlenecks, move them to remote cloud environments. This is an immediate investment that prevents the noise and context-switching that kill flow states.
The reset protocol: When an agent produces 3,000 lines of trash, do not try to patch it. Close the session, manually write the core logic yourself to regain deep context, and then re-introduce the agent to handle the scale. This is a short-term discomfort that yields a higher-quality, more maintainable system.
