
AI's True Value Lies in Engineering, Not Just Models

Original Title: Evals, Feedback Loops, and the Engineering That Makes AI Work
AI + a16z · Listen to Original Episode →

The AI industry is captivated by the allure of ever-smarter models, often mistaking raw power for true progress. This conversation with Ankur Goyal, founder and CEO of Braintrust, reveals a critical, often overlooked truth: the real competitive advantage in AI lies not in the models themselves, but in the robust engineering that surrounds them. Goyal argues that the current "brute force" approach--throwing more compute and data at training--is economically viable only as long as capital is abundant. The non-obvious implication? Those who master the engineering of feedback loops, rigorous evaluation (evals), and adaptable systems will capture significant, lasting value when brute force hits its limits. This analysis is crucial for product leaders, engineering managers, and strategists aiming to build AI products that not only function but excel, offering a distinct edge over competitors fixated solely on model iteration.

The Brute Force Trap: Why Smarter Models Aren't Enough

The AI landscape is currently dominated by a "brute force" mentality. Frontier labs, fueled by abundant capital, are essentially "raising money and building a model based on the money." This approach, which involves throwing more compute and data at training runs, bypasses the painstaking work of optimization and engineering. Ankur Goyal points out that this strategy is sustainable only as long as capital remains readily available. The critical insight here is that this reliance on brute force creates a hidden vulnerability. When the ability to simply "make God 1 smarter" through capital infusion hits a ceiling, the true value will shift dramatically to efficiency and engineering.

"Right now, we're kind of building God, and so it's possible and probably economically viable to keep throwing capital at the problem to make God 1 smarter. But when you can't make God 1 smarter, there is an insane opportunity to engineer God to be more efficient."

This highlights a fundamental misunderstanding of where value creation will ultimately reside. The focus on the "bundle of weights" -- the model itself -- distracts from the "shit ton of engineering" required to make AI truly effective and reliable. This includes the intricate processes of capturing, cleaning, preparing, and distributing data, as well as the development of robust testing and feedback mechanisms. Companies that prioritize this surrounding engineering, rather than solely chasing the next state-of-the-art model, are building a more durable advantage.

Evals as the Scientific Method for AI: Taming Non-Determinism

A recurring theme is the concept of "evals," which Goyal defines as the application of the scientific method to non-deterministic systems like AI. This is not just a technical term; it represents a fundamental shift in how AI products should be developed and validated. Instead of blindly adopting new models or tweaking prompts, teams must form hypotheses, simulate system behavior on diverse inputs, and quantitatively measure outcomes. This rigorous process is essential for managing the inherent unpredictability of AI.

The implication for product development is profound. Goyal suggests that evals are the natural evolution of a Product Requirements Document (PRD). By creating a truly effective eval, a team is essentially creating a declarative representation of what their product should be. This moves beyond subjective assessments to objective, data-driven validation. The danger, as seen in the "bitter lesson" narrative, is when teams fail to engineer these feedback loops. They build systems that are designed to be thrown away, yet they become so deeply entrenched that they are impossible to iterate on.

"I think evals are like the scientific method applied to software engineering with non-deterministic systems, like AI systems. So you come up with a hypothesis... and then you essentially simulate running the system on a set of inputs and you observe the outputs."

This disciplined approach to evaluation provides a crucial defense against the inherent chaos of AI. Companies that invest heavily in well-engineered testing harnesses, even if the "innards of the system" are intentionally left unoptimized, gain a significant edge. They can precisely understand where their AI works and where it fails, allowing for targeted improvements rather than broad, inefficient model upgrades.
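The hypothesis → simulate → measure loop Goyal describes can be sketched as a small harness. This is an illustrative skeleton, not Braintrust's API; the names (`run_eval`, `exact_match`, `toy_system`) and the trivial scorer are invented for the example.

```python
# Minimal eval-harness sketch: run a system over a fixed set of cases
# and quantitatively score the outputs, instead of eyeballing prompts.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str       # input fed to the system under test
    expected: str    # reference output for scoring

def run_eval(system: Callable[[str], str],
             cases: list[EvalCase],
             score: Callable[[str, str], float]) -> float:
    """Run the system on every case and return the mean score."""
    scores = [score(system(c.input), c.expected) for c in cases]
    return sum(scores) / len(scores)

# A trivial stand-in "system" and scorer so the harness is runnable;
# in practice the system would be a model call and the scorer could be
# an exact match, a rubric, or an LLM-as-judge.
def toy_system(prompt: str) -> str:
    return prompt.upper()

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

cases = [EvalCase("hello", "HELLO"), EvalCase("world", "world")]
print(run_eval(toy_system, cases, exact_match))  # 0.5
```

Swapping in a new model or prompt then becomes a measured experiment: rerun the same cases and compare the score, rather than trusting a vibe check.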

The Bash vs. SQL Divide: Engineering for Structure Over Brute Force

The conversation around agents and their interaction with tools reveals a critical dichotomy: the "give it a computer" brute-force approach versus a more structured, computer science-informed method. Many developers are drawn to giving agents access to Unix environments and Bash commands, assuming models are inherently good at this. Goyal argues this is a flawed engineering mindset.

His team's benchmark comparing Bash and SQL for agent tasks yielded "comical" results: SQL consistently outperformed Bash in accuracy, efficiency, and speed, even for less capable models. This suggests that models are not inherently better at manipulating files via Bash; rather, their performance is dictated by the structure and accessibility of the data. SQL, when applied to structured data, provides a more precise and efficient mechanism for querying and manipulating information than a series of Bash commands.

"SQL is more accurate, it's more efficient, it's more token efficient, it's faster. The worst models perform better on SQL than they do on like everything."

This points to a significant opportunity for competitive advantage. Teams that embrace structured tools and computer science fundamentals--like strong typing and referential transparency--are not underestimating the models; they are providing them with the right environment to excel. This "golden agent CS world" prioritizes reliability and safety by leveraging well-understood principles, rather than relying on the chaotic flexibility of a Unix shell. The ability to define clear type specifications, as seen in Braintrust's type_specs folder, allows for rigorous debate and a higher degree of confidence in implementation, moving beyond the "vibe coding" and comprehension debt often associated with less structured approaches.
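The dichotomy can be made concrete with a toy contrast: answering the same question over structured data via SQL versus a shell-style text pipeline. The data and schema here are invented for illustration; the episode's actual benchmark tasks are not reproduced.

```python
# Structured route: schema + SQL answers the question in one precise step.
import sqlite3

rows = [("alice", 42), ("bob", 7), ("carol", 99)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
(top,) = conn.execute(
    "SELECT name FROM users ORDER BY score DESC LIMIT 1"
).fetchone()
print(top)  # carol

# Bash-style route: the agent must chain text tools and get every
# delimiter and flag right, with many more places to go wrong:
#   sort -t, -k2 -nr users.csv | head -1 | cut -d, -f1
```

The point is not that shells are useless, but that typed, declarative access to structured data gives the model fewer degrees of freedom to misuse.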

The Token Path and the Illusion of Growth

A subtle but critical point emerges regarding the economics of AI: the "token path." Many AI companies, particularly those building on top of foundation models, operate by selling tokens -- in effect reselling compute. This creates a powerful incentive for rapid growth, because top-line revenue can be juiced by selling tokens at or near cost, even at the expense of long-term margin. Founders and investors are trained to prioritize this growth, leading to a dynamic where companies essentially "leak money" by selling services at cost.

Goyal’s approach at Braintrust, inspired by Datadog’s framework, is to align value with token consumption without directly reselling tokens. Their pricing is based on gigabytes ingested, a metric that closely mirrors token usage (a token is roughly four bytes). This strategy aims for "value alignment," ensuring that customers perceive value on the order of their consumption, while also reflecting the company's underlying costs. This is a difficult but necessary pivot. Companies that can successfully navigate this shift, demonstrating value beyond simply reselling tokens, will build more sustainable businesses. The temptation of the token path, while offering immediate growth, ultimately leads to a race to the bottom on margins, a trap that sophisticated engineering and thoughtful pricing models can help avoid.
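The "a token is roughly four bytes" heuristic makes the alignment between GB-ingested pricing and token consumption easy to check back-of-envelope. The 4-byte figure is the episode's rough average, not an exact constant, and the decimal-gigabyte convention is an assumption about how billing counts bytes.

```python
# Back-of-envelope: how many tokens does one billed gigabyte represent
# under the "~4 bytes per token" heuristic from the episode?
BYTES_PER_TOKEN = 4      # rough average; real tokenizers vary by language
GIGABYTE = 10**9         # decimal GB, as billing usually counts bytes

tokens_per_gb = GIGABYTE / BYTES_PER_TOKEN
print(f"{tokens_per_gb:.0f} tokens per GB ingested")  # 250000000
```

So a per-GB price tracks token consumption within a small constant factor, which is what makes the metric "value aligned" without literally reselling tokens.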

Key Action Items

  • Develop a Rigorous Eval Framework: Implement a systematic process for evaluating AI model performance, treating evals as the scientific method for AI. This involves defining hypotheses, simulating inputs, and quantitatively measuring outputs. (Immediate Action)
  • Prioritize Engineering Around Models: Shift focus from solely chasing the latest models to building robust engineering around them, including data pipelines, testing harnesses, and feedback loops. (Immediate Action)
  • Embrace Structured Tools for Agents: Move beyond the "give it a computer" mentality for agents. Invest in and benchmark structured tools like SQL, which offer greater accuracy and efficiency than generic Unix environments. (Immediate Action, pays off in 3-6 months)
  • Define Clear Value Metrics: Align pricing and product strategy with customer value, even if it means deviating from the direct token-reselling model. Explore metrics that reflect consumption and cost, such as data ingestion volume. (Longer-term investment, pays off in 6-12 months)
  • Invest in Adaptable Systems: Build AI systems with the principle of "bitter lesson-pilling" in mind, creating architectures that can be easily iterated upon or thrown away as new models and techniques emerge. (Immediate Action, creates long-term flexibility)
  • Formalize System Specifications: Utilize declarative specifications, such as type systems and state management definitions, to guide AI development and ensure reliability, rather than relying on ad-hoc scripting. (This requires discomfort now, but creates advantage in 12-18 months for maintainability and robustness)
  • Analyze Competitor Capital Flows: Understand how readily available capital influences competitors' strategies, recognizing that a reliance on brute-force scaling may become a vulnerability when capital tightens. (Ongoing Strategic Analysis)

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.