AI Accelerates Automation by Quantifying Economic Value and Reshaping Expertise

Original Title: Brendan Foody on Teaching AI and the Future of Knowledge Work

The AI Training Revolution: Beyond Raw Text to Rubrics and Real-World Value

This conversation with Brendan Foody, co-founder of Mercor, reveals a critical and often overlooked shift in the AI landscape: the move from simply feeding models vast amounts of text to meticulously defining and measuring success through expert-crafted rubrics. The non-obvious implication is that the true bottleneck in AI advancement is not computational power or raw data but the human expertise required to teach models what good looks like, especially in economically valuable domains. This insight matters to anyone in tech, AI development, or business strategy who wants to understand the next frontier of AI utility and competitive advantage. By focusing on the quality of AI evaluation, the discussion offers a roadmap for building more impactful AI, and a significant edge for those who grasp its strategic importance.

The Hidden Cost of "Good Enough" AI: Why Rubrics Trump Raw Text

The prevailing narrative in AI development often centers on the sheer volume of data and the increasing sophistication of models. However, Brendan Foody, the remarkably young founder of Mercor, a company that sources experts to train frontier AI models, highlights a more nuanced and arguably more critical factor: the quality of evaluation. Mercor's business model, which includes paying top poets $150 an hour, underscores a fundamental truth: teaching AI to perform economically valuable tasks requires more than just data; it demands expert-defined standards of success. This is where the concept of "rubrics" becomes paramount, shifting the focus from mere output generation to the precise measurement of desired outcomes.

Foody argues that much of the AI research community has been fixated on academic benchmarks--like graduate-level reasoning tests or math olympiads--which are disconnected from real-world applications. The true challenge lies in evaluating AI's ability to, for instance, automate medical diagnoses or draft legal documents. Mercor’s approach, working with luminaries like Larry Summers for finance, Cass Sunstein for law, and Eric Topol for medicine, aims to bridge this gap. These experts are not just academics; they possess a broad, industry-wide vantage point crucial for designing effective evaluation frameworks.

The core insight here is that the rate of model improvement on economically valuable tasks is staggering, estimated at 25-30% per year. However, this progress is directly tied to the quality of the evaluations. Foody elaborates on the methodology: surveying hundreds of experts in fields like consulting to understand how they spend their time, then translating that into prompts and rubrics. This process quantifies the economic value of tasks, providing a tangible metric for AI progress.

"The largest disconnect that we were seeing in AI research is that everyone was focused on academic evals, like GPQA for PhD-level reasoning or IMO for Olympiad math, which were wholly disconnected from the outcomes that customers actually care about: how do we get the model to automate a medical diagnosis, or a legal draft, or preparing a certain financial analysis of a company."

-- Brendan Foody

This focus on rubrics reveals a systemic consequence: without them, AI development risks optimizing for the wrong objectives. A model might be excellent at generating grammatically correct poetry, but if it doesn't capture the nuanced aesthetic or emotional resonance that human readers value, its utility is limited. Foody emphasizes that while models are rapidly improving, the "last 25%"--the truly complex, nuanced, and taste-driven aspects of human expertise--remains a significant bottleneck. This is precisely where human experts, guided by well-defined rubrics, become indispensable. The implication for businesses is clear: investing in the development of robust evaluation frameworks, rather than solely in model scaling, will be a key differentiator.
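The episode doesn't describe Mercor's internal rubric format, but the idea of expert-weighted evaluation criteria can be made concrete. Below is a minimal sketch, assuming a rubric is a list of criteria with expert-assigned weights (reflecting relative economic importance) and per-output scores; the criterion names and weights are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative economic importance, set by a domain expert
    score: float   # expert-assigned score for one model output, in [0, 1]

def rubric_score(criteria):
    """Collapse a rubric into one number: the weighted average of
    expert scores, which can be tracked across model versions."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("rubric must have at least one weighted criterion")
    return sum(c.weight * c.score for c in criteria) / total_weight

# Hypothetical rubric for a model-drafted legal memo.
memo_rubric = [
    Criterion("cites controlling authority", weight=3.0, score=0.9),
    Criterion("issue spotting",              weight=2.0, score=0.6),
    Criterion("clarity of prose",            weight=1.0, score=0.8),
]
print(round(rubric_score(memo_rubric), 3))  # → 0.783
```

The point of a scheme like this is that "the model got better" becomes a measurable claim: re-score the same rubric against each model release and the 25-30% annual improvement Foody cites becomes something a team can verify in its own domain.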

The Long Horizon Payoff: From Task Automation to Agent Training

The conversation pivots to a more profound implication of AI advancement: the shift from automating individual tasks to enabling the training of sophisticated AI agents capable of complex, long-horizon work. Foody predicts that within six to twelve months, we will see models capable of extensive tool use and multi-day tasks. This capability fundamentally alters the nature of knowledge work.

Instead of performing repetitive analysis themselves, knowledge workers will increasingly transition to training AI agents and building reinforcement learning (RL) environments. This represents a significant departure from conventional wisdom, which often focuses on AI replacing jobs. Foody's perspective is that AI will create new job categories centered on AI supervision and development.

"I think that a huge portion of the economy will become an RL environment machine."

-- Brendan Foody

This transition has profound implications for competitive advantage. Companies that can effectively train and deploy these agents will gain a significant edge. The analogy to software development is apt: initial investment in building robust agents and RL environments yields scalable, repeatable value. This is a delayed payoff, requiring upfront investment in expertise and infrastructure, but one that promises substantial long-term returns. The "discomfort now, advantage later" dynamic is evident here; the effort to define and build these training systems is demanding, but it unlocks unprecedented productivity. Conventional wisdom, focused on immediate task automation, fails to capture the strategic value of this longer-horizon investment.
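To make "building an RL environment" less abstract: an environment is just a reset/step loop that hands an agent observations and rewards. The toy sketch below uses the common gym-style interface; the task (an agent revising a draft toward a quality threshold), its dynamics, and all names are illustrative assumptions, not anything described in the episode:

```python
import random

class DraftReviewEnv:
    """Toy gym-style environment: an agent spends revision effort
    each step and is rewarded by a rubric-like quality score.
    Purely illustrative; the dynamics are invented for the sketch."""

    def __init__(self, target_quality=0.9, max_steps=10, seed=0):
        self.target = target_quality
        self.max_steps = max_steps
        self.rng = random.Random(seed)  # fixed seed for reproducibility

    def reset(self):
        self.quality = 0.2   # the first draft starts rough
        self.steps = 0
        return self.quality  # observation

    def step(self, action):
        """action: revision effort in [0.0, 1.0]."""
        self.steps += 1
        # Revision improves quality with diminishing returns, plus noise.
        self.quality = min(1.0, self.quality
                           + 0.3 * action * (1 - self.quality)
                           + self.rng.uniform(-0.02, 0.02))
        done = self.quality >= self.target or self.steps >= self.max_steps
        reward = self.quality - 0.05 * action  # quality net of effort cost
        return self.quality, reward, done

# A trivial "always revise hard" policy, run to completion.
env = DraftReviewEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(action=1.0)
```

This is the shape of the upfront investment Foody describes: once an environment encodes what "done well" means, any number of agents can be trained and evaluated against it, which is where the software-like scalability comes from.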

The Taste Dilemma: Enshrining Preferences in a Changing World

A particularly fascinating thread in the discussion is the challenge of "taste" in AI. Foody acknowledges that taste, particularly in subjective domains like poetry or law, is difficult to capture in a rubric. Immanuel Kant's assertion that taste cannot be codified highlights this inherent tension. While RLHF (Reinforcement Learning from Human Feedback) offers a way to capture preferences, it raises questions about whose preferences should be enshrined. Should AI model the taste of historical masters like Milton or Wordsworth, or contemporary experts?

Foody suggests that in the long run, AI will likely be able to personalize taste, drawing on various historical and contemporary knowledge bases. However, this raises a critical question for businesses and developers: what taste are you optimizing for now? The choice of evaluators and the criteria used to define "good" will shape the AI's output and, consequently, its market impact.

This isn't just an academic debate; it has direct economic consequences. If AI is trained on a narrow definition of taste, it may fail to capture broader market appeal or alienate segments of users. Conversely, a model that can adapt to diverse preferences, guided by well-crafted rubrics and expert feedback, will possess a significant competitive advantage. The implication is that companies need to be deliberate about the "taste" they imbue in their AI, understanding that this choice has long-term strategic ramifications, even if it requires uncomfortable upfront decisions about whose expertise to prioritize.

Actionable Takeaways for Navigating the AI Frontier

Based on this conversation, here are key actions to consider:

  • Prioritize Rubric Development: For any AI initiative, invest heavily in defining clear, measurable rubrics for success, especially for economically valuable tasks. This is not just about data collection; it's about expert evaluation.
    • Immediate Action: Audit current AI projects for clear evaluation criteria.
  • Focus on Long-Horizon Capabilities: Shift strategic thinking beyond immediate task automation to developing AI agents capable of complex, multi-day projects.
    • This pays off in 12-18 months: Begin R&D into agent training and RL environment development.
  • Cultivate Domain Expertise: Recognize that the "last 25%" of AI performance relies on deep human expertise. Actively seek and integrate domain experts into your AI development and evaluation processes.
    • Over the next quarter: Identify and engage with key domain experts relevant to your industry.
  • Embrace the "Taste" Challenge: Be intentional about the aesthetic and qualitative standards you want your AI to embody. Understand that taste is subjective and evolving, and your choices will shape your AI's market reception.
    • This pays off in 18-24 months: Develop strategies for incorporating diverse and evolving taste preferences into AI training.
  • Invest in AI Training Roles: Anticipate the emergence of new job categories focused on training AI agents and building RL environments.
    • Immediate Action: Begin upskilling existing talent or hiring for roles focused on AI supervision and training.
  • Leverage AI for Hiring: Implement data-driven, skills-based assessments rather than relying on subjective "vibes" or traditional interview heuristics.
    • Over the next 6 months: Pilot project-based assessments for key hires.
  • Consider the "Discomfort Now, Advantage Later" Principle: Embrace initiatives that require upfront effort and may not show immediate results but build durable competitive moats. Developing robust AI evaluation and training systems falls squarely into this category.
    • Ongoing Investment: Allocate resources to long-term AI capabilities that competitors may shy away from due to their difficulty.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.