
AI Accelerates Automation by Quantifying Economic Value and Reshaping Expertise

TL;DR

  • AI models are improving at economically valuable tasks at a rate of 25-30% per year, significantly accelerating automation potential across industries.
  • Measuring AI success requires rubrics and real-world task evaluations, not just academic benchmarks, to align with actual economic value and customer needs.
  • Knowledge workers will increasingly transition from performing repetitive tasks to training AI agents and building reinforcement learning environments.
  • The future of AI development hinges on defining rubrics and high-quality data that capture nuanced human taste and complex, long-horizon tasks.
  • Expertise in niche domains, particularly those involving subjective judgment and uncodified knowledge, remains a critical bottleneck for AI capabilities.
  • AI-driven labor market matching will become more efficient by aggregating candidates and jobs, overcoming the current difficulty in assessing true job performance from profiles.
  • Future hiring will likely involve assessing candidates' ability to leverage AI tools effectively, rather than prohibiting their use, to gauge real-world impact.

Deep Dive

Brendan Foody, the youngest unicorn founder and guest on Conversations with Tyler, highlights a critical shift in AI development: moving beyond purely academic benchmarks to measure real-world economic value. His company, Mercor, facilitates this by employing domain experts to train and evaluate AI models, revealing an astonishing rate of improvement in economically valuable tasks, estimated at 25-30% annually. This progress suggests a rapid transformation of knowledge work, where human experts will increasingly focus on training AI agents to perform complex, long-horizon tasks, rather than executing them directly.

The implications of this accelerated AI capability are profound and multi-faceted. Firstly, the economic value created by AI is rapidly increasing. By surveying experts in fields like finance, law, and consulting, Mercor quantifies the time spent on various tasks, using this as a proxy for economic value. Models are already scoring highly on these evaluations, indicating their growing capacity to automate significant portions of high-value knowledge work. This implies a future where AI handles the bulk of routine tasks, while human expertise remains crucial for the final, hardest quarter of the work: the portion that requires nuanced judgment, integration across multiple tools, and long time horizons.

Secondly, this advancement necessitates a re-evaluation of how human talent is utilized and assessed. The future of knowledge work will likely involve individuals training AI agents and building reinforcement learning environments, rather than performing repetitive analysis. This shift creates a demand for new job categories focused on AI training and oversight. Consequently, traditional hiring practices, often reliant on subjective "vibes" and interviews, are becoming obsolete. Mercor emphasizes data-driven assessments and project-based evaluations, recognizing that the ability to effectively utilize AI tools will be a key differentiator. This also suggests a potential resurgence of nepotism and networking as differentiators in a crowded field of seemingly qualified candidates, though the hope is that data-driven AI systems can mitigate this by objectively assessing performance.

Finally, the increasing capability of AI also raises questions about the nature of expertise and taste. While AI can excel at tasks with quantifiable rubrics, domains like poetry and law, which heavily involve subjective taste, present a greater challenge. Mercor's approach involves using both rubrics and Reinforcement Learning from Human Feedback (RLHF) to capture these nuances. This suggests that the future of AI development will involve not just raw intelligence, but also the ability to understand and adapt to evolving human preferences and aesthetic standards, potentially leading to highly personalized AI interactions across various domains. The ultimate impact will be a significant increase in economic efficiency and productivity, fundamentally reshaping how work is done and value is created.

Action Items

  • Create rubrics: Define 3-5 criteria for evaluating AI model performance in economically valuable tasks (e.g., medical diagnosis, legal drafting).
  • Measure model improvement: Track the percentage increase in AI model performance on economically valuable tasks year-over-year using expert-defined rubrics.
  • Develop long-horizon task evals: Design and implement assessments for AI models to measure their capability in multi-day or multi-week tasks, integrating multiple tools.
  • Identify niche expertise gaps: For 3-5 domains, analyze areas where human experts possess unique taste or undocumented knowledge that AI models currently struggle to replicate.
  • Build agent training environments: Create RL environments for 2-3 knowledge work verticals (e.g., investment analysis, customer support) to train AI agents for specific, long-horizon tasks.

Key Quotes

"When one of the AI labs wants to teach their models how to be better at poetry, we'll find some of the best poets in the world that can help to measure success via creating evals and examples of how the model should behave. And one of the reasons that we're able to pay so well to attract the best talent is that when we have these phenomenal poets that teach the models how to do things once, they're then able to apply those skills and that knowledge across billions of users, hence allowing us to pay $150 an hour for some of the best poets in the world."

Brendan Foody explains that Mercor hires top poets to train AI models in poetry. The high hourly rate is justified because these poets' expertise can be applied across a vast user base, making the investment scalable and profitable. This highlights Mercor's model of leveraging specialized human talent for AI development.


"The largest disconnect that we were seeing in AI research is that everyone was focused on academic evals, like GPQA for PhD-level reasoning or IMO for Olympiad math, which were wholly disconnected from the outcomes that customers actually care about: how do we get the model to automate a medical diagnosis, or a legal draft, or preparing a certain financial analysis of a company?"

Brendan Foody points out a critical gap in AI research: a focus on theoretical benchmarks rather than practical, real-world applications. Foody's company, Mercor, aims to bridge this by using experts in fields like medicine, law, and finance to develop evaluations that reflect actual customer needs and economic value.


"The largest takeaway is the rate of model improvement at economically valuable tasks is incredible. If you look at the level that GPT-4 scored on this, right, a frontier model a year ago, and that against GPT-5 today, the delta is profound. [Can you put a number on that?] Yeah, call it a 25-30% improvement per year. [Per year?] Exactly."

Brendan Foody shares a key finding from their research: AI models are improving at an astonishing rate in tasks that have economic value. Foody quantifies this improvement at approximately 25-30% per year, indicating a rapid acceleration in AI's capability to automate complex professional work.
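The podcast gives only the annual rate, but it is worth seeing how steady 25-30% yearly gains compound. The sketch below is purely illustrative arithmetic, not Mercor's methodology; it shows that at the upper rate, scores roughly double about every three years.

```python
# Illustrative only: compound a steady 25-30% annual improvement in eval scores.

def compound(rate: float, years: int) -> float:
    """Multiplicative gain after `years` of steady improvement at `rate` per year."""
    return (1 + rate) ** years

for rate in (0.25, 0.30):
    gains = [round(compound(rate, y), 2) for y in range(1, 5)]
    print(f"{rate:.0%}/yr -> cumulative gains, years 1-4: {gains}")
```

At 25%/year the cumulative gain after four years is about 2.4x; at 30%/year it is about 2.9x, which is what makes Foody's "delta is profound" framing concrete.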


"[The economic value is in a way still at zero.] Well, so it's interesting. I think what you're getting at is there's sort of two key things the models struggle at that humans tend to be very good at. The first is these longer-horizon tasks: not just something that we could do in a few hours, but something that might take us 50 or 100 hours to do. And then the second thing is integrating multiple tools with our response and going about doing these things, maybe interacting with people as one of those elements."

Brendan Foody identifies two significant limitations of current AI models: their difficulty with long-term, multi-stage tasks and their inability to seamlessly integrate multiple tools or interact with humans. Foody suggests these are areas where human expertise remains crucial, though he anticipates rapid advancements in these capabilities within the next six to twelve months.


"The best analog is a rubric, a rubric for how to grade. So, if the poem evokes this idea that is inevitably going to come up in this prompt, or has a characteristic of a really good response, we'll reward the model a certain amount if it says this thing, and we'll penalize the model if it styles the response in this way. Those are the types of things, in many ways very similar to the way that a professor might create a rubric to grade an essay or a poem."

Brendan Foody explains that the most valuable data for improving AI, particularly in subjective domains like poetry, is a well-defined rubric. This rubric acts as a grading system, outlining specific characteristics of good responses and providing clear criteria for AI to learn from, analogous to how a professor grades an essay.
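Mechanically, a rubric of this kind is just a weighted checklist: each criterion rewards or penalizes a response, and the total becomes the grading signal. The sketch below illustrates that structure; the specific criteria, checks, and weights are hypothetical examples, not Mercor's actual rubrics.

```python
# Minimal sketch of rubric-based grading: each criterion fires a check against
# the response and contributes a reward or penalty. Criteria are hypothetical.

from typing import Callable, List, Tuple

# (criterion name, check on the response, reward or penalty when the check fires)
Criterion = Tuple[str, Callable[[str], bool], float]

POETRY_RUBRIC: List[Criterion] = [
    ("evokes the prompt's central idea", lambda poem: "sea" in poem.lower(), +2.0),
    ("penalized styling (all caps)",     lambda poem: poem.isupper(),        -1.5),
    ("rewarded phrasing (uses a caesura)", lambda poem: "," in poem,         +0.5),
]

def grade(poem: str, rubric: List[Criterion]) -> float:
    """Sum the reward of every criterion whose check fires, like a professor's rubric."""
    return sum(reward for _, check, reward in rubric if check(poem))

score = grade("The grey sea folds, and folds again.", POETRY_RUBRIC)
print(score)  # 2.0 (central idea) + 0.5 (caesura) = 2.5
```

In RL training, a score like this would serve as the reward for the model's response; RLHF-style preference data covers the taste that cannot be reduced to checks.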


"My firm belief is, especially for economically valuable tasks, we'll move towards a world where people do things once. Instead of the investment banker redundantly analyzing a data room to prepare an analysis of a company, you know, every couple of weeks for a new project or a new customer, they'll teach the model how to do that once in the particular domains that they operate in, and, similar to building software once, they'll be able to use that many times as they use their agent. Instead of the customer support rep monotonously responding to tickets every day, they'll find the mistake that the agent makes, they'll turn that into an RL environment, and then all of a sudden the agent will be able to solve that problem many times."

Brendan Foody predicts that economically valuable tasks will increasingly shift towards a model where humans teach AI systems once, and then those systems can perform the task repeatedly. This is akin to software development, where a fixed cost is invested upfront to create a reusable asset, thereby transforming knowledge work into a more efficient, agent-based system.
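The "turn the mistake into an RL environment" step can be pictured as wrapping the failed case in a replayable reset/step interface, in the style of gym-like RL libraries. The sketch below assumes a toy customer-support scenario; the ticket text, the keyword check, and the `TicketEnv` class are all hypothetical simplifications, since a real environment would wrap actual tool calls and richer verification.

```python
# Sketch: a human catches one agent mistake, then encodes it as a tiny,
# replayable RL environment. All details here are hypothetical.

from dataclasses import dataclass

@dataclass
class TicketEnv:
    """One support ticket the agent previously got wrong, replayable for training."""
    ticket: str
    expected_keyword: str  # what a correct resolution must mention

    def reset(self) -> str:
        # The observation the agent starts from: the original ticket.
        return self.ticket

    def step(self, agent_reply: str) -> float:
        # Reward 1.0 when the reply contains the fix the human identified.
        return 1.0 if self.expected_keyword in agent_reply.lower() else 0.0

env = TicketEnv(ticket="Customer reports a duplicate charge on an invoice.",
                expected_keyword="refund")
obs = env.reset()
reward = env.step("We have issued a refund for the duplicate charge.")
print(reward)  # 1.0
```

The one-time cost of writing the environment is what makes the economics software-like: the agent can then be trained against it as many times as needed.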

Resources

External Resources

Books

  • "Critique of Judgment" by Immanuel Kant - Mentioned in relation to the idea that taste cannot be captured in a rubric.

Articles & Papers

  • "AI-generated poems vs. human-generated poems" - Mentioned in the context of studies where humans prefer AI-generated poems, even if experts find them worse.

People

  • Larry Summers - Mentioned as an expert hired for finance and economics in relation to the AI productivity index.
  • Cass Sunstein - Mentioned as an expert hired for law in relation to the AI productivity index.
  • Eric Topol - Mentioned as an expert hired for medicine in relation to the AI productivity index.
  • Geoffrey Hill - Mentioned as a contemporary poet whose work the speaker enjoys.
  • William Wordsworth - Mentioned as an example of an older poet whose style might be modeled.
  • William Blake - Mentioned as an example of an older poet whose style might be modeled.
  • John Milton - Mentioned as an example of an older poet whose style might be modeled.
  • Rilke - Mentioned as an example of an older poet whose style might be modeled.
  • Scott Sumner - Mentioned in relation to the idea that the best movies were made in the 1960s and 70s.
  • Sundeep Jain - Mentioned as a deep domain expert in labor markets, formerly Chief Product Officer at Uber.
  • Peter Thiel - Mentioned in relation to the Thiel Fellowship and his potential role in interviewing candidates.

Organizations & Institutions

  • Mercatus Center - Producer of "Conversations with Tyler."
  • George Mason University - Affiliated with the Mercatus Center.
  • Mercor - An AI company founded by Brendan Foody.
  • Safeway - Mentioned as the place where Brendan Foody bought donuts for his eighth-grade business.
  • McKinsey - Mentioned as a top consulting firm whose experts were surveyed for the AI productivity index.
  • Bain - Mentioned as a top consulting firm whose experts were surveyed for the AI productivity index.
  • BCG - Mentioned as a top consulting firm whose experts were surveyed for the AI productivity index.
  • Pro Football Focus (PFF) - Mentioned as a data source for player grading in a previous example.
  • New England Patriots - Mentioned as an example team for performance analysis in a previous example.
  • NFL (National Football League) - Mentioned as the primary subject of sports discussion in a previous example.
  • Apple - Mentioned as a company with a strong brand around privacy.
  • Coca-Cola - Mentioned as an example of a large company where some roles are not directly AI-focused.
  • Amazon - Mentioned as a large company that conducts many interviews.
  • LinkedIn - Used as an analogy for a platform with distribution but difficulty in matching performance.
  • Uber - Mentioned in relation to Sandeep Jain's previous role.

Websites & Online Resources

  • mercatus.org - Website to learn more about the Mercatus Center.
  • conversationswithtyler.com - Website for transcripts and links from the podcast.

Other Resources

  • AI Productivity Index (Apex) - A project focused on measuring AI capabilities in relation to customer outcomes.
  • RLHF (Reinforcement Learning from Human Feedback) - A method for training models by having humans choose preferred responses.
  • Rubric - A tool used to measure success and grade performance, particularly for AI models.
  • Evals - Evaluations or tests used to measure AI model capabilities.
  • Policy Debate Team - Mentioned as the activity Brendan Foody and his co-founders participated in during high school.
  • National Extemporaneous Speaking - Mentioned as an activity Brendan Foody and his co-founders participated in during high school.
  • Thiel Fellowship - A program that provides funding for young entrepreneurs.
  • Belly - An app for food ratings in San Francisco.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.