Human-Like Learning, Not RL, Drives Future AI Progress
TL;DR
- Scaling reinforcement learning atop LLMs faces a dilemma: if human-like learners are imminent and models soon learn self-directedly on the job, today's training on verifiable outcomes becomes pointless; if they are not, AGI is not imminent.
- The substantial investment in expert-written questions and answers for LLM benchmarks highlights that current model progress is heavily reliant on human-provided data, not inherent learning.
- Robotics remains hard not because of hardware but because of the absence of human-like learners: robots need extensive practice for simple tasks like picking up dishes instead of being immediately useful.
- The labs' current practices, like pre-baking specific skills, suggest a belief that models will continue to struggle with generalization and on-the-job learning, requiring explicit skill injection.
- The slow diffusion of AI value outside coding indicates models currently lack broad economic utility, contradicting the idea that technology diffusion lag is the primary bottleneck.
- The case for a "software-only singularity," or for AI rapidly improving hardware, overlooks continual learning as the primary driver of future AI improvements, which mirrors human skill acquisition.
- Initial breakthroughs in continual learning will likely be incremental, with human-level on-the-job learning taking years to mature, preventing a sudden, decisive AI advantage.
Deep Dive
The current approach of scaling reinforcement learning (RL) atop large language models (LLMs) faces a fundamental dilemma. Labs are pre-baking skills through extensive RL environments, such as navigating web browsers or building financial models in spreadsheets; this effort is either redundant, if models soon learn on the job in a self-directed way, or insufficient, if they do not, which would mean AGI is not imminent. Likewise, the billions of dollars paid to PhDs, MDs, and other experts to write questions, answers, and reasoning traces show that benchmark improvements rest not just on scale and clever ML research but on human-provided data, and that human-like learning from experience remains a crucial, unsolved problem, particularly in areas like robotics.
The rationale for investing heavily in current RL training methods rests on the assumption that models will continue to struggle with generalization and on-the-job learning. While it is efficient to bake in fluency with common tools once, this overlooks the vast number of company- and context-specific skills required for most jobs, for which there is no robust, efficient AI learning mechanism. The biologist's example of AI struggling to distinguish macrophages from similar-looking dots on slides, despite AI researchers claiming image classification is a solved problem, illustrates this gap. Human workers are valuable precisely because they do not require custom training pipelines for every micro-task; they learn from semantic feedback and self-directed experience, generalizing across diverse situations that demand judgment and situational awareness. Automating even a single job, let alone all jobs, requires more than a predefined set of skills; it requires AI that can learn and adapt like humans.
The argument that AI diffusion is slow merely because of technology-adoption lags is unconvincing; the likelier explanation is that current models lack the capabilities needed for broad economic value. If AI were truly human-like, integrating and onboarding it would be far faster and cheaper than hiring humans, sidestepping the "lemons market" of human hiring, where quality is hard to ascertain in advance. The revenue AI labs currently generate is orders of magnitude below the trillions true AGI would imply, because the models are nowhere near as capable as human knowledge workers; it also suggests that earlier definitions of AGI were too narrow. Future progress will likely come from continual learning, analogous to how humans improve through experience, rather than from a single breakthrough. And while early versions of continual learning are expected, human-level on-the-job learning will likely take years to mature, and fierce competition among labs will prevent any single entity from gaining a runaway advantage.
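To make the contrast concrete, below is a minimal sketch of what on-the-job learning from semantic feedback could look like, as opposed to building a custom RL pipeline per micro-task. Everything here (the llm stand-in, SkillMemory, on_the_job_loop) is an illustrative assumption, not an API from the episode or any real library.

```python
# Minimal sketch of on-the-job learning from semantic feedback, as
# contrasted with pre-baking skills via per-task RL environments.
# Every name here (llm, SkillMemory, on_the_job_loop) is a hypothetical
# placeholder, not an API from the episode or any real library.

from dataclasses import dataclass, field


@dataclass
class SkillMemory:
    """Persistent store of context-specific lessons learned on the job."""
    lessons: list[str] = field(default_factory=list)

    def distill(self, task: str, feedback: str) -> None:
        # A real system would summarize and deduplicate; here we just append.
        self.lessons.append(f"When doing '{task}': {feedback}")

    def as_context(self) -> str:
        return "\n".join(f"- {lesson}" for lesson in self.lessons)


def llm(prompt: str) -> str:
    """Stand-in for any chat-completion call."""
    raise NotImplementedError("plug in a model client here")


def attempt_task(task: str, memory: SkillMemory) -> str:
    # Context-specific skills ride along as prompt context instead of being
    # baked in through a custom training pipeline per micro-task.
    prompt = f"Lessons learned so far:\n{memory.as_context()}\n\nTask: {task}"
    return llm(prompt)


def on_the_job_loop(tasks, get_feedback, memory=None) -> SkillMemory:
    """Attempt tasks, collect semantic (not scalar-reward) feedback,
    and fold it back into the agent's persistent skill memory."""
    memory = memory or SkillMemory()
    for task in tasks:
        output = attempt_task(task, memory)
        feedback = get_feedback(task, output)  # e.g. a coworker's correction
        if feedback:
            memory.distill(task, feedback)
    return memory
```

Keeping lessons as prompt context rather than weight updates is just one possible mechanism; the episode leaves the actual mechanism for continual learning open.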
Action Items
- Audit RL training pipelines: Identify 3-5 critical skills pre-baked into current models and assess their necessity for generalization.
- Design continual learning framework: Define 5 key components for agents to acquire and share context-specific job skills from experience (one possible skeleton is sketched after this list).
- Measure AI-human skill acquisition gap: For 3-5 job types, quantify the difference in learning time and resources between AI and human workers.
- Evaluate RL compute needs: Calculate the potential compute scale-up required for RL to match pre-training gains, referencing Toby Ord's post.
- Draft runbook for AI agents: Outline 5 sections for deploying AI agents, focusing on context-specific skill acquisition and knowledge sharing.
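As a starting point for the framework action item above, one possible decomposition into five components is sketched below; the component names and interfaces are assumptions for illustration, not a prescribed design.

```python
# One possible decomposition of the five framework components called for
# above; the names and method signatures are illustrative assumptions.

from typing import Protocol


class ExperienceCollector(Protocol):
    def record(self, task: str, attempt: str, feedback: str) -> None:
        """Capture each task attempt plus the semantic feedback it received."""


class SkillDistiller(Protocol):
    def distill(self) -> list[str]:
        """Compress raw experience into reusable, context-specific skills."""


class SkillStore(Protocol):
    def save(self, skills: list[str]) -> None: ...

    def retrieve(self, task: str) -> list[str]:
        """Fetch the skills relevant to the task at hand."""


class SkillSharer(Protocol):
    def publish(self, skills: list[str]) -> None:
        """Propagate distilled skills across agent instances or teams."""


class RegressionEvaluator(Protocol):
    def check(self, task: str) -> bool:
        """Verify newly acquired skills don't degrade earlier behavior."""
```

A concrete implementation might back SkillStore with a shared database so distilled skills propagate across deployments, matching the knowledge-sharing goal in the runbook item.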
Key Quotes
"Now currently the labs are trying to bake in a bunch of skills into these models through mid training there's an entire supply chain of companies that are building rl environments which teach the model how to navigate a web browser or use excel to build financial models now either these models will soon learn on the job in a self directed way which will make all this pre baking pointless or they won't which means agi is not imminent"
The author presents a dichotomy regarding current AI training methods. This quote highlights the tension between pre-baking skills into models and the possibility that models will learn those skills independently on the job. If models can learn independently, the extensive pre-baking effort becomes obsolete; if they cannot, AGI is not as imminent as some predict.
"When we see frontier models improving at various benchmarks we should think not just about the increased scale and the clever ml research ideas but the billions of dollars that are paid to phds mds and other experts to write questions and provide example answers and reasoning targeting these precise capabilities"
Barron Millage suggests that progress on AI benchmarks is not solely due to technical advances like increased scale or novel research. Instead, he points to the significant financial investment in human experts who create the data and provide the feedback needed to train these models. This highlights the substantial human labor and cost involved in current AI development.
"This just gives me the vibes of that old joke we're losing money on every sale but we'll make it up in volume somehow this automated researcher is going to figure out the algorithm for agi which is a problem that humans have been banging their head against for the better half of a century while not having the basic learning capabilities that children have"
The author uses a humorous analogy to express skepticism about a specific AI development strategy. This quote critiques the idea that an automated AI researcher, lacking fundamental learning capabilities, can solve the complex problem of AGI, which has eluded human experts for decades. The author finds this approach implausible.
"Human workers are valuable precisely because we don't need to build in these schleppy training bloops for every single small part of their job it's not net productive to build a custom training pipeline to identify what macrophages look like given the specific way that this lab prepares slides and then another training loop for the next lab specific micro task and so on"
The author emphasizes the inherent value of human workers by contrasting their adaptability with current AI training limitations. This quote argues that humans possess a generalized learning ability that makes them efficient, unlike AI systems that require specialized, labor-intensive training for each specific task. The author finds it impractical to create custom training pipelines for every micro-task an AI might perform.
"People are really under rating how much company and context specific skills are required to do most jobs and there just isn't currently a robust efficient way for ais to pick up these skills"
The author argues that current AI systems are not adequately equipped to handle the nuances of real-world job requirements. This quote points out that most jobs demand highly specific, context-dependent skills that AI currently struggles to acquire efficiently. The author believes this limitation is underestimated.
"The reason that labs are orders of magnitude off this figure right now is that the models are nowhere near as capable as human knowledge workers"
The author directly addresses the revenue gap between AI models and human knowledge workers. This quote asserts that the primary reason for this discrepancy is the current capability deficit of AI models compared to humans. The author implies that significant improvements in AI capabilities are necessary to achieve economic parity with human labor.
Resources
External Resources
Articles & Papers
- "Language Models are Few-Shot Learners" - The GPT-3 paper, cited for demonstrating the power of in-context learning.
- "Thoughts on AI progress (Dec 2025)" (Dwarkesh Podcast) - The essay on which this podcast episode is based.
- o-series benchmark results - Used to infer that a significant scale-up in RL compute is needed for gains comparable to those from pre-training.
People
- Barron Millage - Made an interesting point about the cost of training frontier models and suggested a future of continual learning agents.
- Toby Ord - Authored a post connecting o-series benchmark results to the large scale-up in RL compute that would be necessary.
- Andrej Karpathy - Mentioned in relation to the concept of a "cognitive core" within specialized agents.
- Satya Nadella - Quoted regarding the potential impact of fully solved continual learning.
Websites & Online Resources
- dwarkesh.com - The blog where the original essay was published and where more essays are being published.
- dwarkesh.com/subscribe - Subscription page for full access to the Dwarkesh Podcast.
Other Resources
- Reinforcement Learning (RL) - Discussed as an approach to training models, with questions about its effectiveness and the supply chain of companies building RL environments.
- Artificial General Intelligence (AGI) - The concept of human-like intelligence in machines, discussed in relation to its imminence and the capabilities required.
- Continual Learning - Presented as the likely main driver of future AI improvement, with an analogy to how humans become more capable through experience.
- In-context learning - Demonstrated by GPT-3; a capability on which progress is still being made.
- Pre-training - Described as a scaling approach with a clean, general trend of improvement, in contrast with RL from verifiable rewards.