NDI Initiative: Program Synthesis Over LLM Scaling for True Intelligence

Original Title: How François Chollet Is Building A New Path To AGI

François Chollet's NDI initiative marks a pivot in AI research, away from the current deep learning paradigm and toward the foundational principles of intelligence. The less obvious implication of this conversation is that scaling existing LLM architectures may be an inefficient, and perhaps ultimately limited, path toward Artificial General Intelligence (AGI). Chollet instead advocates a return to first principles, focusing on program synthesis and symbolic reasoning to build AI that is not only capable but also optimally efficient. The discussion is relevant to researchers, engineers, and strategists who want to understand the potential limitations of current AI trends, anticipate where capabilities may go next, and build more robust, adaptable systems.

The Unseen Wall: Why Scaling LLMs Might Not Lead to True Intelligence

The current AI landscape is dominated by the impressive capabilities of Large Language Models (LLMs), fueled by massive datasets and computational power. However, François Chollet, through his work at NDI (New Directions in AI) and the development of the ARC benchmark, argues that this dominant paradigm, while yielding remarkable results in specific domains, may be hitting fundamental limits. The core of his argument lies in the distinction between optimizing current systems and building truly intelligent, adaptable agents.

Chollet’s vision for NDI is to forge a new branch of machine learning, one that prioritizes optimality and efficiency over sheer scale. This involves a radical departure from the parametric, gradient-descent-based approach of deep learning. Instead, NDI focuses on program synthesis, aiming to create the smallest possible symbolic models that explain data. This approach, he posits, is inherently more aligned with the principles of intelligence, requiring less data and running more efficiently.
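To make "the smallest possible symbolic models that explain data" concrete, here is a minimal sketch of enumerative program synthesis. It is not NDI's method, which the conversation does not detail. The toy DSL (PRIMITIVES) and the helper names run and synthesize are assumptions made up for illustration. The search tries programs in order of increasing length, so the first program consistent with all examples is also a shortest one.

```python
from itertools import product

# Hypothetical toy DSL (an assumption for illustration): each primitive maps int -> int.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg": lambda x: -x,
}


def run(program, x):
    """Apply the primitives named in `program` to x, left to right."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x


def synthesize(examples, max_len=4):
    """Return a shortest program consistent with every (input, output) pair.

    Programs are enumerated by increasing length, so the first match is minimal.
    Returns None if nothing up to max_len fits.
    """
    for length in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# Three examples of the hidden rule f(x) = 2 * (x + 1).
examples = [(1, 4), (2, 6), (3, 8)]
print(synthesize(examples))  # → ('inc', 'double')
```

Three examples are enough to pin down the rule here, which is the point of the approach: a short symbolic model can be recovered from very little data. The cost is that the search space grows exponentially with program length, which is why naive enumeration does not scale on its own.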

"We're actually trying to rebuild the whole stack onto different foundations. So we're building a new learning substrate that's very different from, you know, parametric learning, deep learning."

-- François Chollet

The immediate success of coding agents, which leverage LLMs to generate and verify code, highlights a critical insight: verifiable reward signals are the key to unlocking current AI capabilities. Code, mathematics, and similar domains provide a clear, objective measure of success. This allows for extensive post-training, often through reinforcement learning loops, where models can iteratively improve by generating solutions, testing them, and refining their approach. This is precisely how coding agents have achieved such impressive product-market fit. However, Chollet cautions that this success is domain-specific and relies on the verifiability of the output, not necessarily on a deeper, more generalized form of intelligence.
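The role of a verifiable reward can be sketched in a few lines. This is a simplified illustration, not how any particular coding agent is trained: the hard-coded candidates list stands in for model samples, and the verify function and the solve entry-point name are hypothetical. The key property is that the reward comes from running the code against tests, not from a human or learned judgment.

```python
def verify(candidate_source, tests, func_name="solve"):
    """Return 1.0 if the candidate passes every test, else 0.0.

    Assumption: candidates define a function called `func_name`.
    Note: exec on untrusted code is unsafe; a real system would sandbox this.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all earn zero reward.
        return 0.0


# Stand-ins for sampled model outputs (hypothetical).
candidates = [
    "def solve(a, b): return a - b",   # wrong logic
    "def solve(a, b): return a + b",   # correct
    "def solve(a, b) return a * b",    # does not parse
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

rewards = [verify(c, tests) for c in candidates]
# Only verified solutions would be kept as training signal.
verified = [c for c, r in zip(candidates, rewards) if r == 1.0]
print(rewards)  # → [0.0, 1.0, 0.0]
```

In domains without such a checker, such as open-ended writing, there is no equivalent of verify to call, which is the limitation the argument turns on.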

The implication is that LLMs can become highly proficient at tasks with clear, verifiable outcomes, but their ability to generalize to less structured, non-verifiable domains--like creative writing or complex problem-solving without predefined metrics--may be inherently limited by their reliance on vast training data and their lack of true symbolic reasoning. This is where the conventional wisdom of simply scaling up LLMs further begins to falter. Chollet suggests that LLMs might eventually contribute to AGI, but that they are not the optimal foundation.

The Efficiency Imperative: Beyond Brute Force Intelligence

Chollet's definition of AGI centers on skill acquisition efficiency, mirroring human-level learning speed and adaptability across arbitrary tasks. This contrasts sharply with definitions focused solely on automating economically relevant tasks. He argues that current LLM-based approaches, while capable of automation, are fundamentally inefficient. The immense computational resources and data required to train these models stand in stark contrast to how humans learn.

"General intelligence is human-level skill acquisition efficiency on the same scope of tasks that humans could potentially learn to do."

-- François Chollet

This pursuit of efficiency is the driving force behind NDI's exploration of "symbolic descent," an analogue to gradient descent but operating in the symbolic space. The goal is to discover the most concise symbolic models, adhering to the minimum description length principle, which suggests that shorter models are more likely to generalize. This focus on conciseness and optimality is what Chollet believes will unlock true general intelligence, rather than the "brute force" approach of scaling up parametric models. The ARC benchmark series, particularly V3, is designed to test this very efficiency, measuring an agent's ability to explore, learn, and adapt in novel environments with human-like speed.
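The minimum description length principle can be illustrated with a toy two-part code: the total cost of a model is the bits needed to state the model plus the bits needed to describe whatever the model fails to explain. The sketch below does not implement "symbolic descent," which the conversation does not specify. The cost accounting (32 bits per stored number, one flag bit per data point) and the three candidate models are assumptions chosen for illustration.

```python
BITS_PER_NUMBER = 32  # assumed fixed cost of storing one number


def total_bits(params, predict, data):
    """Two-part description length: model cost plus cost of unexplained data."""
    model_bits = BITS_PER_NUMBER * len(params)
    mismatches = sum(1 for i, y in enumerate(data) if predict(params, i) != y)
    # One flag bit per point ("did the model get it right?"),
    # plus a verbatim copy of every point the model got wrong.
    data_bits = len(data) + BITS_PER_NUMBER * mismatches
    return model_bits + data_bits


data = [3, 5, 7, 9, 11, 13, 15, 17]

# Each candidate is (parameters, prediction function of (params, index)).
candidates = {
    "lookup table": (list(data), lambda p, i: p[i]),           # memorizes everything
    "linear rule": ([2, 3], lambda p, i: p[0] * i + p[1]),     # y = 2*i + 3
    "constant": ([3], lambda p, i: p[0]),                      # y = 3
}

scores = {name: total_bits(p, f, data) for name, (p, f) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)  # → {'lookup table': 264, 'linear rule': 72, 'constant': 264}
print(best)    # → linear rule
```

The lookup table fits the data perfectly but pays for memorizing it, and the constant is cheap to state but explains almost nothing. The linear rule wins because it is both short and accurate, and it is also the only one of the three that would extend correctly to a ninth data point.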

The ARC Benchmark: A Barometer for Fluid Intelligence

The evolution of the ARC (Abstraction and Reasoning Corpus) benchmark series tracks both AI progress and its limitations. ARC V1 and V2, while challenging, were each eventually "saturated": V1 by advanced reasoning techniques and V2 by reinforcement learning techniques. This saturation, however, revealed a crucial distinction: the models weren't necessarily becoming "smarter" in a general sense, but rather more adept at exploiting specific training paradigms and verifiable reward signals.

"The progress you saw on ARC V2 is actually the result of this very, very large scale targeting. So what you can do to solve ARC V2 is you ask your reasoning model to make more tasks like the ones in the benchmark. And then you try to solve them using, let's say, program induction for instance, still using your reasoning model. Then you verify its solution. Again, it's verifiable. So you can trust the answer. And then you fine-tune the model on the successful reasoning chains. And then you keep repeating."

-- François Chollet

ARC V3 represents a significant leap, designed to measure agentic intelligence--the ability to actively explore, set goals, and plan in novel, interactive environments without explicit instructions. Unlike its predecessors, V3 is resistant to the brute-force targeting strategies that saturated V2. It emphasizes fluid intelligence, the capacity to adapt and reason in entirely new situations, rather than simply mastering a known problem space through extensive post-training. This shift is critical because it moves the evaluation away from optimizing for a specific benchmark and towards assessing a more fundamental, human-like ability to learn and adapt. The creation of over 250 unique games for V3, built on core knowledge priors rather than borrowed game concepts, underscores this commitment to testing raw intelligence and exploration efficiency.
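One way to picture "exploration efficiency" is as a ratio of human effort to agent effort on the same game. The sketch below is a hypothetical metric for illustration only and is not the official ARC V3 scoring, which this summary does not describe. The function name efficiency, the field names, and the sample numbers are all invented.

```python
def efficiency(agent_actions, human_actions, solved):
    """Hypothetical per-game score in [0, 1].

    1.0 means the agent solved the game in no more actions than the human
    baseline; unsolved games score 0.0 regardless of effort.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)


# Invented sample runs; action counts are not real benchmark data.
runs = [
    {"game": "g1", "agent_actions": 120, "human_actions": 60, "solved": True},
    {"game": "g2", "agent_actions": 40, "human_actions": 50, "solved": True},
    {"game": "g3", "agent_actions": 500, "human_actions": 80, "solved": False},
]

scores = [
    efficiency(r["agent_actions"], r["human_actions"], r["solved"]) for r in runs
]
mean_score = sum(scores) / len(scores)
print(scores)      # → [0.5, 1.0, 0.0]
print(mean_score)  # → 0.5
```

A metric of this shape rewards solving a new game quickly rather than eventually, so an agent cannot raise its score simply by spending more actions or more compute on each environment.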

Key Action Items

  • Immediate Actions (Next 1-3 Months):

    • Explore NDI's research: Familiarize yourself with the core concepts of program synthesis and symbolic learning presented by François Chollet.
    • Study the ARC benchmark: Understand the design philosophy and evaluation criteria of ARC V1, V2, and especially V3 to grasp what constitutes "agentic intelligence."
    • Analyze coding agent successes: Deconstruct why coding agents are effective--focus on the role of verifiable reward signals and post-training.
    • Investigate alternative AI paradigms: Look beyond LLMs for research in areas like symbolic AI, genetic algorithms, and other approaches that prioritize efficiency and fundamental reasoning.
    • Assess current AI limitations: Critically evaluate where current LLM-based systems struggle, particularly in non-verifiable domains and tasks requiring novel problem-solving.
  • Longer-Term Investments (6-18 Months & Beyond):

    • Develop program synthesis capabilities: If building AI systems, consider integrating program synthesis techniques to create more efficient and adaptable models.
    • Focus on efficiency metrics: Shift evaluation from raw performance on known tasks to metrics of sample efficiency and adaptability in novel environments.
    • Investigate "harness" development: Understand how structured "harnesses" can enable LLMs to tackle complex problems, but recognize this as an augmentation, not a replacement for fundamental intelligence.
    • Prioritize foundational research: For organizations or researchers, consider dedicating resources to exploring alternative AI architectures that move beyond scaling current LLMs, focusing on first principles.
    • Build for compounding advantage: Develop AI systems that can improve their own learning and problem-solving capabilities over time, creating a self-reinforcing cycle of intelligence.
    • Embrace discomfort for future gains: Recognize that NDI's approach requires significant foundational work and may not yield immediate, visible results like LLM scaling. This long-term investment, though currently uncomfortable for many, could lead to a durable competitive advantage in future AI capabilities.

---
This content is a personally curated review and synopsis derived from the original podcast episode.