AI's Missing Dimension: World Models for 3D Spatial Intelligence

Original Title: Fei-Fei Li: World Models and the Multiverse
AI + a16z · Listen to Original Episode →

The conversation between Fei-Fei Li and Martin Casado on the a16z podcast reveals a fundamental gap in current AI development: the lack of true spatial understanding. While Large Language Models (LLMs) have revolutionized text-based AI, they operate largely divorced from the physical world. This piece argues that the next significant leap in AI will be driven by "world models" -- systems capable of perceiving, reasoning about, and acting within 3D space. The hidden consequence of focusing solely on language is an AI that can talk about the world but cannot truly inhabit or manipulate it. This insight is crucial for anyone building or deploying AI systems, offering a strategic advantage by focusing on the foundational intelligence required for embodied AI, robotics, and rich virtual experiences. Those who grasp this can position themselves at the forefront of AI's next evolutionary phase.

The Unseen Dimension: Why Space, Not Just Words, Will Define AI's Future

The current AI landscape is dominated by the dazzling capabilities of Large Language Models (LLMs). We marvel at their ability to generate text, code, and even hold conversations. Yet, Fei-Fei Li, co-founder and CEO of World Labs, argues compellingly that this focus on language represents a significant blind spot. In her conversation with Martin Casado on the a16z podcast, Li posits that true artificial general intelligence requires a fundamental understanding of space -- the 3D, physical world we inhabit. This isn't a minor oversight; it's a missing pillar that limits AI's potential to interact with and shape reality.

The immediate success of LLMs, as Casado notes, has been remarkable, even surprising. The computational power and data-driven approaches that fueled their rise have yielded capabilities that outpace human efficiency in language tasks. However, this path, while successful, bypassed the more ancient and arguably more fundamental challenge of spatial intelligence. "The part of our brain that actually deals with language is actually pretty recent," Casado explains, contrasting it with the eons of evolutionary development dedicated to spatial navigation and interaction. This suggests that focusing solely on language is akin to mastering a sophisticated communication tool without understanding the environment it describes.

The Illusion of Understanding: When Language Fails to Map Reality

The core argument against relying solely on LLMs for comprehensive AI understanding is their inherent disconnect from the physical world. While an LLM can describe a room, it cannot perceive it. As Li illustrates, "language is an incredibly powerful encoding of thoughts and information, but it's actually not a powerful encoding of what the 3D physical world that all animals and living things live in." This distinction is critical. Language is a representational tool, often lossy, whereas interaction with the physical world demands precise, dynamic understanding.

Consider the thought experiment of being blindfolded in a room and only receiving verbal descriptions versus being able to see and interact with it. The latter offers a far richer, more actionable understanding. Casado elaborates, "reality is so complex and it's so exact... if I took off the blindfold and you could see the actual space right then you can actually go and manipulate things and touch things." This highlights the inadequacy of language alone for tasks requiring physical manipulation or navigation. The immediate payoff of LLMs, their ability to generate fluent text, masks the downstream consequence of an AI that can discuss physics but cannot perform it.

"The part of our brain that actually does the navigation, you know, the spatial part, has been around, it's a million brains, maybe. The reptilian brain's been around for 4 million years. It's even more than that; it's a trilobite brain."

-- Martin Casado

This evolutionary perspective underscores the depth of spatial intelligence. It’s not just about recognizing objects; it's about understanding their relationships, their properties, and how to interact with them in three dimensions. This is the realm of "world models," the focus of Fei-Fei Li's new venture, World Labs. These models aim to imbue AI with the ability to perceive, reason about, and act within the 3D world, a capability that LLMs, by their very nature, lack.

The Generative Leap: From Reconstruction to Infinite Universes

The breakthrough potential of world models lies in their ability to move beyond mere perception to active generation and manipulation of 3D space. Casado describes how these models can take a 2D view and create a full 3D representation, including what's not visible. But the true power emerges when this capability becomes generative. "The ability to fill out the back of the table means that you can fill out stuff that was never there to begin with," he explains. This generative capacity unlocks a "multiverse" of possibilities.

This isn't just about creating more realistic video games or virtual environments, though those are significant applications. It's about enabling AI to construct, reconstruct, and interact with the world in ways that LLMs cannot. Imagine robots that can truly understand and navigate complex environments, or design tools that can be physically manipulated and tested within a digital space before being built. This requires an AI that grasps the physics of interaction, not just the grammar of description.

The journey to this point has been a long one, and the current LLM boom, while impressive, has inadvertently highlighted the missing piece. Li emphasizes that they are not dismissing language models but rather inspired by their success to pursue the equally, if not more, fundamental problem of spatial intelligence. The challenges in areas like autonomous vehicles, which are essentially 2D navigation problems, have demonstrated the difficulty of real-world interaction. The generative wave, however, offers new insights into how to tackle these spatial challenges.

"The fact is that physics happens in 3D, and interaction happens in 3D. Navigating behind the back of the table needs to happen in 3D. Composing the world, whether physically or digitally, needs to happen in 3D. So fundamentally, the problem is a 3D problem."

-- Fei-Fei Li

Li's personal anecdote about losing stereo vision powerfully illustrates the necessity of 3D perception. The inability to accurately judge distances, even in a familiar neighborhood, rendered driving a frighteningly slow and precarious task. This visceral experience underscores why computers, unlike humans who can mentally reconstruct 3D from 2D, require explicit 3D data to navigate and interact with the world effectively. This is the gap World Labs aims to fill, building AI that doesn't just process information but understands and acts within the physical reality.

The Long Game: Building Moats Through Spatial Intelligence

The development of world models represents a strategic investment in the future of AI, one that requires patience and a long-term perspective. Unlike the relatively rapid advancements in LLMs, mastering spatial intelligence is a more arduous, yet potentially more rewarding, endeavor. The research in areas like Neural Radiance Fields (NeRF) and Gaussian splatting, pioneered by World Labs' co-founders, demonstrates the foundational work being done. However, bringing these academic breakthroughs to industrial-grade, productized solutions demands concentrated effort, talent, and compute -- precisely what World Labs is assembling.
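To make that research lineage a bit more concrete, here is a minimal NumPy sketch of the volume-rendering step at the heart of NeRF-style methods: densities and colors sampled along a camera ray are alpha-composited, front to back, into a single pixel color. This is an illustrative simplification, not World Labs' code, and the function names are my own:

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite per-sample densities and RGB colors along one camera ray
    into a pixel color, NeRF-style.

    sigmas: (N,) volume densities at each sample along the ray
    colors: (N, 3) RGB color predicted at each sample
    deltas: (N,) distances between consecutive samples
    """
    # Opacity contributed by each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: how much light survives past all earlier samples
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    # Each sample's weight is its opacity times the light reaching it
    weights = trans * alphas
    # Weighted blend of the sample colors gives the final pixel color
    return (weights[:, None] * colors).sum(axis=0)

# A fully opaque red sample renders as pure red; empty space renders black.
pixel = volume_render(np.array([1e6]),
                      np.array([[1.0, 0.0, 0.0]]),
                      np.array([1.0]))
```

Gaussian splatting swaps the per-ray samples for rasterized 3D Gaussians, but the same front-to-back alpha-compositing idea governs how contributions blend, which is why the two lines of research are so often mentioned together.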

This focus on spatial intelligence creates a durable competitive advantage. While many are rushing to optimize LLMs, those who invest in building true world models are addressing a more fundamental bottleneck. The immediate discomfort of tackling a complex, less glamorous problem than LLM fine-tuning yields a significant future payoff. It's about building AI that can not only converse but do, not just in the digital realm but in the physical world. This requires a multidisciplinary team, bridging the gap between AI, computer vision, and computer graphics.

"We've also got a co-founder, Christoph Lassner, whose pioneering work was part of the reason Gaussian splat representation started to again become really popular as a way to represent volumetric 3D. And of course Justin Johnson, who was my former student and also a co-founder of World Labs; we were among the first generation of deep learning computer vision students who did so much foundational work in image generation, back before transformers were out."

-- Fei-Fei Li

Ultimately, the pursuit of world models is not just about technological advancement; it's about unlocking new paradigms for creativity, robotics, and human-computer interaction. By focusing on the fundamental dimension of space, Fei-Fei Li and Martin Casado are charting a course for AI that is not only intelligent but also embodied, capable of truly understanding and shaping the world around us. This requires moving beyond the immediate gratification of language generation to the deeper, more complex, and ultimately more powerful domain of spatial intelligence.

Key Action Items

  • Immediate Action (Next Quarter):

    • Familiarize yourself with the core concepts of world models and spatial intelligence in AI.
    • Identify 1-2 current AI projects or workflows where a lack of spatial understanding might be a limiting factor.
    • Begin exploring existing research or tools in 3D reconstruction and spatial AI (e.g., NeRF, Gaussian splatting).
  • Short-Term Investment (Next 6-12 Months):

    • Evaluate the potential impact of embodied AI and spatial reasoning on your industry or field.
    • Consider how current LLM applications could be enhanced by integrating spatial data or capabilities.
    • Investigate partnerships or collaborations with entities focused on spatial computing or robotics.
  • Long-Term Investment (12-18 Months and Beyond):

    • Develop a strategic roadmap for integrating world models into your core AI strategy, anticipating future needs for embodied AI.
    • Foster talent development in areas combining AI, computer vision, and 3D graphics.
    • Experiment with building or leveraging foundational world models for novel applications in robotics, design, or virtual environments.
    • Embrace the discomfort of focusing on a less-hyped but foundational area (spatial intelligence) to build a durable, long-term competitive advantage.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.