Open-Source Inference Layer Standardizes Diverse AI Models and Hardware
The Hidden Engine of AI: Why Inference, Not Just Training, Is the Next Frontier
The prevailing narrative of AI progress often centers on the dazzling breakthroughs in model creation -- bigger, smarter, more capable models. Yet, beneath this visible layer of innovation lies a far more complex and critical systems challenge: inference. This conversation with Simon Mo and Rusa Quan, co-founders of Inferact and creators of the open-source inference engine vLLM, reveals that the true bottleneck in deploying AI isn't just building models, but making them run efficiently, reliably, and affordably in the chaotic real world. The non-obvious implication is that the infrastructure supporting AI may ultimately be more impactful than the models themselves. Anyone building or deploying AI, from individual developers to large enterprises, needs to understand this hidden layer. Mastering inference infrastructure offers a significant competitive advantage by unlocking the practical, scalable deployment of AI capabilities, a feat that conventional approaches struggle to achieve.
The Unseen Friction: Why AI Inference Is a Systemic Nightmare
The journey from a trained AI model to a deployed, responsive application is fraught with complexities that defy traditional computing paradigms. While the public eye focuses on the marvel of model training, the act of running these models -- inference -- has quietly become one of the most intractable problems in modern AI. This isn't just about raw computational power; it's about managing an inherently unpredictable workload that strains hardware and software designed for more predictable tasks.
The core of the issue, as explained by Mo and Quan, lies in the fundamental difference between traditional machine learning workloads and the auto-regressive nature of large language models (LLMs). For years, ML inference involved tasks like image recognition where inputs could be standardized -- images resized or cropped to a uniform tensor. This regularity allowed for efficient batching and processing on GPUs. LLMs, however, shatter this predictability.
"Your prompt can be either like 'hello,' a single word, or your prompt can be a bunch of documents spanning hundreds of pages. And this kind of dynamism exists inherently in the language model, and this makes things a whole kind of into a different world. We have to handle this dynamism as a first-class citizen."
This inherent dynamism means that each request can vary wildly in length and complexity. Furthermore, the output itself is non-deterministic; the model decides when to stop generating tokens. This stochastic, continuous flow contrasts sharply with the clockwork predictability of older ML systems. The immediate consequence is a massive challenge in scheduling requests and managing memory, particularly the crucial KV cache, which stores the attention keys and values of earlier tokens so they don't have to be recomputed at every generation step. vLLM's initial breakthrough, PagedAttention, directly addressed this by managing KV-cache memory in fixed-size blocks rather than contiguous per-request buffers, handling the memory demands of these dynamic sequences in a way that traditional batching methods simply couldn't.
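To make the memory problem concrete, here is a minimal sketch, in plain Python, of the bookkeeping idea behind PagedAttention: KV-cache memory is carved into fixed-size blocks in a shared pool, and each sequence maps to its blocks through a block table, so a one-word prompt and a hundred-page prompt can coexist without reserving worst-case contiguous buffers up front. The class, block size, and pool size below are invented for illustration and are not vLLM's actual data structures.

```python
# Simplified sketch of paged KV-cache bookkeeping (illustrative only,
# not vLLM's real implementation). Each sequence gets a "block table"
# mapping its tokens to physical blocks drawn from a shared pool.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class PagedKVCachePool:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids

    def allocate(self, seq_id: str, num_new_tokens: int) -> None:
        """Grow a sequence's block table to cover num_new_tokens more tokens.

        For simplicity, partially filled last blocks are not reused across calls.
        """
        table = self.block_tables.setdefault(seq_id, [])
        blocks_needed = -(-num_new_tokens // BLOCK_SIZE)  # ceiling division
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        for _ in range(blocks_needed):
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


pool = PagedKVCachePool(num_blocks=4096)
pool.allocate("short-prompt", num_new_tokens=5)           # needs 1 block
pool.allocate("hundred-page-doc", num_new_tokens=20_000)  # needs 1,250 blocks
pool.free("short-prompt")                                 # blocks return to the pool
```

Because blocks go back to the pool the moment a sequence finishes, the scheduler can keep admitting new requests of unpredictable length instead of fragmenting memory around a few long-running ones.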
The Compounding Challenges: Scale, Diversity, and the Dawn of Agents
The difficulty of inference is not a static problem; it’s an escalating one. Mo and Quan highlight three key trends that are making inference progressively harder: scale, diversity, and the emergence of AI agents.
First, the sheer scale of models is exploding. With models now boasting trillions of parameters, deployment requires distributing weights across multiple GPUs and nodes. This introduces intricate challenges in determining the optimal way to shard these massive models, weighing communication overhead against even load distribution to avoid performance bottlenecks. As models grow, so does the complexity of managing their distribution.
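To give a feel for the arithmetic behind those sharding decisions, the hypothetical helper below estimates how many GPUs are needed just to hold a model's weights, before any KV cache or activation memory is accounted for. The function name, the 80 GiB GPU size, and the 70% weight budget are assumptions chosen for illustration only.

```python
# Back-of-the-envelope sharding estimate (illustrative assumptions only).

def min_gpus_for_weights(num_params: float, bytes_per_param: int = 2,
                         gpu_memory_gib: float = 80.0,
                         weight_fraction: float = 0.7) -> int:
    """Smallest GPU count whose combined memory budget fits the weights.

    weight_fraction reserves the rest of each GPU for KV cache, activations,
    and runtime overhead; 0.7 is an arbitrary illustrative choice.
    """
    weight_bytes = num_params * bytes_per_param
    budget_per_gpu = gpu_memory_gib * (1024 ** 3) * weight_fraction
    return max(int(-(-weight_bytes // budget_per_gpu)), 1)  # ceiling division

# A 70B-parameter model in 16-bit weights needs ~130 GiB for weights alone,
# so about 3 x 80 GiB GPUs under this budget -- often rounded up in practice
# so attention heads divide evenly across tensor-parallel ranks.
print(min_gpus_for_weights(70e9))   # -> 3
print(min_gpus_for_weights(1e12))   # -> 34 under the same assumptions
```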
Second, diversity is rampant, both in hardware and model architectures. The AI ecosystem is a mosaic of different accelerator architectures (Nvidia, AMD, Intel, TPUs) and increasingly divergent model designs. While early LLMs largely converged on similar transformer architectures, newer models are exploring innovations like sparse attention and linear attention. Each of these architectural nuances requires specialized optimization within the inference engine. The vLLM team, by fostering a broad community and collaborating with model vendors, aims to create a universal layer that can adapt to this constant flux, rather than forcing users to maintain bespoke stacks for every new model or chip. This diversity, while powerful, means a rigid, one-size-fits-all inference stack is increasingly untenable.
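One concrete payoff of such a universal layer is that application code stays the same as models and parallelism layouts change underneath it. The snippet below is a minimal sketch of vLLM's offline Python API as found in recent releases; the model name and settings are illustrative, not a recommendation.

```python
# Minimal sketch of vLLM's offline inference API (assumes a recent vLLM
# release; the model id and settings are placeholders).
from vllm import LLM, SamplingParams

# The same few lines serve a small dense model on one GPU or a larger model
# sharded across several -- only the constructor arguments change.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported Hugging Face model id
    tensor_parallel_size=1,                    # raise for multi-GPU sharding
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize why LLM inference is hard to batch."], params)

for out in outputs:
    print(out.outputs[0].text)
```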
The third and perhaps most profound challenge is the rise of agents. AI systems are moving beyond simple text-in, text-out paradigms to become interactive agents that can use tools, perform searches, and engage in complex, multi-turn dialogues. This fundamentally alters the state management problem.
"As before, the paradigm has been text in, text out, and then just single request response. And then, but as we evolve into the year and the decade of agents, we're seeing multi-turn conversation turning into hundreds and thousands of turns. And then these turns also involves external tool use..."
This shift disrupts traditional caching strategies. In agentic workflows, the duration of a "turn" can be highly variable, ranging from milliseconds for a script execution to hours if human input is involved. This uncertainty makes it difficult to determine when a conversation is truly over and when its associated KV cache can be safely evicted. The implication is that inference systems will need to become far more sophisticated at managing state and predicting request completion, a problem that, as Mo humorously notes, touches on one of the famously "unsolvable problems in computer science": cache invalidation.
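To see why this is awkward, consider the hypothetical idle-timeout eviction policy sketched below: from the cache's point of view, an agent pausing for a 50-millisecond tool call and an agent waiting hours for a human reply look identical, so any fixed threshold either hoards memory or discards cache that the very next turn will have to recompute. This is an illustration of the dilemma, not a policy vLLM implements in this form.

```python
# Hypothetical idle-based KV-cache eviction policy (illustrative only).
import time

class ConversationCache:
    def __init__(self, idle_timeout_s: float = 300.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}   # conversation id -> timestamp of its last turn
        self.kv_blocks = {}   # conversation id -> cached KV blocks (opaque here)

    def touch(self, conv_id: str, kv_blocks) -> None:
        """Record a new turn and keep its KV blocks warm."""
        self.last_seen[conv_id] = time.monotonic()
        self.kv_blocks[conv_id] = kv_blocks

    def evict_idle(self) -> list[str]:
        """Evict conversations idle longer than the timeout.

        The hard part: a pending tool call and a pending human reply look the
        same from here, so a fixed timeout either wastes memory on dead
        conversations or evicts cache that must be re-prefilled next turn.
        """
        now = time.monotonic()
        evicted = [cid for cid, ts in self.last_seen.items()
                   if now - ts > self.idle_timeout_s]
        for cid in evicted:
            self.last_seen.pop(cid)
            self.kv_blocks.pop(cid)
        return evicted
```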
The Open Source Advantage: Building a Universal Standard
Faced with these escalating challenges, Mo and Quan are staunch advocates for open source. They argue that the complexity of AI infrastructure necessitates a diversity of approaches that proprietary, closed-source systems cannot accommodate.
"We're definitely big believers in open source. What we believe is diversity will triumph that sort of single of anything at all. So that means we believe in diversities in models, diversity in chip architecture. Fundamentally, this is because the world is complex."
They draw parallels to the foundational systems of computing -- operating systems, databases, cluster managers -- which all thrived when common standards emerged, allowing innovation to build upon a shared base. For enterprises, this means that relying on closed-source, vertically integrated solutions limits their ability to fine-tune models for their specific hardware and use cases. An open-source inference layer, like vLLM, provides a common ground where model providers, hardware manufacturers, and application developers can all contribute and benefit. This collaborative ecosystem, they contend, has an execution capability far beyond any single entity. The rapid pace of innovation in open source, particularly in vLLM, means that even large companies struggle to keep up with its advancements, making adoption of such projects a strategic necessity for staying competitive.
From Community to Company: The Genesis of Inferact
The success of vLLM, which has grown from a small research project into an open-source juggernaut with hundreds of thousands of GPUs running it daily, naturally led to the formation of Inferact. The company's mission is to accelerate the development and adoption of vLLM, positioning it as the universal inference engine for AI.
"It is only that VLM win, VLM becomes the standard, and VLM help everybody to achieve what they need to do, then our company, in the sense, have the right meaning and to be able to support everybody around it. So open source is definitely number one, and in fact, something the only priority of our company right now."
Inferact’s strategy is to heavily invest in the open-source project, ensuring its continued evolution and robustness. This commitment is not just altruistic; it's a strategic advantage. By stewarding the open-source ecosystem, Inferact gains unparalleled insight into the cutting edge of inference technology and builds a foundation that benefits all participants, including their own commercial offerings. Their focus is on building a horizontal abstraction layer for accelerated computing, akin to how operating systems abstract CPUs and memory. This layer will abstract away the complexities of GPUs and other accelerators, enabling developers to focus on building AI applications rather than wrestling with the intricacies of hardware optimization. This universal inference layer is seen as essential for unlocking the next generation of AI applications, particularly those involving agents and complex interactions.
Key Action Items
- Immediate Action (Next 1-3 Months):
- Evaluate vLLM for Current Deployments: Assess whether vLLM can replace or augment existing inference solutions to improve efficiency and reduce costs.
- Explore vLLM Community Resources: Engage with vLLM documentation, forums, and GitHub to understand its capabilities and best practices.
- Benchmark vLLM Performance: Conduct small-scale performance tests with representative models and hardware to quantify potential gains (a minimal starting-point sketch follows this list).
- Short-Term Investment (Next 3-6 Months):
- Integrate vLLM for New Projects: Prioritize vLLM for any new AI model deployment to leverage its advanced features from the outset.
- Train Key Personnel: Invest in training ML infrastructure engineers on vLLM's architecture and optimization techniques.
- Contribute to vLLM (Optional but Recommended): Identify small bug fixes or documentation improvements to contribute back to the open-source project, fostering deeper understanding and community engagement.
- Long-Term Investment (6-18+ Months):
- Develop Custom Optimizations: For critical workloads, consider investing in specialized kernel development or contributions to vLLM to tailor performance for specific hardware or model architectures.
- Monitor Agentic Workflow Requirements: Proactively research and plan for the inference infrastructure needs of future AI agent applications, anticipating increased state management complexity.
- Strategic Partnership with Inferact: Evaluate potential partnerships or commercial support from Inferact to gain dedicated expertise and accelerate adoption of vLLM as a universal inference layer.
- Advocate for Open Standards: Champion open-source inference standards within your organization and industry to foster interoperability and avoid vendor lock-in.
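For the benchmarking item above, a rough starting point might look like the sketch below: push a batch of stand-in prompts through vLLM's offline API and report generated tokens per second. The model, prompt, and settings are placeholders; representative workloads and vLLM's own benchmarking tools should inform any real evaluation.

```python
# Rough throughput check using vLLM's offline API (illustrative settings only).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # replace with your model
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Explain KV caching in two sentences."] * 64  # stand-in workload

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```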