Open-Source Inference Layer Addresses Dynamic LLM Workloads
The quiet revolution in AI isn't about smarter models, but about the unseen infrastructure that makes them usable. This conversation with Simon Mo and Woosuk Kwon, founders of Inferact and creators of the open-source vLLM inference engine, reveals how the very act of running AI models has become a monumental systems challenge, eclipsing the perceived difficulty of training them. The non-obvious implication? The future of AI's accessibility and efficiency hinges not on algorithmic breakthroughs, but on mastering the chaotic, unpredictable demands of inference. Engineers and product leaders who grasp the systemic nature of inference infrastructure gain a critical advantage in building reliable, scalable AI applications, while those clinging to traditional computing paradigms risk being left behind.
The Unseen Bottleneck: Why Inference is the New Frontier
The narrative of AI progress is dominated by headlines about model breakthroughs and ever-increasing parameter counts. Yet beneath this visible layer of innovation lies a far more intricate and pressing problem: making these powerful models actually run. As Simon Mo and Woosuk Kwon explain, traditional computing put the hard part in building the system; in the era of large language models (LLMs), the hard part has shifted to operating it. LLMs, with their unpredictable inputs and outputs, defy the clockwork regularity of older systems. This inherent dynamism, where a prompt can be a single word or an entire book and an output can arrive instantly or stretch on indefinitely, places unprecedented demands on hardware.
This unpredictability is the crux of why inference has become so challenging. Traditional machine learning workloads, like image processing, often involve resizing and batching inputs into regular, static tensors. This predictable structure makes efficient processing on GPUs straightforward. LLMs, however, are fundamentally different. Their auto-regressive nature means each generated token depends on the previous ones, creating a continuous, flowing stream of computation where the length and timing of each request are unknown. This requires a sophisticated approach to scheduling and memory management, areas where vLLM has made significant strides.
"The problem we're solving before the serving system is about just what we call micro-batching... But the change in the LLM world is you always have requests that are continuously flowing and coming in, and then each request looks differently. You just need to normalize. So that's why you have to have a notion of a step within the LLM engine to process one token across all the requests at the same time, regardless of each request having different kinds of input lengths and output lengths."
The complexity is further amplified by the sheer scale and diversity of modern AI. Mo and Kwon highlight that vLLM is currently running on an estimated 400,000 to 500,000 GPUs globally, a testament to its adoption. This scale, however, introduces its own set of challenges. Distributing massive models across multiple GPUs and nodes requires careful consideration of sharding strategies, balancing communication overhead against load balancing to maintain performance. Moreover, the AI landscape is characterized by a dizzying array of model architectures, from sparse attention to linear attention, and hardware types, including various Nvidia, AMD, and Intel chips. vLLM's success lies in its ability to abstract this complexity, providing a common runtime that supports this burgeoning diversity. This universality is not just a technical feat; it is a strategic imperative, enabling model providers, silicon manufacturers, and infrastructure builders to collaborate on a shared foundation.
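As an illustration of how that abstraction looks from the user's side, the hedged snippet below shards a model across several GPUs using vLLM's offline Python API. The model name and GPU count are placeholders, and option names can change between releases, so treat this as a sketch and check the current vLLM documentation before relying on it.

```python
# Hedged example: sharding a model across GPUs with vLLM's offline Python API.
# Model id and tensor_parallel_size are placeholders for your own setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    tensor_parallel_size=4,                     # shard weights across 4 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Explain continuous batching in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

In principle, pointing the same script at a different model or accelerator is a one-line change, which is the kind of portability the founders describe as the value of a common runtime.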
The Agentic Future: When AI Becomes Unpredictable by Design
The evolution of AI from single-turn tools to multi-turn agents introduces a new layer of complexity that challenges even the most advanced inference systems. As AI agents interact with external tools, perform web searches, or run code, the concept of a "finished" request becomes blurred. This uncertainty directly impacts how inference systems manage state, particularly the KV cache, a critical component for transformer models. In traditional inference, the system knows when to clear the cache once a response is complete. However, with agents, the duration of these external interactions, whether seconds, minutes, or even hours, is unknown.
"In agentic use cases, you actually don't know whether or not the agent will think it finishes or also wait. The interaction previously was just a human typing in the text box. But now it becomes external environment interaction. It could be one second just for a single script to finish. It could be 10 seconds for search or like a complex analysis to finish. And then it could also be minutes, hours if there's humans in the loop."
This unpredictability disrupts the uniform access and eviction patterns of traditional caches, forcing a re-evaluation of memory management strategies. The problem of cache invalidation, a notoriously difficult challenge in computer science, becomes even more pronounced. This shift towards agentic AI necessitates a co-optimization of agentic and inference architectures, moving beyond simple request-response paradigms to accommodate long, iterative processes involving both the LLM and its environment. Inferact's vision for a "universal inference layer" aims to address this by building a runtime that can handle not just the inference of any model on any hardware, but also the complex state management required for intelligent agents. This forward-looking approach positions Inferact to capitalize on the emerging needs of AI systems that are designed to be unpredictable.
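One way to reason about this shift is to tie KV-cache lifetime to agent sessions rather than to individual requests. The sketch below is purely hypothetical and does not reflect vLLM's internals or API: it retains a session's cached prefix while the agent waits on its environment, pins human-in-the-loop sessions, and evicts (or could spill to cheaper memory) only after a configurable idle period.

```python
# Hypothetical sketch (not vLLM's API): retaining KV-cache state for agent
# sessions whose next turn may arrive in seconds, minutes, or hours.
import time
from dataclasses import dataclass


@dataclass
class SessionCacheEntry:
    session_id: str
    kv_blocks: object        # opaque handle to the session's cached prefix
    last_active: float
    pinned: bool = False     # e.g. a human-in-the-loop session never evicted


class AgenticKVCache:
    """Evicts idle sessions by age instead of assuming requests finish promptly."""

    def __init__(self, idle_ttl_s: float = 600.0):
        self.idle_ttl_s = idle_ttl_s
        self.entries: dict[str, SessionCacheEntry] = {}

    def touch(self, session_id: str, kv_blocks: object) -> None:
        """Record (or refresh) a session each time the agent returns for a turn."""
        self.entries[session_id] = SessionCacheEntry(
            session_id, kv_blocks, last_active=time.monotonic()
        )

    def evict_idle(self) -> list[str]:
        """Free cache for sessions idle past the TTL; a real system might spill
        these blocks to CPU memory or disk instead of dropping them outright."""
        now = time.monotonic()
        evicted = [
            sid for sid, e in self.entries.items()
            if not e.pinned and now - e.last_active > self.idle_ttl_s
        ]
        for sid in evicted:
            del self.entries[sid]
        return evicted
```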
The Open-Source Advantage: Building a Universal Standard
The founders of Inferact are staunch proponents of open-source AI, viewing it as the most effective mechanism for fostering the diversity and innovation required to navigate the complex AI landscape. They argue that proprietary, closed-source models, while optimized for specific use cases, cannot match the adaptability and breadth offered by an open ecosystem. The AI world, they contend, is too complex for any single entity to dictate its direction. Instead, a common standard, built through open collaboration, allows model providers, hardware manufacturers, and application developers to iterate and build upon a shared foundation.
This philosophy extends to their company, Inferact. Their primary mission is to support, maintain, and advance the open-source vLLM ecosystem. They believe that vLLM becoming the de facto standard for AI inference is not just beneficial for the community but is the very reason their company exists. This commitment to open source provides a significant strategic advantage, allowing them to leverage the collective effort of a global community of contributors, which they note is often faster and more innovative than any single corporate team.
"It is only that VLM win, VLM becomes the standard, and VLM help everybody to achieve what they need to do, then our company, in the sense, have the right meaning and to be able to support everybody around it."
The success of vLLM, with over 2,000 contributors and widespread adoption by major entities like Amazon for its Rufus assistant bot, underscores the power of this open-source approach. Companies like Catcher AI have even deployed unmerged, experimental features from vLLM at scale, demonstrating the community's eagerness to adopt cutting-edge advancements. Inferact's strategy is to build a horizontal abstraction layer for accelerated computing, akin to operating systems for CPUs or databases for storage, but specifically for inference. This universal inference layer will abstract away the complexities of GPUs and other specialized hardware, enabling developers to focus on building AI applications rather than wrestling with underlying infrastructure. This focus on a fundamental, widely applicable software layer, rather than a vertical slice of a specific product, is where they see the greatest long-term impact and competitive advantage.
Key Action Items
- Immediately: Begin mapping the inference requirements of your current and future AI applications. Understand the dynamic nature of LLM inputs/outputs and how they differ from traditional workloads.
- Within the next quarter: Evaluate your current inference infrastructure. Identify potential bottlenecks related to scheduling, memory management, and hardware utilization. Consider adopting or contributing to open-source inference engines like vLLM.
- Over the next 6-12 months: Investigate how agentic AI workloads will impact your inference needs. Begin exploring how to manage state and unpredictable interaction times within your inference pipelines.
- This year: Explore opportunities to standardize your AI model and hardware stack by leveraging open-source inference solutions. This reduces vendor lock-in and increases flexibility.
- Ongoing investment: Foster a culture of systems thinking within your AI teams. Encourage engineers to look beyond immediate model performance to the downstream consequences of inference infrastructure choices.
- This pays off in 12-18 months: Prioritize building or adopting inference solutions that are designed for diversity, supporting multiple model architectures and hardware types. This future-proofs your AI deployments.
- Requires patience, creates advantage: Actively participate in or monitor the development of open-source inference projects. Contributing to or adopting these projects early can provide deep insights and early access to critical advancements, creating a significant competitive moat.