Cost Per Token Drives AI Value Beyond Compute Metrics
The true cost of AI is measured not in FLOPS but in tokens. In this conversation, Shruti Koparkar from NVIDIA argues that optimizing for cost per token, a metric that directly reflects business value, is the critical differentiator for AI success. Focusing solely on input metrics like FLOPS per dollar blinds businesses to the downstream efficiencies and new possibilities unlocked by architectural advancements like NVIDIA Blackwell. The analysis matters for any business leader navigating the evolving AI landscape, particularly those building AI-native products or enhancing existing operations, because it shows how infrastructure choices translate into competitive advantage and profitability. Understanding the four pillars of tokenomics (utility, supply, demand, and monetization) is no longer optional; it is the blueprint for leveraging AI as a true industrial revolution engine.
The Hidden Price of "Cheap" Compute
The prevailing wisdom in AI infrastructure often fixates on input metrics: FLOPS per dollar, cost per GPU hour. These are tangible, easily quantifiable numbers that speak to the immediate cost of hardware. However, Shruti Koparkar argues this is a fundamental misdirection, akin to measuring the value of a factory by the cost of its raw materials rather than the price of its finished goods. The true economic engine of AI, she explains, is the token, and the critical metric for evaluating infrastructure is cost per token. This isn't just a semantic shift; it's a complete reorientation of how businesses should think about their AI investments.
When businesses chase metrics like FLOPS per dollar, they overlook the interplay of factors that determines actual output. NVIDIA Blackwell, for instance, offers a 2x advantage in FLOPS per dollar over its predecessor, Hopper. On the surface, this looks like a modest, straightforward improvement. But Koparkar points to the output side: by her figures, Blackwell delivers 50x more tokens per watt than Hopper and a 35x reduction in token cost. A gap that large does not come from incremental hardware upgrades alone; it is the result of "extreme co-design," in which compute, memory, storage, networking, and software are optimized together from the ground up, all with the singular goal of minimizing token cost.
"The metric that represents both your input but also the output is cost per token. It's a very simple metric that tells you what is the cost that you are paying for generating one token. It's essentially the cost of the GPU divided by how many tokens does the GPU produce. So, in a way, it incorporates both the input and the output and gives you a sense of your true ROI from the AI infrastructure."
-- Shruti Koparkar
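The definition in the quote can be written out as a few lines of arithmetic. The sketch below is illustrative only: the function name, the hourly prices, and the throughput figures are hypothetical and are not taken from the conversation or from any NVIDIA specification.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost of the GPU divided by the tokens it produces, scaled to one million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: the newer system costs twice as much per hour
# but sustains ten times the token throughput.
older = cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=500)
newer = cost_per_million_tokens(gpu_cost_per_hour=4.0, tokens_per_second=5000)

print(f"older: ${older:.2f} per million tokens")
print(f"newer: ${newer:.2f} per million tokens")
print(f"cost-per-token advantage: {older / newer:.1f}x")
```

With these made-up numbers, the system that looks more expensive on a cost-per-GPU-hour basis is the cheaper one per token, which is the point of judging infrastructure by output rather than input.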
This emphasis on co-design extends beyond the silicon. It encompasses the entire software stack, from CUDA kernels to inference runtimes like TensorRT and vLLM. Software, Koparkar stresses, is the bridge between spec-sheet numbers and real-world performance. Optimizations like quantization, speculative decoding, and disaggregated serving, when stacked together, unlock the full potential of the hardware, driving throughput and drastically lowering token costs. The rapid evolution of open-source software, with runtimes like vLLM and SGLang seeing 8x performance gains in just six months, further amplifies this advantage, demonstrating how a collaborative ecosystem continuously drives down the cost of intelligence.
The Unseen Demand Multipliers
Understanding token demand takes more than counting users and requests. Koparkar outlines several "multipliers" that dramatically inflate actual token requirements, often catching organizations off guard. The first is the use of reasoning models, which consume "thinking tokens" that the end user never sees. The number of thinking tokens can be capped with a threshold, but estimating both peak and average usage of these internal computation tokens remains crucial.
Even more significant is the impact of agentic applications. Unlike conversational AI, where a user exchanges turns with the AI, agentic workflows involve the AI itself taking multiple turns with other AIs, specialized agents, or software tools. This sequence of calls, orchestrated to fulfill a single user prompt, can multiply the number of times a large language model is invoked and, with it, the token demand. The Vera Rubin platform, designed for agentic AI, exemplifies the need for extreme co-design to handle this complexity, optimizing LLM reasoning, tool calling, and memory management (KV cache offloading) to minimize latency and cost in these multi-turn interactions.
Another critical factor is the KV cache hit rate. The KV cache acts as a model's short-term memory: it stores the results of previously processed inputs, so a cache hit avoids recomputation and speeds up the response. While seemingly a technical detail, efficient KV cache management, such as offloading the cache when it is not immediately needed, directly affects token generation speed and cost. Ignoring these demand multipliers leads to under-provisioning and inflated costs, a common pitfall for businesses focused solely on initial hardware acquisition.
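The three multipliers above can be combined into a rough demand estimate. This is a minimal sketch under simplifying assumptions that are not from the conversation: every LLM call in a request is treated as the same size, and a KV cache hit is modeled as simply skipping recomputation of that fraction of input tokens. The `Workload` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_day: float
    input_tokens: float           # prompt/context tokens per LLM call
    visible_output_tokens: float  # tokens the end user sees, per call
    thinking_tokens: float        # hidden reasoning tokens, per call
    llm_calls_per_request: float  # 1 for simple chat, more for agentic flows
    kv_cache_hit_rate: float      # fraction of input tokens served from cache


def daily_token_demand(w: Workload) -> dict:
    """Estimate daily LLM calls, input tokens to compute, and tokens to generate."""
    if not 0.0 <= w.kv_cache_hit_rate <= 1.0:
        raise ValueError("kv_cache_hit_rate must be between 0 and 1")
    calls = w.requests_per_day * w.llm_calls_per_request
    # Only cache misses need to be recomputed during prefill.
    prefill = calls * w.input_tokens * (1.0 - w.kv_cache_hit_rate)
    # Thinking tokens are generated (and paid for) even though users never see them.
    generated = calls * (w.visible_output_tokens + w.thinking_tokens)
    return {"calls": calls, "prefill_tokens": prefill, "generated_tokens": generated}


# Same user traffic, two very different token bills (hypothetical numbers).
chat = Workload(10_000, 1_000, 300, 0, 1, 0.0)
agent = Workload(10_000, 1_000, 300, 700, 8, 0.5)

chat_demand = daily_token_demand(chat)
agent_demand = daily_token_demand(agent)
print(chat_demand)
print(agent_demand)
```

Counting only users and requests would treat these two workloads as identical; the estimate shows why the agentic one needs far more capacity even with half of its input served from cache.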
The Paradox of Efficiency: More Tokens, More Demand
The pursuit of lower token costs, driven by architectural innovation and software optimization, leads to an inevitable consequence: the Jevons paradox. This economic principle suggests that increasing the efficiency of resource use tends to increase, rather than decrease, the consumption of that resource. In the context of AI, as the cost per token plummets, new use cases emerge, and existing ones become more sophisticated, leading to an overall increase in GPU demand.
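One way to see when the paradox applies is a constant-elasticity demand curve. This model is an assumption introduced here for illustration, not something from the conversation, and the elasticity values and prices are hypothetical. Total spend is used as a stand-in for infrastructure demand, which holds only if the price drop comes purely from efficiency gains.

```python
def demand_after_price_change(base_demand: float, base_price: float,
                              new_price: float, elasticity: float) -> float:
    """Constant-elasticity demand: Q = Q0 * (P / P0) ** (-elasticity)."""
    if base_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    return base_demand * (new_price / base_price) ** (-elasticity)


def total_spend(million_tokens: float, price_per_million: float) -> float:
    return million_tokens * price_per_million


base_demand = 1_000.0   # million tokens per day (hypothetical)
base_price = 10.0       # dollars per million tokens (hypothetical)
new_price = 1.0         # a 10x drop in cost per token

spend_before = total_spend(base_demand, base_price)

# Elasticity above 1: demand grows faster than price falls (Jevons regime).
elastic_demand = demand_after_price_change(base_demand, base_price, new_price, 1.5)
spend_elastic = total_spend(elastic_demand, new_price)

# Elasticity below 1: demand grows, but total spend still shrinks.
inelastic_demand = demand_after_price_change(base_demand, base_price, new_price, 0.5)
spend_inelastic = total_spend(inelastic_demand, new_price)

print(f"before: {spend_before:,.0f}")
print(f"elasticity 1.5: {spend_elastic:,.0f}")
print(f"elasticity 0.5: {spend_inelastic:,.0f}")
```

The Jevons outcome described above corresponds to the first case: cheaper tokens unlock enough new usage that total consumption rises rather than falls.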
Koparkar illustrates this with historical trends. The advent of generative AI initially focused on summaries and images. As token costs decreased, researchers discovered the value of test-time scaling and reasoning, leading to better, more accurate responses and, consequently, more GPU demand. Now, with the rise of agentic AI and efficient deployment of complex models, we are seeing another inflection point. The ability to execute intricate, multi-turn workflows at a lower token cost is unlocking new applications and driving further demand for AI infrastructure. This cyclical pattern underscores a fundamental truth: AI is not a solved problem with a fixed demand; it's a continually expanding frontier where efficiency breeds further innovation and consumption.
"People aren't going to run away from intelligence. They want to use it."
-- Shruti Koparkar
This phenomenon has significant implications for long-term infrastructure planning. Businesses that successfully reduce their cost per token are not setting themselves up for reduced hardware needs; they are positioning themselves to be at the forefront of the next wave of AI innovation, capturing value from applications that were previously economically unfeasible.
Monetizing Intelligence: Beyond Selling Tokens
While selling tokens directly is a viable business model, Koparkar outlines a broader framework for turning AI output into business value. The four primary models are:
- Selling Tokens Directly: Companies like Fireworks AI and Together AI offer direct access to AI-generated tokens, enabling their customers to build services on top of this foundation.
- AI-Native Companies: These businesses, such as Perplexity and Cursor, build their entire product offering around AI from the ground up, embedding intelligence into every facet of their user experience.
- Enhancing Existing Products: Established companies like Shopify, Airbnb, and Adobe integrate AI to improve their current offerings, adding new capabilities or streamlining existing features. Adobe's Firefly models, for instance, are infused into Photoshop.
- Improving Internal Operations: Nearly every organization is leveraging AI to boost internal productivity, optimize processes, and enhance employee workflows, even if these applications are not customer-facing.
Successfully monetizing AI requires a strategic approach to pricing. This involves understanding the cost to produce a token, which NVIDIA's co-design efforts aim to minimize, and balancing it against value-based pricing, which reflects what customers are willing to pay for the intelligence and interactivity a token provides. Analyzing the demand distribution (how much demand is concentrated in high-utility tokens versus more basic ones) is also crucial for setting profitable price points and meeting revenue goals. Ultimately, the most effective strategy starts with the customer need and works backward through use case definition and infrastructure requirements to the monetization model:
- Define the customer need: Start by identifying the core problem or opportunity, whether for external customers or internal processes.
- Map use cases to token utility: Determine the required intelligence (model complexity, context length) and interactivity (tokens per second) for each use case.
- Select infrastructure based on cost per token: Evaluate hardware and software solutions not just on input metrics but on their ability to deliver tokens at the lowest possible cost.
- Estimate token demand with multipliers: Account for reasoning models, agentic workflows, and KV cache efficiency to accurately forecast token consumption.
- Develop a monetization strategy: Determine pricing based on production cost, value delivered, and demand distribution, considering direct token sales, AI-native products, enhancements, or internal optimizations.
- Iterate and optimize: Continuously refine models, software, and infrastructure as new use cases emerge and the AI landscape evolves.
- Embrace the Jevons paradox: Recognize that increased efficiency will likely lead to greater demand, positioning your organization to capitalize on this expansion.
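The pricing step in the list above, which weighs production cost, value delivered, and demand distribution, can be sketched as a simple margin calculation over demand tiers. The tier names, volumes, prices, and costs below are hypothetical and serve only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    million_tokens: float     # monthly demand in this tier
    price_per_million: float  # value-based price charged
    cost_per_million: float   # production cost on the chosen infrastructure


def margin_summary(tiers: list) -> dict:
    """Aggregate revenue, cost, and gross margin across demand tiers."""
    revenue = sum(t.million_tokens * t.price_per_million for t in tiers)
    cost = sum(t.million_tokens * t.cost_per_million for t in tiers)
    margin = (revenue - cost) / revenue if revenue else 0.0
    return {"revenue": revenue, "cost": cost, "gross_margin": margin}


# Hypothetical distribution: most volume is basic, most revenue is premium.
tiers = [
    Tier("basic", million_tokens=900, price_per_million=0.50, cost_per_million=0.30),
    Tier("premium", million_tokens=100, price_per_million=6.00, cost_per_million=2.00),
]

summary = margin_summary(tiers)
print(f"revenue: ${summary['revenue']:,.2f}")
print(f"cost: ${summary['cost']:,.2f}")
print(f"gross margin: {summary['gross_margin']:.1%}")
```

In this made-up case the premium tier carries a tenth of the volume but more than half of the revenue, which is why the shape of the demand distribution matters as much as the average cost per token.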